For example, in the above diagram, to return the feature map from the first stage of the Swin backbone, you can set `out_indices=(1,)`:

```py
from transformers import AutoImageProcessor, AutoBackbone
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_indices=(1,))

inputs = processor(image, return_tensors="pt")
outputs = model(**inputs)
feature_maps = outputs.feature_maps
```

Now you can access the `feature_maps` object from the first stage of the backbone:

```py
list(feature_maps[0].shape)
# [1, 96, 56, 56]
```

## AutoFeatureExtractor

For audio tasks, a feature extractor processes the audio signal into the correct input format.
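As a minimal sketch of what that looks like in practice (the checkpoint name and the dummy waveform below are illustrative assumptions, not part of the original example), you can load a feature extractor with `AutoFeatureExtractor.from_pretrained` and apply it to raw audio:

```py
from transformers import AutoFeatureExtractor
import numpy as np

# Hypothetical checkpoint chosen for illustration; any audio model that ships a
# feature extractor configuration works the same way.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Dummy one-second waveform sampled at 16 kHz, standing in for a real recording.
waveform = np.zeros(16000, dtype=np.float32)

# The feature extractor normalizes/pads the signal and returns model-ready tensors.
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs.input_values.shape)  # expected: torch.Size([1, 16000]) for this checkpoint
```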