<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Convolutional Vision Transformer (CvT)

## Overview

The CvT model was proposed in [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan and Lei Zhang. The Convolutional vision Transformer (CvT) improves the [Vision Transformer (ViT)](vit) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
The abstract from the paper is the following:

*We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.*
Tips:

- CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1k and CIFAR-100.
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]); a minimal inference sketch is shown after this list.
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
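
The snippet below is a minimal inference sketch following those tips. It assumes the `microsoft/cvt-13` checkpoint (fine-tuned on ImageNet-1k) is available on the Hub; the image path is hypothetical:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, CvtForImageClassification

# assumption: "microsoft/cvt-13" is an ImageNet-1k fine-tuned checkpoint on the Hub
image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

image = Image.open("cats.jpg")  # hypothetical local image path
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# the classification head predicts one of the 1,000 ImageNet-1k classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```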
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT).
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CvT.

<PipelineTag pipeline="image-classification"/>

- [`CvtForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## CvtConfig

[[autodoc]] CvtConfig
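
As a sketch, a randomly initialized model can be built from a default configuration (the defaults are documented to match the `microsoft/cvt-13` architecture):

```python
from transformers import CvtConfig, CvtModel

# initializing a CvT configuration with default values
# (assumed to correspond to the microsoft/cvt-13 architecture)
configuration = CvtConfig()

# initializing a randomly weighted model from that configuration
model = CvtModel(configuration)

# accessing the model configuration
configuration = model.config
```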
## CvtModel

[[autodoc]] CvtModel
    - forward
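
A short feature-extraction sketch for the bare model; the output shape in the comment is an assumption for `microsoft/cvt-13` at 224x224 resolution, since CvT's final stage returns a spatial feature map rather than a flat token sequence:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, CvtModel

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtModel.from_pretrained("microsoft/cvt-13")

image = Image.open("cats.jpg")  # hypothetical local image path
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state is a spatial feature map of shape
# (batch_size, num_channels, height, width), expected to be (1, 384, 14, 14) for cvt-13
print(outputs.last_hidden_state.shape)
```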
## CvtForImageClassification

[[autodoc]] CvtForImageClassification
    - forward
## TFCvtModel

[[autodoc]] TFCvtModel
    - call
## TFCvtForImageClassification

[[autodoc]] TFCvtForImageClassification
    - call
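
The TensorFlow classes mirror the PyTorch API. A minimal sketch, assuming TensorFlow weights for `microsoft/cvt-13` can be loaded directly (if the checkpoint only ships PyTorch weights, passing `from_pt=True` to `from_pretrained` may be needed):

```python
import tensorflow as tf
from PIL import Image
from transformers import AutoImageProcessor, TFCvtForImageClassification

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = TFCvtForImageClassification.from_pretrained("microsoft/cvt-13")

image = Image.open("cats.jpg")  # hypothetical local image path
inputs = image_processor(images=image, return_tensors="tf")

logits = model(**inputs).logits
predicted_label = int(tf.math.argmax(logits, axis=-1)[0])
print(model.config.id2label[predicted_label])
```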