Model Card for DINOv3 ViT-7B/16 (FP16 Quantized)

This is a quantized (FP16) version of dinov3-vit7b16-pretrain-lvd1689m.

DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.

Model Details

This is a Vision Transformer ViT-7B/16 model trained following the method described in the DINOv3 paper and quantized to FP16 precision for reduced memory footprint and faster inference.

Quantization

Original precision: FP32
Quantized precision: FP16
Benefits: ~50% reduction in model size and memory usage, faster inference on compatible hardware
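As a rough check on the ~50% figure, the arithmetic can be sketched in a few lines. The parameter count is the 6716M quoted under Technical Specifications; the helper name `weight_size_gib` is illustrative, and the estimate covers raw weights only (no activations, buffers, or file-format overhead).

```python
# Back-of-the-envelope weight storage for the parameter count quoted in this card.
NUM_PARAMS = 6_716_000_000  # "6716M parameters" from the Technical Specifications section
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_gib(num_params: int, dtype: str) -> float:
    """Return raw weight storage in GiB (weights only, an approximation)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


fp32_gib = weight_size_gib(NUM_PARAMS, "fp32")
fp16_gib = weight_size_gib(NUM_PARAMS, "fp16")
saving = 1 - fp16_gib / fp32_gib
print(f"FP32: {fp32_gib:.1f} GiB, FP16: {fp16_gib:.1f} GiB, saving: {saving:.0%}")
```

Under these assumptions the FP16 weights need roughly 12.5 GiB instead of roughly 25 GiB, which is where the halving comes from.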

Model Description

Developed by: Meta AI (original model)
Model type: Vision Transformer (ViT-7B/16)
Original Model: dinov3-vit7b16-pretrain-lvd1689m
License: DINOv3 License

Model Sources

Repository: https://github.com/facebookresearch/dinov3
Paper: https://arxiv.org/abs/2508.10104

Uses

The model is a vision backbone providing multi-purpose features for downstream tasks.

Direct Use

The model can be used without fine-tuning, with downstream classifiers as simple as linear layers, to obtain competitive results:

on image classification, using k-NN classifiers on the class token
on image classification, with logistic regression classifiers applied on the class token
on image classification, with a linear layer applied on the class token and the average of the patch tokens
on image retrieval using nearest neighbors
on geometric and semantic 3D keypoint correspondences
on depth estimation and semantic segmentation, using linear layers
on unsupervised object discovery
on video segmentation tracking
on video classification, using a small 4-layer attentive probe
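To illustrate the k-NN and nearest-neighbor retrieval use cases listed above, here is a minimal sketch of a cosine-similarity k-NN classifier over class-token features. It uses NumPy and small synthetic features as stand-ins for real embeddings; the function names and the plain majority vote are assumptions for illustration, not the evaluation protocol from the paper.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each feature vector to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Majority vote over the k most cosine-similar training features."""
    sims = l2_normalize(query_feats) @ l2_normalize(train_feats).T  # (Q, N)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    votes = train_labels[topk]  # (Q, k) labels of those neighbors
    return np.array([np.bincount(v).argmax() for v in votes])


# Synthetic stand-in for class-token features: two well-separated clusters in 8 dims.
rng = np.random.default_rng(0)
centers = np.array([[5.0] * 4 + [0.0] * 4, [0.0] * 4 + [5.0] * 4])
train_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + rng.normal(scale=0.5, size=(40, 8))
query_feats = centers + rng.normal(scale=0.5, size=(2, 8))

predictions = knn_predict(train_feats, train_labels, query_feats)
print(predictions)
```

With real data, `train_feats` and `query_feats` would be class-token embeddings produced by the frozen backbone; the same `topk` indices also serve as a retrieval ranking.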

Downstream Use

While fine-tuning the model can yield some gains, it is recommended to keep this option as a last resort: the frozen features are expected to provide good performance out-of-the-box.
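A frozen-feature workflow could look like the sketch below: features are computed once, and only a single linear layer is trained on the class token concatenated with the average of the patch tokens. NumPy and synthetic features stand in for real model outputs here; the helper names, learning rate, and step count are hypothetical choices, not values from the paper.

```python
import numpy as np


def probe_inputs(cls_token, patch_tokens):
    """Concatenate class token and mean patch token: (B, D), (B, N, D) -> (B, 2D)."""
    return np.concatenate([cls_token, patch_tokens.mean(axis=1)], axis=-1)


def train_linear_probe(x, y, num_classes, lr=0.1, steps=200):
    """Fit one softmax linear layer with gradient descent; the features stay fixed."""
    w = np.zeros((x.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(x)  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


# Synthetic stand-in for frozen features (D=8, N=196 patches); labels shift the class token.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 32)
cls_token = rng.normal(size=(64, 8)) + 4.0 * (labels - 0.5)[:, None]
patch_tokens = rng.normal(size=(64, 196, 8))

x = probe_inputs(cls_token, patch_tokens)
w, b = train_linear_probe(x, labels, num_classes=2)
accuracy = ((x @ w + b).argmax(axis=1) == labels).mean()
print(x.shape, accuracy)
```

Because the backbone is never updated, the features can be cached to disk and reused across several probes.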

Bias, Risks, and Limitations

Compared to DINOv2 and SEERv2, DINOv3 delivers somewhat consistent performance across income categories on geographical fairness and diversity, although performance still drops notably from the highest-income bucket to the low-income bucket.

DINOv3 also achieves relatively good scores across different regions, improving over its predecessor DINOv2. However, a relative difference is still observed between Europe and Africa.

Recommendations

Fine-tuning is expected to increase the biases in the features produced by the model as they will be tuned to the fine-tuning labels.

How to Get Started with the Model

The example below demonstrates how to obtain an image embedding with the AutoModel class.

Note: For FP16 models, ensure you load the model with torch_dtype=torch.float16 for optimal performance.

import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

pretrained_model_name = "mirekphd/dinov3-vit7b16-pretrain-lvd1689m-fp16"
processor = AutoImageProcessor.from_pretrained(pretrained_model_name)
model = AutoModel.from_pretrained(
    pretrained_model_name,
    torch_dtype=torch.float16,  # Important: Load as FP16
    device_map="auto",
)

inputs = processor(images=image, return_tensors="pt").to(model.device, dtype=torch.float16)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
print("Pooled output shape:", pooled_output.shape)
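Beyond the pooled output, dense features can be taken from the per-token hidden states. The sketch below assumes the hidden states are ordered as class token, then register tokens, then patch tokens, matching the token layout given under Technical Specifications; `split_tokens` is a hypothetical helper, and a NumPy array stands in for `outputs.last_hidden_state` so the snippet runs without loading the model.

```python
import numpy as np


def split_tokens(last_hidden_state, num_register_tokens=4):
    """Split (B, 1 + R + N, D) hidden states into class, register and patch tokens.

    Assumes the order [class, registers, patches]; works on any array type
    that supports NumPy-style slicing, including torch tensors.
    """
    cls_token = last_hidden_state[:, 0]
    register_tokens = last_hidden_state[:, 1 : 1 + num_register_tokens]
    patch_tokens = last_hidden_state[:, 1 + num_register_tokens :]
    return cls_token, register_tokens, patch_tokens


# Stand-in with the 224x224 layout from this card: 1 + 4 + 196 = 201 tokens.
# The hidden size is shrunk from 4096 to 16 to keep the demo light.
dummy = np.zeros((1, 201, 16), dtype=np.float16)
cls_token, register_tokens, patch_tokens = split_tokens(dummy)
print(cls_token.shape, register_tokens.shape, patch_tokens.shape)
```

With the real model, the same call would be `split_tokens(outputs.last_hidden_state)`; the patch tokens can then be reshaped to a 2D grid for dense tasks such as segmentation.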

Training Details

Training Data

Web dataset (LVD-1689M): a curated dataset of 1,689 million images extracted from a large data pool of 17 billion web images collected from public posts on Instagram

Training Procedure

Training objective:

DINO self-distillation loss with multi-crop
iBOT masked-image modeling loss
KoLeo regularization on [CLS] tokens
Gram anchoring
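To give a feel for one of these terms, the sketch below shows a common formulation of the KoLeo regularizer: the negative mean log-distance from each L2-normalized feature to its nearest neighbor in the batch, which rewards features that spread out. This is an illustrative NumPy version under that assumption; whether the DINOv3 training code uses exactly this form is not stated in this card.

```python
import numpy as np


def koleo_loss(features, eps=1e-8):
    """Negative mean log-distance of each normalized feature to its nearest batch neighbor."""
    x = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # (B, B) pairwise distances
    np.fill_diagonal(dists, np.inf)  # a feature is not its own neighbor
    nearest = dists.min(axis=1)
    return float(-np.log(nearest + eps).mean())


spread = np.eye(4)  # four mutually orthogonal features
clumped = np.ones((4, 4)) + 0.01 * np.eye(4)  # four nearly identical features

koleo_spread = koleo_loss(spread)
koleo_clumped = koleo_loss(clumped)
print(koleo_spread, koleo_clumped)
```

The clumped batch receives the higher loss, so minimizing this term pushes [CLS] features apart within a batch.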

Training regime: PyTorch FSDP2 (with bf16 and fp8 matrix multiplications)

Evaluation

Results

The reader is referred to the associated paper for details on the evaluation protocols.

Results for ViT backbones pretrained (or distilled) on web (LVD-1689M)

Note: The evaluation results below were obtained for the original FP32 models and may differ for the quantized FP16 versions.

Model	IN-ReaL	IN-R	Obj.Net	Ox.-H	ADE20k	NYU↓	DAVIS	NAVI	SPair
DINOv3 ViT-S/16	87.0	60.4	50.9	49.5	47.0	0.403	72.7	56.3	50.4
DINOv3 ViT-S+/16	88.0	68.8	54.6	50.0	48.8	0.399	75.5	57.1	55.2
DINOv3 ViT-B/16	89.3	76.7	64.1	58.5	51.8	0.373	77.2	58.8	57.2
DINOv3 ViT-L/16	90.2	88.1	74.8	63.1	54.9	0.352	79.9	62.3	61.3
DINOv3 ViT-H+/16	90.3	90.0	78.6	64.5	54.8	0.352	79.3	63.3	56.3
DINOv3 ViT-7B/16	90.4	91.1	91.1	72.8	55.9	0.309	79.7	64.4	58.7

Results for ConvNeXt backbones distilled on web (LVD-1689M)

Note: The evaluation results below were obtained for the original FP32 models and may differ for the quantized FP16 versions.

Model	IN-ReaL @256px	IN-ReaL @512px	IN-R @256px	IN-R @512px	Obj.Net @256px	Obj.Net @512px	ADE20k	NYU↓
DINOv3 ConvNeXt Tiny	86.6	87.7	73.7	74.1	52.6	58.7	42.7	0.448
DINOv3 ConvNeXt Small	87.9	88.7	73.7	74.1	52.6	58.7	44.8	0.432
DINOv3 ConvNeXt Base	88.5	89.2	77.2	78.2	56.2	61.3	46.3	0.420
DINOv3 ConvNeXt Large	88.9	89.4	81.3	82.4	59.3	65.2	47.8	0.403

Environmental Impact

Hardware Type: Nvidia H100
Hours used: 61,440 hours for ViT-7B model training
Cloud Provider: Private infrastructure
Compute Region: USA
Carbon Emitted: 18t CO2eq

Technical Specifications

Model Architecture and Objective

ViT-7B (6716M parameters):

Patch size: 16
Embedding dimension: 4096
Register tokens: 4
Heads: 32
FFN: SwiGLU
Position encoding: RoPE

For a 224x224 image, this results in 1 class token + 4 register tokens + 196 patch tokens = 201 tokens.

The model can accept larger images provided the image height and width are multiples of the patch size (16). If this condition is not met, the model will crop to the closest smaller multiple of the patch size.
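The token count and the cropping rule can be worked through with a little arithmetic. The sketch below uses the patch size and register count listed above; the function names are illustrative, and it describes tensors as they reach the model (an image processor may resize the image beforehand).

```python
PATCH_SIZE = 16
NUM_REGISTER_TOKENS = 4


def cropped_size(height, width, patch_size=PATCH_SIZE):
    """Round each side down to the closest multiple of the patch size."""
    return (height // patch_size) * patch_size, (width // patch_size) * patch_size


def num_tokens(height, width, patch_size=PATCH_SIZE, num_registers=NUM_REGISTER_TOKENS):
    """1 class token + register tokens + one token per patch after cropping."""
    h, w = cropped_size(height, width, patch_size)
    return 1 + num_registers + (h // patch_size) * (w // patch_size)


print(num_tokens(224, 224))    # 1 + 4 + 14 * 14 = 201
print(cropped_size(500, 375))  # (496, 368)
print(num_tokens(500, 375))    # 1 + 4 + 31 * 23 = 718
```

Since the sequence length grows with the number of patches, doubling both image sides roughly quadruples the token count.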

Compute Infrastructure

Hardware

Nvidia H100 GPUs

Software

PyTorch 2.7

More Information

See the blog post and the associated website.

Citation

BibTeX

@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}
