RETFound ViT-L/16 (MAE → Transformers) — natureOCT
Author of this fork: Dávid Isztl
Upstream project: RETFound_mae_natureOCT by Yukun Zhou et al.
Paper: A foundation model for generalizable disease detection from retinal images, Nature (2023)
This repository provides a Transformers-compatible export of the RETFound MAE encoder trained on a subset of natureOCT (OCT).
It includes `config.json`, `model.safetensors`, and an `AutoImageProcessor` configuration, so you can load it directly with 🤗 `AutoModel`/`AutoModelForImageClassification`.
Model Details
Model Description
This is a ViT-Large/16 encoder pretrained with the Masked Autoencoder (MAE) objective on Optical Coherence Tomography (OCT).
This fork converts the original PyTorch .pth checkpoint into a standard 🤗 Transformers format and removes MAE-only components.
- Developed by (upstream): Yukun Zhou et al.
- Shared by (this fork): Dávid Isztl
- Model type: Vision Transformer (encoder only)
- License: CC BY-NC 4.0 (inherited from upstream)
- Finetuned from: Upstream RETFound MAE checkpoint (ViT-L/16)
Architecture (ViT-L/16 @ 224):
- `hidden_size=1024`, `num_hidden_layers=24`, `num_attention_heads=16`, `patch_size=16`, `image_size=224`
- `add_pooling_layer=False` (use the CLS token or your own pooling)
Conversion notes:
- Dropped MAE-only tensors: `mask_token`, `decoder_*`
- Remapped fused qkv weights (timm-style) → separate Q/K/V matrices (Transformers style); a sketch of this remap follows the list
- Set `layer_norm_eps=1e-6` to match timm numerics
- Positional embeddings sized for 224×224 inputs (patch size 16×16)
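For reference, the fused-qkv remap roughly follows the pattern below. This is an illustrative sketch, not the exact conversion script; the timm key names and the Transformers-side `encoder.layer.{i}.attention.attention.*` names are assumptions based on the standard ViT implementations.

```python
import torch

def split_qkv(timm_sd: dict, num_layers: int = 24, hidden: int = 1024) -> dict:
    """Split timm-style fused qkv projections into separate Q/K/V tensors
    under Transformers-style ViT parameter names (illustrative only)."""
    hf_sd = {}
    for i in range(num_layers):
        qkv_w = timm_sd[f"blocks.{i}.attn.qkv.weight"]  # [3 * hidden, hidden]
        qkv_b = timm_sd[f"blocks.{i}.attn.qkv.bias"]    # [3 * hidden]
        q_w, k_w, v_w = qkv_w.split(hidden, dim=0)
        q_b, k_b, v_b = qkv_b.split(hidden, dim=0)
        prefix = f"encoder.layer.{i}.attention.attention"
        hf_sd[f"{prefix}.query.weight"], hf_sd[f"{prefix}.query.bias"] = q_w, q_b
        hf_sd[f"{prefix}.key.weight"], hf_sd[f"{prefix}.key.bias"] = k_w, k_b
        hf_sd[f"{prefix}.value.weight"], hf_sd[f"{prefix}.value.bias"] = v_w, v_b
    return hf_sd
```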
Model Sources
- Repository (upstream): https://github.com/rmaphoh/RETFound
- Paper: https://www.nature.com/articles/s41586-023-06555-x
Uses
Direct Use
- Feature extraction from retinal images for downstream tasks
- Initial encoder for transfer learning on medical imaging research tasks (e.g., classification, retrieval)
Downstream Use
- Fine-tuning for image classification and related tasks using `AutoModelForImageClassification`
- Using the CLS token or pooled features in custom pipelines
Out-of-Scope Use
- Clinical decision-making without proper validation and regulatory approval
- Commercial use beyond the CC BY-NC 4.0 license terms
Bias, Risks, and Limitations
- Trained on specific retinal data (subset of natureOCT); distribution shifts (device, population, protocol) can degrade performance.
- Not a medical device; requires independent validation before any real-world or clinical deployment.
- Potential biases relate to dataset composition, imaging hardware, and labeling procedures.
Recommendations
- Perform task- and population-specific validation.
- Monitor for domain shift; consider domain adaptation where appropriate.
- Document preprocessing and augmentation pipelines for reproducibility.
How to Get Started with the Model
Feature extraction (encoder)
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

repo = "iszt/RETFound_mae_natureOCT"  # this fork

processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)  # ViTModel with add_pooling_layer=False
model.eval()

img = Image.open("example_oct.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

cls = out.last_hidden_state[:, 0]         # [B, 1024] — CLS embedding after final norm
tokens = out.last_hidden_state[:, 1:, :]  # [B, N, 1024] — patch tokens
```
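If you prefer pooled features over the CLS token (see Downstream Use), mean-pooling the patch tokens is a common alternative; a minimal continuation of the example above:

```python
# Mean-pool the patch tokens (excluding CLS) into a single [B, 1024] embedding
pooled = out.last_hidden_state[:, 1:, :].mean(dim=1)
```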
Classification fine-tune (use AutoModelForImageClassification)
```python
from transformers import AutoConfig, AutoImageProcessor, AutoModelForImageClassification

repo = "iszt/RETFound_mae_natureOCT"

id2label = {0: "negative", 1: "positive"}  # example
label2id = {v: k for k, v in id2label.items()}

processor = AutoImageProcessor.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)
config.num_labels = len(id2label)
config.id2label = id2label
config.label2id = label2id

# Loads encoder weights from the repo and initializes a fresh classifier head
model = AutoModelForImageClassification.from_pretrained(
    repo,
    config=config,
    ignore_mismatched_sizes=True,  # replaces the classification head if shapes differ
)

# now train `model` with your dataloader/Trainer
```
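Below is a minimal training sketch with the 🤗 `Trainer` API, continuing from the snippet above. The dataset variables (`train_ds`, `val_ds`), column names, and hyperparameters are placeholders; adapt them to your data.

```python
import torch
from transformers import Trainer, TrainingArguments

def collate_fn(batch):
    # Assumes each dataset item is a dict with a PIL "image" and an integer "label"
    images = [item["image"].convert("RGB") for item in batch]
    labels = torch.tensor([item["label"] for item in batch])
    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
    return {"pixel_values": pixel_values, "labels": labels}

args = TrainingArguments(
    output_dir="retfound-oct-finetune",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
    remove_unused_columns=False,  # keep the raw "image" column for the collator
)

trainer = Trainer(
    model=model,              # from the snippet above
    args=args,
    train_dataset=train_ds,   # placeholder: your training split
    eval_dataset=val_ds,      # placeholder: your validation split
    data_collator=collate_fn,
)
trainer.train()
```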
Training Details
Training Data
- Upstream pretraining: OCT from a portion of natureOCT.
Training Procedure
- Objective: Masked Autoencoder (MAE) pretraining.
- This fork: no additional training; checkpoint conversion only.
Preprocessing
An `AutoImageProcessor` configuration is provided for 224×224 inputs. If your dataset uses different normalization or resolution, adjust accordingly (and, if needed, interpolate the positional embeddings; see the sketch below).
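For higher-resolution inputs, recent 🤗 Transformers releases can resize the pretrained 224×224 position grid on the fly via `interpolate_pos_encoding`. A minimal sketch, assuming 384×384 inputs and a placeholder file name (verify the argument is available in your installed version):

```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch

repo = "iszt/RETFound_mae_natureOCT"
# Assumption: 384×384 inputs instead of the 224×224 pretraining resolution
processor = AutoImageProcessor.from_pretrained(repo, size={"height": 384, "width": 384})
model = AutoModel.from_pretrained(repo)
model.eval()

img = Image.open("example_oct.jpg").convert("RGB")  # placeholder file name
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    # Interpolates the position embeddings to match the larger token grid
    out = model(**inputs, interpolate_pos_encoding=True)
```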
Training Hyperparameters
- Not specified by upstream for this exact subset; see the paper and repository for general MAE settings.
Speeds, Sizes, Times
- This fork only performs conversion; refer to upstream for compute details.
Evaluation
Testing Data, Factors & Metrics
- No new evaluation performed in this fork.
- For downstream tasks, report metrics relevant to the task (e.g., AUROC, accuracy, F1), and stratify by pertinent factors (device, demographics, pathology prevalence).
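For instance, with scikit-learn (a minimal sketch; `y_true` and `y_prob` are placeholder arrays standing in for your held-out test labels and predicted positive-class probabilities):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Placeholder values; replace with held-out test labels and model probabilities
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

print("AUROC:", roc_auc_score(y_true, y_prob))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```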
Results
- N/A for this fork; please cite/consult upstream results for baseline pretraining performance.
Summary
- Use this encoder as initialization; measure and report results on your target dataset.
Environmental Impact
This repository performs a format conversion only. Upstream pretraining compute and emissions are described in the paper and may be estimated via tools like the ML CO2 calculator.
- Hardware Type: N/A (conversion only)
- Hours used: N/A (conversion only)
- Cloud Provider / Region: N/A
- Carbon Emitted: N/A
Technical Specifications
Model Architecture and Objective
- Architecture: Vision Transformer Large, patch size 16, image size 224.
- Objective: MAE pretraining (encoder-only kept in this fork).
- Pooling: no pooling layer (`add_pooling_layer=False`).
Compute Infrastructure
- This fork does not introduce new training; conversion was done locally.
Hardware
- N/A for conversion.
Software
- Conversion used PyTorch, timm, and 🤗 Transformers.
Citation
If you use this model, please cite the original RETFound paper:
BibTeX:
@article{zhou2023foundation,
title={A foundation model for generalizable disease detection from retinal images},
author={Zhou, Yukun and Chia, Mark A and Wagner, Siegfried K and Ayhan, Murat S and Williamson, Dominic J and Struyven, Robbert R and Liu, Timing and Xu, Moucheng and Lozano, Mateo G and Woodward-Court, Peter and others},
journal={Nature},
volume={622},
number={7981},
pages={156--163},
year={2023},
publisher={Nature Publishing Group UK London}
}
APA: Zhou, Y., Chia, M. A., Wagner, S. K., Ayhan, M. S., Williamson, D. J., Struyven, R. R., et al. (2023). A foundation model for generalizable disease detection from retinal images. Nature, 622(7981), 156–163.
Glossary
- CFP: Color Fundus Photography
- MAE: Masked Autoencoder
- CLS token: Special token prepended to the patch sequence in ViT; often used as a global image representation.
More Information
- Upstream code and instructions: https://github.com/rmaphoh/RETFound
- Nature paper: https://www.nature.com/articles/s41586-023-06555-x
Model Card Authors
- Dávid Isztl (fork & conversion)
Model Card Contact
- For this fork/conversion: contact Dávid Isztl via Hugging Face.
- For upstream model/training code: ykzhoua@gmail.com or yukun.zhou.19@ucl.ac.uk.