Model Card for BioCAP

BioCAP is a foundation model for organismal biology images. It is initialized from OpenAI's pre-trained CLIP ViT-B/16 and trained on TreeOfLife-10M with synthetic captions (TreeOfLife-10M-Captions) as supervision. BioCAP achieves state-of-the-art performance on text-image retrieval tasks.

Model Details

Model Description

Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels. BioCAP extends BioCLIP by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals. Trained on the TreeOfLife-10M dataset augmented with trait-focused synthetic captions (TreeOfLife-10M-Captions), BioCAP achieves significant improvements across multiple biological tasks. Compared with BioCLIP, BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.

  • Developed by: Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
  • Model type: The model uses a ViT-B/16 Transformer as the image encoder and a masked self-attention Transformer as the text encoder.
  • License: MIT
  • Fine-tuned from model: OpenAI CLIP ViT-B/16
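
The dual visual projectors mentioned above can be sketched as two parallel projection heads over the shared image feature, each trained with a CLIP-style contrastive loss. The following is a minimal illustration only, not the released implementation; the class name, dimensions, and loss details are assumptions based on the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualVisualProjector(nn.Module):
    """Illustrative sketch: two projection heads over a shared ViT image feature."""
    def __init__(self, vision_dim=768, embed_dim=512):  # hypothetical dimensions
        super().__init__()
        self.taxon_proj = nn.Linear(vision_dim, embed_dim)    # aligns with taxonomic labels
        self.caption_proj = nn.Linear(vision_dim, embed_dim)  # aligns with synthetic captions

    def forward(self, image_features):
        # L2-normalize each projection for cosine-similarity contrastive training
        z_taxon = F.normalize(self.taxon_proj(image_features), dim=-1)
        z_caption = F.normalize(self.caption_proj(image_features), dim=-1)
        return z_taxon, z_caption

def contrastive_loss(image_emb, text_emb, logit_scale):
    # Standard symmetric CLIP objective; applied separately to each branch
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2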

Model Sources

  • Model: https://huggingface.co/imageomics/biocap
  • Paper: https://arxiv.org/abs/2510.20095

Uses

Direct Use

The model can be used for zero-shot classification given species names. It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.

Bias, Risks, and Limitations

BioCAP is trained on images from the TreeOfLife-10M dataset, which exhibits a long-tailed distribution across taxa. As a result, the predictions of BioCAP may be biased toward well-represented species.

BioCAP and TreeOfLife-10M paired with TreeOfLife-10M-Captions provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale. However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.

Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.

How to Get Started with the Model

You can use the open_clip library to load BioCAP.

import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
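
As a minimal usage sketch for zero-shot species classification (the species names and image path below are placeholders), the image embedding is compared against text embeddings of candidate names:

import torch
from PIL import Image

# Candidate species names and the image path are illustrative placeholders
labels = ["Cardinalis cardinalis", "Cyanocitta cristata", "Poecile atricapillus"]
text = tokenizer(labels)
image = preprocess_val(Image.open("bird.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))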

Training Details

Training Data

This model was trained on TreeOfLife-10M, a compilation of images matched to Linnaean taxonomic ranks from kingdom through species. Images are also matched with the common (vernacular) name of the subject where available. In addition, we augment the dataset with fine-grained synthetic captions (TreeOfLife-10M-Captions), automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see our dataset, TreeOfLife-10M-Captions.

Training Procedure

Preprocessing

Standard CLIP image preprocessing is used during training.
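
For reference, a rough torchvision approximation of this preprocessing is shown below, assuming the usual CLIP resize, center crop, and normalization statistics; in practice, the preprocess_train and preprocess_val transforms returned by create_model_and_transforms should be used.

from torchvision import transforms

# Approximate equivalent of standard CLIP preprocessing (assumption: usual CLIP statistics)
clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])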

Training Hyperparameters

  • Training regime: bf16 mixed precision

We used the Adam optimizer with a maximum learning rate of 1e-4, with 500 warmup steps followed by cosine decay. The per-GPU batch size was 4,096 images. We trained the model on 8 GPUs for 50 epochs with a weight decay of 0.2. Each input image was resized to 224 x 224 resolution.
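
A minimal sketch of this optimizer and schedule in PyTorch, reusing the model object loaded above (total_steps is an illustrative placeholder; the actual training uses the OpenCLIP training loop):

import math
import torch

# Hyperparameters from this card; total_steps is an illustrative placeholder
max_lr, warmup_steps, weight_decay = 1e-4, 500, 0.2
total_steps = 100_000

# The card says "Adam"; AdamW (decoupled weight decay) is the usual OpenCLIP choice
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=weight_decay)

def lr_lambda(step):
    # Linear warmup for the first 500 steps, then cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Forward passes run under bf16 mixed precision, e.g.:
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     image_features = model.encode_image(images)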

Evaluation

We evaluated the model on zero-shot species classification, text–image retrieval, and INQUIRE-rerank.

Testing Data

For species classification, we tested BioCAP on the following 10 tasks:

  • NABirds: We used 48,640 test images spanning 555 visual categories.
  • Meta-Album: We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
  • IDLE-OO Camera Traps: Species identification in camera trap images is a real-world scenario to which BioCAP can be applied. This dataset contains class-balanced test sets drawn from five LILA-BC camera trap datasets. For more information on this test set, please visit the dataset page.
  • Rare Species: This dataset was introduced in the first BioCLIP paper. It consists of 400 species labeled Near Threatened through Extinct in the Wild by the IUCN Red List, with 30 images per species. Top-1 accuracy is reported for both zero-shot and few-shot experiments.

For text-image retrieval tasks, we used:

  • INQUIRE: A benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, where the model must reorder 100 initially retrieved candidate images per query so that relevant ones are ranked higher.
  • Cornell Bird: A paired image–text dataset we collected from the Macaulay Library. It contains naturalistic bird photographs paired with descriptive text.
  • PlantID: A paired dataset we collected from PlantID. It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.

Note: More details regarding the evaluation implementation can be found in the paper. The dataset access code and the CSVs for the last two text-image retrieval tasks are provided in the evaluation section of the BioCAP Pipeline.
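
As a rough sketch of the rerank protocol (assuming the 100 candidate images are already retrieved and preprocessed; the function and variable names are hypothetical), candidates are reordered by image-text cosine similarity:

import torch

def rerank(model, tokenizer, query, candidate_images):
    """Reorder preprocessed candidate images by similarity to a text query (illustrative)."""
    with torch.no_grad():
        text_features = model.encode_text(tokenizer([query]))
        image_features = model.encode_image(candidate_images)  # e.g., a (100, 3, 224, 224) batch
        text_features /= text_features.norm(dim=-1, keepdim=True)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(-1)
    return torch.argsort(scores, descending=True)  # candidate indices, most relevant first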

Results

We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the paper.

Zero-shot species classification (top-1 accuracy). Animal tasks: NABirds, Plankton, Insects, Insects 2, Camera Trap. Plants & Fungi tasks: PlantNet, Fungi, PlantVillage, Med. Leaf.

| Model | NABirds | Plankton | Insects | Insects 2 | Camera Trap | PlantNet | Fungi | PlantVillage | Med. Leaf | Rare Species | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 39.0 | 3.3 | 7.4 | 9.3 | 28.1 | 52.5 | 8.6 | 5.1 | 15.0 | 25.7 | 19.4 |
| SigLIP | 50.2 | 3.7 | 17.6 | 9.6 | 26.7 | 76.3 | 28.3 | 26.1 | 45.4 | 30.7 | 32.3 |
| FG-CLIP | 48.3 | 1.9 | 6.9 | 9.3 | 26.4 | 55.6 | 7.3 | 5.9 | 15.7 | 29.4 | 20.7 |
| BioTrove-CLIP | 39.4 | 1.0 | 20.5 | 15.7 | 10.7 | 64.4 | 38.2 | 15.7 | 31.6 | 24.6 | 26.2 |
| BioCLIP | 58.8 | 6.1 | 34.9 | 20.5 | 31.7 | 88.2 | 40.9 | 19.0 | 38.5 | 37.1 | 37.6 |
| BioCAP (Ours) | 67.6 | 7.2 | 41.9 | 23.7 | 37.4 | 93.6 | 64.4 | 33.0 | 51.4 | 44.2 | 46.4 |
Text-image retrieval and INQUIRE-Rerank (I2T: image-to-text retrieval; T2I: text-to-image retrieval).

| Model | INQUIRE (Appear.) | INQUIRE (Behav.) | INQUIRE (Context) | INQUIRE (Species) | Cornell Bird (I2T) | Cornell Bird (T2I) | PlantID (I2T) | PlantID (T2I) | Mean |
|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 30.8 | 32.9 | 37.2 | 37.1 | 33.8 | 29.1 | 25.0 | 22.1 | 31.0 |
| SigLIP | 34.6 | 37.2 | 41.4 | 36.2 | 47.7 | 50.2 | 42.1 | 38.1 | 40.9 |
| FG-CLIP | 28.8 | 31.1 | 32.5 | 41.0 | 49.4 | 48.1 | 28.7 | 27.4 | 35.9 |
| BioTrove-CLIP | 28.5 | 22.2 | 30.5 | 39.5 | 16.5 | 13.8 | 47.4 | 50.1 | 31.1 |
| BioCLIP | 27.4 | 27.2 | 30.8 | 41.1 | 15.1 | 16.2 | 47.8 | 45.0 | 31.3 |
| BioCAP (Ours) | 37.1 | 33.6 | 37.0 | 43.0 | 54.0 | 52.0 | 81.4 | 83.0 | 52.6 |

Summary

BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks. Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification. Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.

Technical Specifications

Compute Infrastructure

The training was performed on 8 NVIDIA H100 (80GB) GPUs distributed across 2 nodes on the Ohio Supercomputer Center's Cardinal cluster. Training for 50 epochs took approximately 30 hours.

Citation

Model:

@software{Zhang_BioCAP_model,
  author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
  license = {MIT},
  title = {{BioCAP} (Revision af8db7a)},
  url = {https://huggingface.co/imageomics/biocap},
  version = {1.0.0},
  doi = {10.57967/hf/6799},
  publisher = {Hugging Face},
  year = {2025}
}

Please also cite our paper:

@article{zhang2025biocap,
  title    = {Bio{CAP}: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models},
  author   = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
  year     = {2025},
  eprint   = {2510.20095},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.20095}
}

Also consider citing OpenCLIP and BioCLIP:

@software{ilharco_gabriel_2021_5143773,
  author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  title={OpenCLIP},
  year={2021},
  doi={10.5281/zenodo.5143773},
}

Original BioCLIP Model:

@software{bioclip2023,
  author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  doi = {10.57967/hf/1511},
  month = nov,
  title = {BioCLIP},
  version = {v0.1},
  year = {2023}
}

Original BioCLIP Paper:

@inproceedings{stevens2024bioclip,
  title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life}, 
  author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024},
  pages = {19412-19424}
}

Acknowledgements

We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the Imageomics Team for their constructive feedback.

We also gratefully acknowledge the use of paired text–image data from PlantID and the Cornell Bird Macaulay Library for retrieval evaluation.

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning).

Our research is also supported by resources from the Ohio Supercomputer Center.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Model Card Authors

Ziheng Zhang

Model Card Contact

zhang.13617@osu.edu
