VICReg Our Model

Model Description

SODA-VEC embedding model trained with the VICReg Our loss function. The model uses L2-normalized embeddings with covariance, feature, and diagonal-only dot product losses to learn biomedical text representations.

This model is part of the SODA-VEC (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

Key Features:

  • Trained on 26.5M biomedical title-abstract pairs from PubMed Central
  • Based on ModernBERT-base architecture
  • Optimized for biomedical text similarity and semantic search
  • Produces 768-dimensional embeddings with mean pooling

Training Details

Training Data

The model was trained on 26.5M biomedical title-abstract pairs from PubMed Central.

Training Procedure

Loss Function: VICReg Our (L2-normalized embeddings with covariance loss, feature loss, and diagonal-only dot product loss)

We implemented a series of changes relative to the original VICReg described in the Meta AI paper. The main differences are:

Feature              Original VICReg      VICReg Our                  VICReg Our Contrast
Normalization        No                   Yes (L2-normalized)         Yes (L2-normalized)
Invariance (MSE)     Yes                  No                          No
Variance (hinge)     Yes                  No                          No
Covariance           Yes (unnormalized)   Yes (normalized)            Yes (normalized)
Feature correlation  No                   Yes (cross-view)            Yes (cross-view)
Sample similarity    No                   Yes (diagonal only)         Yes (diagonal + off-diagonal)

Coefficients: cov=1.0, feature=1.0, dot=1.0
Base Model: answerdotai/ModernBERT-base
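
The exact loss implementation lives in scripts/soda-vec-train.py; the sketch below is only an illustration of how the three terms named above might be combined, assuming per-view covariance on the normalized embeddings, a cross-view feature correlation pushed toward 1, and a diagonal-only dot product between matched pairs.

import torch
import torch.nn.functional as F

def vicreg_our_loss(z1, z2, coeff_cov=1.0, coeff_feature=1.0, coeff_dot=1.0):
    """Illustrative sketch of the VICReg Our loss terms (not the reference implementation)."""
    # L2-normalize both views (table row "Normalization: Yes (L2-normalized)")
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n, d = z1.shape

    # Covariance loss: penalize off-diagonal covariance within each view
    def off_diag_cov(z):
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov_loss = off_diag_cov(z1) + off_diag_cov(z2)

    # Feature loss (assumed form): each feature should correlate across views,
    # i.e. the diagonal of the cross-view correlation matrix should be near 1
    zs1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + 1e-6)
    zs2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + 1e-6)
    cross_corr = (zs1.T @ zs2) / n
    feature_loss = (torch.diagonal(cross_corr) - 1).pow(2).mean()

    # Dot product loss (diagonal only): pull matched title/abstract pairs together
    dot_loss = (1 - (z1 * z2).sum(dim=1)).mean()

    return coeff_cov * cov_loss + coeff_feature * feature_loss + coeff_dot * dot_loss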

Training Configuration:

  • GPUs: 4
  • Batch Size per GPU: 16
  • Gradient Accumulation: 4
  • Effective Batch Size: 256
  • Learning Rate: 2e-05
  • Warmup Steps: 100
  • Pooling Strategy: mean
  • Epochs: 1 (full dataset pass)
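
The effective batch size follows directly from the values above: 4 GPUs × 16 samples per GPU × 4 gradient-accumulation steps = 256.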

Training Command:

python scripts/soda-vec-train.py --config vicreg_our --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5

Model Architecture

  • Base Architecture: ModernBERT-base (12 layers, 768 hidden size)
  • Pooling: Mean pooling over token embeddings
  • Output Dimension: 768
  • Normalization: L2-normalized embeddings (for VICReg-based models)

Usage

Using Sentence-Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("EMBO/vicreg_our")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our")
model = AutoModel.from_pretrained("EMBO/vicreg_our")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    
# Mean pooling over non-padding tokens (weight by the attention mask so
# padded positions do not skew the average)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")

Evaluation

The model has been evaluated on comprehensive biomedical benchmarks including:

  • Journal-Category Classification: Matching journals to bioRxiv subject categories
  • Title-Abstract Similarity: Discriminating between related and unrelated paper pairs
  • Field-Specific Separability: Distinguishing between different biological fields
  • Semantic Search: Retrieval quality on biomedical text corpora

For detailed evaluation results, see the SODA-VEC benchmark notebooks.

Intended Use

This model is designed for:

  • Biomedical Semantic Search: Finding relevant papers, abstracts, or text passages
  • Scientific Text Similarity: Computing similarity between biomedical texts

Limitations

  • Domain Specificity: Optimized for biomedical and life sciences text; may not perform as well on general domain text
  • Language: English only
  • Text Length: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
  • Bias: Inherits biases from the training data (PubMed Central corpus)
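
A minimal chunking sketch, assuming word-based chunks of roughly abstract length and mean aggregation; the chunk size and aggregation strategy are illustrative choices, not part of the released pipeline.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("EMBO/vicreg_our")

def embed_long_text(text, max_words=200):
    # Split into word-based chunks of roughly abstract length
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)] or [text]
    # Encode each chunk, then average and re-normalize to get a document vector
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)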

Citation

If you use this model, please cite:

@software{soda_vec,
  title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year = {2024},
  url = {https://github.com/source-data/soda-vec}
}

Model Card Contact

For questions or issues, please open an issue on the SODA-VEC GitHub repository.


Model Card Generated: 2025-11-10
