VICReg Our Model
Model Description
A SODA-VEC embedding model trained with the VICReg Our loss function. The loss operates on L2-normalized embeddings and combines covariance, feature, and dot-product terms (the dot-product term uses diagonal entries only) to learn biomedical text representations.
This model is part of the SODA-VEC (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.
Key Features:
- Trained on 26.5M biomedical title-abstract pairs from PubMed Central
- Based on ModernBERT-base architecture
- Optimized for biomedical text similarity and semantic search
- Produces 768-dimensional embeddings with mean pooling
Training Details
Training Data
- Dataset: EMBO/soda-vec-data-full_pmc_title_abstract_paired
- Size: 26,473,900 training pairs
- Source: Complete PubMed Central baseline (July 2024)
- Format: Paired title-abstract examples optimized for contrastive learning
Training Procedure
Loss Function: VICReg Our, which applies a covariance loss, a feature loss, and a dot-product loss (diagonal-only) to L2-normalized embeddings
VICReg Our makes several changes to the original VICReg loss from the Meta paper. The table below lists the main differences, alongside the related VICReg Our Contrast variant:
| Feature | Original VICReg | VICReg Our | VICReg Our Contrast |
|---|---|---|---|
| Normalization | No | Yes (L2-normalized) | Yes (L2-normalized) |
| Invariance (MSE) | Yes | No | No |
| Variance (hinge) | Yes | No | No |
| Covariance | Yes (unnormalized) | Yes (normalized) | Yes (normalized) |
| Feature correlation | No | Yes (cross-view) | Yes (cross-view) |
| Sample similarity | No | Yes (diagonal only) | Yes (diagonal + off-diagonal) |
Coefficients: cov=1.0, feature=1.0, dot=1.0
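To make the "Sample similarity" row concrete, the sketch below builds the matrix of dot products between two views of a toy batch (for example titles and abstracts) and separates its diagonal, the matched pairs, from the off-diagonal entries. It is a plain-Python illustration, not the SODA-VEC implementation: the helper names and the `1 - mean(diagonal)` form of the term are assumptions made for this example only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(view_a, view_b):
    """Dot product of every row in view_a with every row in view_b."""
    return [[sum(x * y for x, y in zip(a, b)) for b in view_b] for a in view_a]


# Toy batch of two pairs with 2-dimensional embeddings (hypothetical values).
titles = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 2.0]]]
abstracts = [l2_normalize(v) for v in [[3.0, 4.0], [0.0, 1.0]]]

sim = similarity_matrix(titles, abstracts)
n = len(sim)

# Diagonal entries compare each title with its own abstract.
diagonal = [sim[i][i] for i in range(n)]
# Off-diagonal entries compare titles with abstracts of other papers.
off_diagonal = [sim[i][j] for i in range(n) for j in range(n) if i != j]

# Assumed form of a diagonal-only term: reward high matched-pair similarity.
dot_loss = 1.0 - sum(diagonal) / n

print("diagonal:", diagonal)
print("off-diagonal:", off_diagonal)
print(f"dot loss (assumed form): {dot_loss:.4f}")
```

Under the "diagonal only" setting, just the matched-pair entries would contribute to the term; the "diagonal + off-diagonal" variant would also make use of the mismatched entries.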
Base Model: answerdotai/ModernBERT-base
Training Configuration:
- GPUs: 4
- Batch Size per GPU: 16
- Gradient Accumulation: 4
- Effective Batch Size: 256
- Learning Rate: 2e-05
- Warmup Steps: 100
- Pooling Strategy: mean
- Epochs: 1 (full dataset pass)
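The effective batch size follows from the three settings above, and together with the dataset size it gives a rough idea of how many optimizer steps one epoch takes. The short calculation below is a sketch: it assumes one optimizer step per effective batch and that a partial final batch is kept, neither of which is stated in this card.

```python
import math

gpus = 4
batch_size_per_gpu = 16
gradient_accumulation = 4
training_pairs = 26_473_900
warmup_steps = 100

# Samples that contribute to each optimizer step.
effective_batch_size = gpus * batch_size_per_gpu * gradient_accumulation

# Assumption: one optimizer step per effective batch, partial last batch kept.
steps_per_epoch = math.ceil(training_pairs / effective_batch_size)

# Share of the epoch spent in learning-rate warmup.
warmup_fraction = warmup_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch_size}")
print(f"Approximate optimizer steps per epoch: {steps_per_epoch}")
print(f"Warmup share of the epoch: {warmup_fraction:.4%}")
```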
Training Command:
python scripts/soda-vec-train.py --config vicreg_our --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
Model Architecture
- Base Architecture: ModernBERT-base (12 layers, 768 hidden size)
- Pooling: Mean pooling over token embeddings
- Output Dimension: 768
- Normalization: L2-normalized embeddings (for VICReg-based models)
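As a worked example of the pooling and normalization steps listed above, the sketch below averages token vectors and rescales the result to unit length. It assumes padding positions are excluded from the mean via an attention mask, as is common for mean pooling; the function names and toy values are illustrative only.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Three token positions with 2-dimensional vectors; the last one is padding.
tokens = [[2.0, 0.0], [4.0, 8.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
embedding = l2_normalize(pooled)

print("pooled:", pooled)
print("normalized:", embedding)
```

Without the mask, the padding vector would be averaged in and shift the sentence embedding, which is why batched inputs with padding need the mask.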
Usage
Using Sentence-Transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# Load the model
model = SentenceTransformer("EMBO/vicreg_our")
# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Compute similarity
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
Using Hugging Face Transformers
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our")
model = AutoModel.from_pretrained("EMBO/vicreg_our")
# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean pooling over real tokens only (padding positions are masked out)
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
# Normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)
# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
Evaluation
The model has been evaluated on a set of biomedical benchmarks:
- Journal-Category Classification: Matching journals to bioRxiv subject categories
- Title-Abstract Similarity: Discriminating between related and unrelated paper pairs
- Field-Specific Separability: Distinguishing between different biological fields
- Semantic Search: Retrieval quality on biomedical text corpora
For detailed evaluation results, see the SODA-VEC benchmark notebooks.
Intended Use
This model is designed for:
- Biomedical Semantic Search: Finding relevant papers, abstracts, or text passages
- Scientific Text Similarity: Computing similarity between biomedical texts
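For semantic search, a common pattern is to embed the query and the candidate texts, then rank candidates by cosine similarity. The sketch below shows only the ranking step on toy vectors; `rank_documents` is a hypothetical helper, and in practice the vectors would come from the model's encoder rather than being written by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # unrelated to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially related
]

ranking = rank_documents(query, docs, top_k=2)
for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

Because the model's embeddings are L2-normalized, cosine similarity and the plain dot product give the same ranking.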
Limitations
- Domain Specificity: Optimized for biomedical and life sciences text; may perform worse on general-domain text
- Language: English only
- Text Length: Optimized for titles and abstracts; longer documents may require chunking
- Bias: Inherits biases from the training data (PubMed Central corpus)
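One simple way to handle longer documents is to split them into overlapping word windows and embed each window separately. The helper below is a minimal sketch of that idea; `chunk_words` and its default sizes are assumptions chosen for illustration, not settings used by SODA-VEC.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into windows of up to max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Ten placeholder words, split into windows of 4 words with 1 word of overlap.
document = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(document, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be embedded on its own, and the per-chunk scores or vectors combined (for example by taking the best-scoring chunk) depending on the application.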
Citation
If you use this model, please cite:
@software{soda_vec,
title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
author = {EMBO},
year = {2024},
url = {https://github.com/source-data/soda-vec}
}
Model Card Contact
For questions or issues, please open an issue on the SODA-VEC GitHub repository.
Model Card Generated: 2025-11-10