# VICReg Exact Model

## Model Description
A SODA-VEC embedding model trained with the VICReg Exact loss function. It implements the exact VICReg objective, with invariance, variance, and covariance terms, for biomedical text embeddings.
This model is part of the SODA-VEC (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.
**Key Features:**
- Trained on 26.5M biomedical title-abstract pairs from PubMed Central
- Based on ModernBERT-base architecture
- Optimized for biomedical text similarity and semantic search
- Produces 768-dimensional embeddings with mean pooling
## Training Details

### Training Data

- Dataset: EMBO/soda-vec-data-full_pmc_title_abstract_paired
- Size: 26,473,900 training pairs
- Source: Complete PubMed Central baseline (July 2024)
- Format: Paired title-abstract examples optimized for contrastive learning (see the loading sketch below)
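For orientation, here is a minimal sketch of loading the pairs with the `datasets` library. The `train` split and the column names are assumptions; consult the dataset card for the actual schema.

```python
from datasets import load_dataset

# Stream the dataset rather than downloading all 26.5M rows at once
ds = load_dataset(
    "EMBO/soda-vec-data-full_pmc_title_abstract_paired",
    split="train",  # assumed split name
    streaming=True,
)

# Inspect one pair; the exact column names (e.g. "title", "abstract")
# are an assumption and may differ in the published dataset
example = next(iter(ds))
print(example)
```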
### Training Procedure
**Loss Function:** VICReg Exact, the exact VICReg objective with invariance (MSE), variance (std hinge), and covariance losses; a minimal sketch follows below.

**Coefficients:** sim=25.0, std=25.0, cov=1.0
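For reference, a minimal sketch of the standard VICReg formulation (Bardes et al., 2022) with the coefficients above; the actual training script may differ in details such as the variance epsilon or how the terms are averaged.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, coeff_sim=25.0, coeff_std=25.0, coeff_cov=1.0):
    """VICReg objective for two batches of paired embeddings, each (B, D)."""
    B, D = z_a.shape

    # Invariance: MSE between the two views of each pair
    sim_loss = F.mse_loss(z_a, z_b)

    # Variance: hinge keeping each embedding dimension's std above 1
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    std_loss = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance: push off-diagonal covariance entries toward zero
    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (B - 1)
    cov_b = (z_b_c.T @ z_b_c) / (B - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov_loss = off_diag(cov_a).pow(2).sum() / D + off_diag(cov_b).pow(2).sum() / D

    return coeff_sim * sim_loss + coeff_std * std_loss + coeff_cov * cov_loss
```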
**Base Model:** answerdotai/ModernBERT-base

**Training Configuration:**
- GPUs: 4
- Batch Size per GPU: 16
- Gradient Accumulation: 4
- Effective Batch Size: 256
- Learning Rate: 2e-05
- Warmup Steps: 100
- Pooling Strategy: mean
- Epochs: 1 (full dataset pass)
**Training Command:**

```bash
python scripts/soda-vec-train.py --config vicreg_exact --coeff_sim 25 --coeff_std 25 --coeff_cov 1 --push_to_hub --hub_org EMBO --save_limit 5
```
## Model Architecture

- Base Architecture: ModernBERT-base (22 layers, 768 hidden size)
- Pooling: Mean pooling over token embeddings
- Output Dimension: 768
- Normalization: L2-normalized embeddings (for VICReg-based models)
## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("EMBO/vicreg_exact")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute cosine similarity between the two embeddings
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
### Using Hugging Face Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_exact")
model = AutoModel.from_pretrained("EMBO/vicreg_exact")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over real tokens only (exclude padding via the attention mask)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# L2-normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute cosine similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```
## Evaluation
The model has been evaluated on comprehensive biomedical benchmarks including:
- Journal-Category Classification: Matching journals to BioRxiv subject categories
- Title-Abstract Similarity: Discriminating between related and unrelated paper pairs
- Field-Specific Separability: Distinguishing between different biological fields
- Semantic Search: Retrieval quality on biomedical text corpora
For detailed evaluation results, see the SODA-VEC benchmark notebooks.
## Intended Use
This model is designed for:
- Biomedical Semantic Search: Finding relevant papers, abstracts, or text passages (see the retrieval sketch after this list)
- Scientific Text Similarity: Computing similarity between biomedical texts
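As a quick illustration of the semantic-search use case, the following sketch retrieves the closest entries from a small hypothetical corpus using `sentence_transformers.util.semantic_search`; the corpus sentences are invented placeholders.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("EMBO/vicreg_exact")

# Hypothetical mini-corpus of abstracts; replace with your own documents
corpus = [
    "CRISPR-Cas9 enables precise genome editing in mammalian cells.",
    "Single-cell RNA sequencing reveals tumor heterogeneity.",
    "Gut microbiota composition influences host metabolism.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("gene editing technologies", convert_to_tensor=True)

# Retrieve the top-2 most similar corpus entries by cosine similarity
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```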
## Limitations
- Domain Specificity: Optimized for biomedical and life sciences text; may not perform as well on general domain text
- Language: English only
- Text Length: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
- Bias: Inherits biases from the training data (PubMed Central corpus)
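For the chunking case mentioned above, one simple, hypothetical strategy is to split a long document into word-based chunks, embed each chunk, and average the normalized chunk embeddings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/vicreg_exact")

def embed_long_document(text, chunk_words=200):
    """Naive word-based chunking; chunk_words=200 is an illustrative choice."""
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    # Embed each chunk with unit-norm embeddings
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Average the chunk embeddings and re-normalize to unit length
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)
```

Scoring each chunk separately and taking the maximum similarity is a common alternative when finer retrieval granularity matters.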
## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title  = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year   = {2024},
  url    = {https://github.com/source-data/soda-vec}
}
```
## Model Card Contact
For questions or issues, please open an issue on the SODA-VEC GitHub repository.
Model Card Generated: 2025-11-10