Sci-BETO-large

Sci-BETO is a domain-specific RoBERTa encoder pretrained entirely on Spanish scientific texts.


Model Description

Sci-BETO-large is a transformer-based encoder following the RoBERTa architecture (355M parameters).
It was pretrained from scratch using byte-level BPE tokenization on a large corpus of Spanish open-access scientific publications, including theses, dissertations, and peer-reviewed papers from Colombian universities and international repositories.

The model was designed to capture scientific discourse, terminology, and abstract reasoning patterns typical of research documents in economics, engineering, medicine, and the social sciences.

| Property | Value |
|---|---|
| Architecture | RoBERTa-large |
| Parameters | 355M |
| Vocabulary size | 50,262 |
| Tokenizer | Byte-level BPE (trained from scratch) |
| Pretraining objective | Masked Language Modeling (MLM) |
| Pretraining steps | 92K |
| Max sequence length | 512 tokens |
| Framework | Transformers |
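
These properties can be checked directly against the published checkpoint; a minimal sketch (note that RoBERTa-style configs typically store max_position_embeddings as 514, corresponding to 512 usable positions plus two special offsets):

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Flaglab/Sci-BETO-large")
tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")

# RoBERTa-large dimensions: 24 layers, hidden size 1024, 16 attention heads.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# Vocabulary size and position budget (usually 514 = 512 usable positions + 2 offsets).
print(config.vocab_size, config.max_position_embeddings)
# Mask token used by the byte-level BPE tokenizer trained from scratch.
print(tokenizer.mask_token, len(tokenizer))
```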

Pretraining Data

The pretraining corpus comprises over 11 billion deduplicated words from Spanish academic and scientific sources:

  • Open-access repositories of Colombian universities (Universidad de los Andes, Universidad Nacional, Universidad Javeriana, Universidad del Rosario).
  • CORE API and institutional repositories (theses, dissertations, working papers).
  • Tax Statutes in Colombia.

The final dataset covers multiple disciplines (economics, medicine, engineering, humanities), ensuring representation across scientific domains.

| Source | # Documents | # Words (deduplicated) | Percentage (%) |
|---|---|---|---|
| Universidad de los Andes | 33,858 | 365,752,780 | 3.23 |
| Universidad Nacional | 44,686 | 537,022,975 | 4.75 |
| CORE API | 2,181,689 | 9,624,189,002 | 85.10 |
| Universidad del Rosario | 22,404 | 183,356,109 | 1.62 |
| Universidad Javeriana | 25,624 | 323,918,445 | 2.86 |
| Tax Statutes in Colombia | 392 | 13,924,060 | 0.12 |
| Extra | 2 | 261,131,453 | 2.31 |
| Total | 2,308,655 | 11,309,295,824 | 100.00 |
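
The percentage column follows directly from the deduplicated word counts; a small sketch that recomputes the per-source shares from the figures in the table above:

```python
# Deduplicated word counts per source, copied from the table above.
words = {
    "Universidad de los Andes": 365_752_780,
    "Universidad Nacional": 537_022_975,
    "CORE API": 9_624_189_002,
    "Universidad del Rosario": 183_356_109,
    "Universidad Javeriana": 323_918_445,
    "Tax Statutes in Colombia": 13_924_060,
    "Extra": 261_131_453,
}
total = sum(words.values())
for source, count in words.items():
    print(f"{source}: {100 * count / total:.2f}%")
```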

Benchmarks

Sci-BETO was fine-tuned and benchmarked across multiple downstream tasks, both general-domain and scientific:

| Dataset | Metric | Sci-BETO Large | Sci-BETO Base | BETO | BERTIN |
|---|---|---|---|---|---|
| WikiCAT | F1 (macro) | 0.7738 | 0.7583 | 0.7624 | 0.7598 |
| PAWS-X (es) | F1 (macro) | 0.9148 | 0.8794 | 0.8985 | 0.8961 |
| PharmaCoNER | F1 (micro) | 0.8959 | 0.8733 | 0.8845 | 0.8802 |
| CANTEMIST | F1 (micro) | 0.8809 | 0.8784 | 0.8954 | 0.8956 |
| NLI (ESNLI-R) | F1 (micro) | | | | |
| BanRep (JEL) | Exact Match | 0.6116 | 0.6043 | 0.5933 | 0.5807 |
| Rosario | F1 (macro) | 0.9203 | 0.9194 | 0.9079 | 0.9121 |
| Econ-IE | F1 (micro) | 0.5256 | 0.5158 | 0.5199 | 0.4992 |

On average, Sci-BETO performs on par with or better than general-domain Spanish models (BETO, BERTIN) in specialized contexts (scientific, biomedical, economic), while maintaining strong performance on general-domain text understanding.
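
The exact fine-tuning setups behind these numbers are not reproduced here; as a starting point, a minimal sequence-classification sketch (the label count, example sentences, and labels are placeholders, not the actual benchmark configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
# num_labels is a placeholder; set it to the size of your task's label set.
model = AutoModelForSequenceClassification.from_pretrained(
    "Flaglab/Sci-BETO-large", num_labels=4
)

# Toy batch standing in for a real dataset.
texts = [
    "La inflación anual se aceleró en el primer trimestre.",
    "El ensayo clínico evaluó la eficacia del tratamiento.",
]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would follow; in practice use Trainer or a custom loop
print(outputs.loss.item(), outputs.logits.shape)
```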


Intended Use

  • Research and experimentation in Spanish scientific NLP.
  • Downstream fine-tuning for:
    • Text classification (scientific or academic domains),
    • Named Entity Recognition (NER),
    • Semantic similarity and paraphrase detection (see the sketch after this list),
    • Knowledge extraction from academic documents.
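
For the semantic-similarity use case, mean-pooled sentence embeddings from the base encoder are a simple starting point; a sketch with placeholder sentences, not a tuned sentence-embedding setup:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
model = AutoModel.from_pretrained("Flaglab/Sci-BETO-large")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = embed("El estudio analiza el impacto de la política fiscal sobre el empleo.")
b = embed("Se examina cómo los cambios tributarios afectan el mercado laboral.")
print(torch.nn.functional.cosine_similarity(a, b).item())
```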

Limitations

  • The model may underperform on highly informal or non-academic Spanish (e.g., social media).
  • It is not designed for generative tasks (e.g., text completion, chat).
  • Domain bias toward academic register and Latin American Spanish variants.
  • The pretraining corpus excludes English and bilingual data.

Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
model = AutoModelForMaskedLM.from_pretrained("Flaglab/Sci-BETO-large")

# Build the input with the tokenizer's own mask token so the example
# works regardless of the exact mask string.
text = f"El Banco de la República va a subir las {tokenizer.mask_token} de interés."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring token.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = tokenizer.decode(logits[0, masked_index].argmax(dim=-1))
print("Predicted token:", predicted_token)
```