# Sci-BETO-large
Sci-BETO is a domain-specific RoBERTa encoder pretrained entirely on Spanish scientific texts.
## Model Description
Sci-BETO-large is a transformer-based encoder following the RoBERTa architecture (355M parameters).
It was pretrained from scratch using byte-level BPE tokenization on a large corpus of Spanish open-access scientific publications, including theses, dissertations, and peer-reviewed papers from Colombian universities and international repositories.
The model was designed to capture scientific discourse, terminology, and abstract reasoning patterns typical of research documents in economics, engineering, medicine, and the social sciences.
| Property | Value |
|---|---|
| Architecture | RoBERTa-large |
| Parameters | 355M |
| Vocabulary size | 50,262 |
| Tokenizer | Byte-Level BPE (trained from scratch) |
| Pretraining objective | Masked Language Modeling (MLM) |
| Pretraining steps | 92K |
| Max sequence length | 512 tokens |
| Framework | Transformers |
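The masking objective itself can be illustrated with the standard Transformers data collator. The snippet below is a minimal sketch, not the original pretraining script; the 15% masking rate is the library default and is used here only for illustration.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")

# Dynamic masking as in RoBERTa-style MLM pretraining: a fraction of tokens is
# replaced (mostly by the mask token) and becomes the prediction target.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("La política monetaria afecta la inflación esperada.")
batch = collator([encoding])

print(tokenizer.decode(batch["input_ids"][0]))  # sentence with random masks applied
print(batch["labels"][0])  # -100 everywhere except the masked positions
```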
## Pretraining Data
The pretraining corpus comprises over 11 billion deduplicated words from Spanish academic and scientific sources:
- Open-access repositories of Colombian universities (Universidad de los Andes, Universidad Nacional, Universidad Javeriana, Universidad del Rosario).
- CORE API and institutional repositories (theses, dissertations, working papers).
- Tax statutes in Colombia.
The final dataset covers multiple disciplines (economics, medicine, engineering, humanities), ensuring representation across scientific domains.
| Source | # Documents | # Words (deduplicated) | Percentage (%) |
|---|---|---|---|
| Universidad de los Andes | 33,858 | 365,752,780 | 3.23 |
| Universidad Nacional | 44,686 | 537,022,975 | 4.75 |
| CORE API | 2,181,689 | 9,624,189,002 | 85.10 |
| Universidad del Rosario | 22,404 | 183,356,109 | 1.62 |
| Universidad Javeriana | 25,624 | 323,918,445 | 2.86 |
| Tax Statutes in Colombia | 392 | 13,924,060 | 0.12 |
| Extra | 2 | 261,131,453 | 2.31 |
| Total | 2,308,655 | 11,309,295,824 | 100.00 |
## Benchmarks
Sci-BETO was fine-tuned and benchmarked across multiple downstream tasks, both general-domain and scientific:
| Dataset | Metric | Sci-BETO Large | Sci-BETO Base | BETO | BERTIN |
|---|---|---|---|---|---|
| WikiCAT | F1 (macro) | 0.7738 | 0.7583 | 0.7624 | 0.7598 |
| PAWS-X (es) | F1 (macro) | 0.9148 | 0.8794 | 0.8985 | 0.8961 |
| PharmaCoNER | F1 (micro) | 0.8959 | 0.8733 | 0.8845 | 0.8802 |
| CANTEMIST | F1 (micro) | 0.8809 | 0.8784 | 0.8954 | 0.8956 |
| NLI (ESNLI-R) | F1 (micro) | — | — | — | — |
| BanRep (JEL) | Exact Match | 0.6116 | 0.6043 | 0.5933 | 0.5807 |
| Rosario | F1 (macro) | 0.9203 | 0.9194 | 0.9079 | 0.9121 |
| Econ-IE | F1 (micro) | 0.5256 | 0.5158 | 0.5199 | 0.4992 |
On average, Sci-BETO achieves results comparable or superior to those of general-domain Spanish models (BETO, BERTIN) in specialized contexts (scientific, biomedical, economic), while maintaining strong performance on general text understanding.
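The numbers above come from task-specific fine-tuning of the released checkpoint. The exact benchmarking scripts are not reproduced here; the following is a rough sketch of a sequence-classification fine-tune with the Trainer API, where `train.csv`/`dev.csv` are hypothetical files with `text` and `label` columns and `num_labels=3` is a placeholder.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "Flaglab/Sci-BETO-large", num_labels=3  # set to the task's label count
)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sci-beto-cls",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```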
## Intended Use
- Research and experimentation in Spanish scientific NLP.
- Downstream fine-tuning for:
  - Text classification (scientific or academic domains)
  - Named Entity Recognition (NER)
  - Semantic similarity and paraphrase detection (see the embedding sketch after this list)
  - Knowledge extraction from academic documents
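For the semantic-similarity use case, the encoder can also serve as a sentence-embedding backbone. This is a minimal sketch; mean pooling over non-padding tokens is one common choice, not a method prescribed by the model authors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
model = AutoModel.from_pretrained("Flaglab/Sci-BETO-large")
model.eval()

sentences = [
    "La inflación anual superó las expectativas del mercado.",
    "El índice de precios al consumidor creció más de lo previsto.",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens to get one vector per sentence.
mask = inputs.attention_mask.unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```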
## Limitations
- The model may underperform on highly informal or non-academic Spanish (e.g., social media).
- It is not designed for generative tasks (e.g., text completion, chat).
- The model exhibits a domain bias toward the academic register and Latin American Spanish variants.
- The pretraining corpus does not include English or bilingual data.
## Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-large")
model = AutoModelForMaskedLM.from_pretrained("Flaglab/Sci-BETO-large")

# Use the tokenizer's own mask token ("<mask>" for RoBERTa-style tokenizers).
text = f"El Banco de la República va a subir las {tokenizer.mask_token} de interés."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring token.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = tokenizer.decode(logits[0, masked_index].argmax(dim=-1))
print("Predicted token:", predicted_token)
```