# Model Card for AIaLT-IICT/t5_bg_base_uncased
An uncased T5 model trained on Bulgarian literature, web data, parallel English-Bulgarian texts, Bulgarian and English Wikipedia, and other datasets.
## Model Details
403M-parameter T5 model trained on 35B tokens (41B depending on tokenization) for 3 epochs with the T5 span corruption objective.

- Tokenizer vocabulary size: 50176
- Hidden dimension: 1024
- Feed-forward dimension: 4096
- Hidden layers: 12 in both the encoder and the decoder
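These hyperparameters can be inspected directly from the published configuration. A minimal sketch (the attribute names below are the standard `T5Config` fields, not values taken from this card):

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters of the released checkpoint.
config = AutoConfig.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
print(config.vocab_size)   # tokenizer vocabulary size
print(config.d_model)      # hidden dimension
print(config.d_ff)         # feed-forward dimension
print(config.num_layers)   # encoder layers (decoder: config.num_decoder_layers)
```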
- **Developed by:** Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies - Bulgarian Academy of Sciences.
- **Funded by:** The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage, a member of the pan-European research consortia CLARIN-ERIC & DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). The training was performed on the HEMUS supercomputer at IICT-BAS, part of the research infrastructures (RIs) of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
- **Model type:** T5
- **Language(s) (NLP):** Bulgarian.
- **License:** MIT
## Uses
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.
### Direct Use
```python
>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

>>> prompt = "Тръгнах по[SEN_0] и срещу мен изкочи[SEN_1]. Първата ми[SEN_2], но после погледнах в очите й и ми се сториха[SEN_3]."
>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])
'[CLS][SEN_0] улицата[SEN_1] една кукла[SEN_2] реакция беше да я спра[SEN_3] доста[SEN_4][SEP]'
```
### Out-of-Scope Use
The model is trained only on the span corruption task. If you want to use it for any other type of text generation, it is recommended to fine-tune it first.
### Recommendations
It is recommended to use the model as a base for fine-tuning on text generation tasks. The encoder alone can also be used for text and token classification.
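A minimal sketch of the encoder-only use case (assumed usage, not an official recipe): the encoder can be loaded on its own with the standard `T5EncoderModel` class and its hidden states used as features for text or token classification.

```python
import torch
from transformers import T5EncoderModel, PreTrainedTokenizerFast

# Load only the encoder weights of the checkpoint (decoder weights are ignored).
encoder = T5EncoderModel.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

batch = tokenizer(["примерно изречение за класификация"],
                  return_tensors="pt", return_token_type_ids=False)
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # shape: (batch, seq_len, 1024)
print(hidden.shape)
```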
## Training Details
### Training Data
Trained on 29B tokens consisting of a deduplicated union of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- HPLT 2.0 Bulgarian (Cyrillic) cleaned
- Literature
- Wikipedia
- others
### Training Procedure
Trained with the T5 span corruption objective (25% noise density, mean noise span length of 3 tokens) for 3 epochs with bf16 mixed precision, an input length of 512 tokens, and a batch size of 256*512 tokens.
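As an illustration of what this objective produces (an assumed sketch of the data format, not the actual preprocessing code), a span-corrupted training pair looks like this, using the [SEN_i] sentinel tokens shown in the Direct Use example:

```python
# Illustrative span corruption example (assumed sketch, not the preprocessing
# pipeline). Selected spans in the input are replaced by sentinel tokens; the
# target lists each sentinel followed by the tokens it replaced, closed by one
# extra sentinel.
original = "тръгнах по улицата и срещу мен изскочи една кукла"

corrupted_input = "тръгнах по[SEN_0] и срещу мен[SEN_1] една кукла"
target = "[SEN_0] улицата[SEN_1] изскочи[SEN_2]"
```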
## Evaluation
The model is evaluated on the T5 span corruption objective it was trained on. It achieves a test loss of 1.61 and a test accuracy of 68.17%.
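A minimal sketch (assumed usage, not the evaluation script) of how the span corruption loss can be computed for a single example with the standard Hugging Face seq2seq API:

```python
import torch
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

# Corrupted input and its target (illustrative strings, not the test data).
inputs = tokenizer(["тръгнах по[SEN_0] и срещу мен[SEN_1] една кукла"],
                   return_tensors="pt", return_token_type_ids=False)
labels = tokenizer(["[SEN_0] улицата[SEN_1] изскочи[SEN_2]"],
                   return_tensors="pt", return_token_type_ids=False).input_ids

with torch.no_grad():
    loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
print(loss.item())
```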
## Model Card Authors
Nikolay Paev, Kiril Simov
## Model Card Contact