# Model Card for AIaLT-IICT/t5_bg_base_uncased
An uncased T5 model trained on Bulgarian literature, web data, parallel English-Bulgarian texts, Bulgarian and English Wikipedia, and other datasets.
## Model Details
403M-parameter T5 model trained on 35B tokens (41B depending on tokenization) for 3 epochs with the T5 span corruption objective.

- Tokenizer vocabulary size: 50176
- Hidden dimension: 1024
- Feed-forward dimension: 4096
- Hidden layers: 12 in both the encoder and the decoder
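These hyperparameters can be inspected directly from the published configuration. A minimal sketch (the attribute names below are the standard `T5Config` fields, not values taken from this card):

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters of the released checkpoint.
config = AutoConfig.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
print(config.vocab_size)   # tokenizer vocabulary size
print(config.d_model)      # hidden dimension
print(config.d_ff)         # feed-forward dimension
print(config.num_layers)   # encoder layers (decoder: config.num_decoder_layers)
```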
- **Developed by:** Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies - Bulgarian Academy of Sciences.
- **Funded by:** The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage, a member of the pan-European research consortia CLARIN-ERIC & DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). The training was performed on the HEMUS supercomputer at IICT-BAS, part of the research infrastructures (RIs) of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
- **Model type:** T5
- **Language(s) (NLP):** Bulgarian.
- **License:** MIT
## Uses
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.
### Direct Use
```python
>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

>>> prompt = "Тръгнах по[SEN_0] и срещу мен изкочи[SEN_1]. Първата ми[SEN_2], но после погледнах в очите й и ми се сториха[SEN_3]."
>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])
'[CLS][SEN_0] улицата[SEN_1] една кукла[SEN_2] реакция беше да я спра[SEN_3] доста[SEN_4][SEP]'
```
### Out-of-Scope Use
The model is trained only on the span corruption task. If you want to use it for any other type of text generation, it is recommended to fine-tune it first.
### Recommendations
It is recommended to use the model as a base for fine-tuning on text generation tasks. The encoder alone can also be used for text and token classification.
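A minimal sketch of the encoder-only use case (assumed usage, not an official recipe): the encoder can be loaded on its own with the standard `T5EncoderModel` class and its hidden states used as features for text or token classification.

```python
import torch
from transformers import T5EncoderModel, PreTrainedTokenizerFast

# Load only the encoder weights of the checkpoint (decoder weights are ignored).
encoder = T5EncoderModel.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

batch = tokenizer(["примерно изречение за класификация"],
                  return_tensors="pt", return_token_type_ids=False)
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # shape: (batch, seq_len, 1024)
print(hidden.shape)
```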
## Training Details
### Training Data
Trained on 29B tokens consisting of a deduplicated union of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- HPLT 2.0 Bulgarian (Cyrillic) cleaned
- Literature
- Wikipedia
- others
### Training Procedure
Trained with the T5 span corruption objective (25% noise density, mean noise span length of 3 tokens) for 3 epochs with bf16 mixed precision, an input length of 512 tokens, and a batch size of 256*512 tokens.
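As an illustration of what this objective produces (an assumed sketch of the data format, not the actual preprocessing code), a span-corrupted training pair looks like this, using the [SEN_i] sentinel tokens shown in the Direct Use example:

```python
# Illustrative span corruption example (assumed sketch, not the preprocessing
# pipeline). Selected spans in the input are replaced by sentinel tokens; the
# target lists each sentinel followed by the tokens it replaced, closed by one
# extra sentinel.
original = "тръгнах по улицата и срещу мен изскочи една кукла"

corrupted_input = "тръгнах по[SEN_0] и срещу мен[SEN_1] една кукла"
target = "[SEN_0] улицата[SEN_1] изскочи[SEN_2]"
```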
## Evaluation
The model is evaluated on the T5 span corruption objective it was trained on. It achieves a test loss of 1.61 and a test accuracy of 68.17%.
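A minimal sketch (assumed usage, not the evaluation script) of how the span corruption loss can be computed for a single example with the standard Hugging Face seq2seq API:

```python
import torch
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

# Corrupted input and its target (illustrative strings, not the test data).
inputs = tokenizer(["тръгнах по[SEN_0] и срещу мен[SEN_1] една кукла"],
                   return_tensors="pt", return_token_type_ids=False)
labels = tokenizer(["[SEN_0] улицата[SEN_1] изскочи[SEN_2]"],
                   return_tensors="pt", return_token_type_ids=False).input_ids

with torch.no_grad():
    loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
print(loss.item())
```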
## Model Card Authors
Nikolay Paev, Kiril Simov
## Model Card Contact