Model Card for AIaLT-IICT/t5_bg_base_uncased

An uncased T5 model trained on Bulgarian literature, web text, parallel English-Bulgarian texts, Bulgarian and English Wikipedia, and other datasets.

Model Details

A 403M-parameter T5 model trained on 35B tokens (41B depending on tokenization) for 3 epochs with the T5 Span Corruption objective.

Uses

The model is intended to be used as a base model for fine-tuning tasks in NLP.

Direct Use

>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')

>>> prompt = "Тръгнах по[SEN_0] и срещу мен изкочи[SEN_1]. Първата ми[SEN_2], но после погледнах в очите й и ми се сториха[SEN_3]."

>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])

'[CLS][SEN_0] улицата[SEN_1] една кукла[SEN_2] реакция беше да я спра[SEN_3] доста[SEN_4][SEP]'
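
The decoded output lists the predicted spans, each preceded by the sentinel it replaces. To read the result as a full sentence, the spans can be substituted back into the prompt. The following is a minimal sketch of that post-processing step, not part of the model or the tokenizer: the helper names `parse_spans` and `fill_prompt` are hypothetical, and the sketch assumes the sentinel format `[SEN_n]` and the `[CLS]`/`[SEP]` markers shown in the example above.

```python
import re
from typing import Dict

# Sentinel format taken from the example above: [SEN_0], [SEN_1], ...
SENTINEL = re.compile(r"\[SEN_(\d+)\]")


def parse_spans(generated: str) -> Dict[int, str]:
    """Map each sentinel index to the text generated after it."""
    text = generated.replace("[CLS]", "").replace("[SEP]", "")
    # re.split with one capture group yields:
    # [prefix, index_0, span_0, index_1, span_1, ...]
    parts = SENTINEL.split(text)
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


def fill_prompt(prompt: str, generated: str) -> str:
    """Replace every sentinel in the prompt with its predicted span."""
    spans = parse_spans(generated)
    # Sentinels with no prediction are left in place unchanged.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), prompt)


prompt = "Тръгнах по[SEN_0] и срещу мен изкочи[SEN_1]. Първата ми[SEN_2], но после погледнах в очите й и ми се сториха[SEN_3]."
generated = "[CLS][SEN_0] улицата[SEN_1] една кукла[SEN_2] реакция беше да я спра[SEN_3] доста[SEN_4][SEP]"

filled = fill_prompt(prompt, generated)
print(filled)
```

The trailing `[SEN_4]` in the output has no counterpart in the prompt, so it maps to an empty span and is ignored during substitution.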

Out-of-Scope Use

The model is trained on the span corruption task only. For any other type of text generation, fine-tuning the model first is recommended.

Recommendations

It is recommended to use the model for text generation fine-tuning tasks. The encoder of the model alone can be used for text and token classification.
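
One way to use the encoder alone for text classification is to mean-pool its hidden states and add a linear head. The sketch below is an illustration under stated assumptions, not the authors' setup: the `EncoderClassifier` class, the pooling strategy, and the number of labels are hypothetical choices. It uses a tiny randomly initialized `T5Config` so that it runs offline; for real use, the encoder would instead be loaded with `T5EncoderModel.from_pretrained('AIaLT-IICT/t5_bg_base_uncased')` and the head trained on labeled data.

```python
import torch
from torch import nn
from transformers import T5Config, T5EncoderModel


class EncoderClassifier(nn.Module):
    """Hypothetical classifier: mean-pooled T5 encoder plus a linear head."""

    def __init__(self, encoder: T5EncoderModel, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Average only over real tokens; padding positions are masked out.
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)


# Tiny random encoder (an assumption for the sketch) so no download is needed.
config = T5Config(vocab_size=128, d_model=32, d_kv=8, d_ff=64, num_layers=2, num_heads=4)
encoder = T5EncoderModel(config)
classifier = EncoderClassifier(encoder, num_labels=3)

input_ids = torch.randint(0, 128, (2, 10))   # batch of 2 sequences, 10 tokens each
attention_mask = torch.ones_like(input_ids)
logits = classifier(input_ids, attention_mask)
print(tuple(logits.shape))
```

For token classification, the same encoder output can be passed through the linear head without pooling, giving one prediction per token.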

Training Details

Training Data

Trained on 29B tokens consisting of the deduplicated union of:

Training Procedure

Trained with the T5 Span Corruption objective using a noise density of 25% and a mean noise span length of 3 tokens, for 3 epochs with bf16 mixed precision, an input length of 512 tokens, and a batch size of 256*512 tokens.
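
To make these hyperparameters concrete, the arithmetic below estimates how many tokens are masked per sequence and how many tokens one batch contains. It is a rough sketch under assumptions the card does not confirm: that the 512-token length is measured before corruption, that each noise span is replaced by a single sentinel token, and that special tokens are not counted. The function name `span_corruption_stats` is hypothetical.

```python
def span_corruption_stats(seq_len: int, noise_density: float, mean_span_len: float) -> dict:
    """Estimate per-sequence counts for T5-style span corruption."""
    noise_tokens = round(seq_len * noise_density)
    noise_spans = max(1, round(noise_tokens / mean_span_len))
    kept_tokens = seq_len - noise_tokens
    return {
        "noise_tokens": noise_tokens,
        "noise_spans": noise_spans,
        # Encoder input: unmasked tokens plus one sentinel per span.
        "encoder_input_len": kept_tokens + noise_spans,
        # Decoder target: masked tokens plus one sentinel per span.
        "target_len": noise_tokens + noise_spans,
    }


stats = span_corruption_stats(seq_len=512, noise_density=0.25, mean_span_len=3)
tokens_per_batch = 256 * 512

print(stats)
print(tokens_per_batch)
```

Under these assumptions, about 128 of the 512 tokens are masked in roughly 43 spans, and each batch holds 131072 tokens.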

Evaluation

The model is evaluated on the T5 Span Corruption objective that it was trained on. It achieves a test loss of 1.61 and a test accuracy of 68.17%.

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

nikolay.paev@iict.bas.bg
