
You can set the source language in the tokenizer:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."

# src_lang="fi_FI" tells the tokenizer that the input text is Finnish
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

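As a rough picture of what the source language setting changes, the sketch below assumes the mBART-50 input layout of a source language code, then the text tokens, then an end-of-sequence token. It is a toy illustration only: `encode_with_src_lang` is a hypothetical helper, and splitting on whitespace stands in for the real subword tokenizer.

```python
# Toy illustration only: whitespace splitting stands in for the real subword tokenizer.
EOS = "</s>"


def encode_with_src_lang(text, src_lang):
    """Hypothetical helper: lay tokens out as [src_lang] tokens... [eos]."""
    return [src_lang] + text.split() + [EOS]


tokens = encode_with_src_lang("Älä sekaannu velhojen asioihin", "fi_FI")
print(tokens)
```

The first element is the language code, so the same text encoded with a different `src_lang` would differ only in that leading token.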
Tokenize the Finnish text:

# Encode the Finnish sentence so the input matches src_lang="fi_FI"
encoded_en = tokenizer(fi_text, return_tensors="pt")

MBart forces the target language id as the first generated token, which is how it selects the language to translate into.
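
Forcing the first generated token can be sketched with a toy greedy decoder. Everything here is hypothetical (the five-token vocabulary, `toy_next_token_scores`, `force_first_token`, `greedy_decode`) and it is not the transformers implementation; it only illustrates masking every score except the forced id at the first decoding step, under the assumption that decoding starts from the end-of-sequence token.

```python
import math

VOCAB = ["</s>", "en_XX", "fi_FI", "hello", "hei"]
EOS, EN, FI, HELLO, HEI = range(len(VOCAB))


def toy_next_token_scores(tokens):
    """Hypothetical stand-in for a decoder: one score per vocabulary id."""
    last = tokens[-1]
    scores = [0.0] * len(VOCAB)
    if last == EN:
        scores[HELLO] = 1.0
    elif last == FI:
        scores[HEI] = 1.0
    elif last in (HELLO, HEI):
        scores[EOS] = 1.0
    else:
        # Right after the start token, this toy model happens to prefer Finnish.
        scores[FI] = 1.0
    return scores


def force_first_token(scores, step, forced_bos_token_id):
    """At step 0, mask every score except the forced token id."""
    if step == 0 and forced_bos_token_id is not None:
        return [s if i == forced_bos_token_id else -math.inf for i, s in enumerate(scores)]
    return scores


def greedy_decode(forced_bos_token_id=None, max_new_tokens=5):
    tokens = [EOS]  # start token; using EOS here is an assumption of this sketch
    for step in range(max_new_tokens):
        scores = force_first_token(toy_next_token_scores(tokens), step, forced_bos_token_id)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == EOS:
            break
    return [VOCAB[t] for t in tokens[1:]]


print(greedy_decode())                        # unforced: the toy model picks its own language
print(greedy_decode(forced_bos_token_id=EN))  # forced: output starts with en_XX
```

Because every later step conditions on the tokens generated so far, fixing only the first token is enough to steer the rest of this toy output into the chosen language.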