|
The values in these tensors depend on the language used and are identified by the tokenizer's lang2id and id2lang attributes. |
|
In this example, load the FacebookAI/xlm-clm-enfr-1024 checkpoint (Causal language modeling, English-French): |
|
|
|
import torch |
|
from transformers import XLMTokenizer, XLMWithLMHeadModel |
|
tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") |
|
model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") |
|
|
|
The lang2id attribute of the tokenizer displays this model's languages and their ids: |
|
|
|
print(tokenizer.lang2id) |
|
{'en': 0, 'fr': 1} |
|
|
|
Next, create an example input: |
|
|
|
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1 |
|
|
|
Set the language id as "en" and use it to define the language embedding. |