TFM-tokenizer is a tokenizer trained on 2M samples from SmallDoge/SmallCorpus. It supports table understanding, document retrieval, tool invocation, and reasoning.
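# Tokenize plain text
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")
text = "I am TFM, a table foundation model."
# The call returns a dict-like encoding (input_ids and attention_mask), not just the IDs;
# return_tensors="pt" requires PyTorch to be installed
encoded = tokenizer([text], return_tensors="pt")
print(encoded)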
{
'input_ids': tensor([[128000, 40, 1097, 350, 26691, 11, 264, 2007, 16665, 1646, 13]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
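# Tokenize a table given as a list of rows (here with the header row first)
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")
table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]
# batch_process_tables takes a list of tables; besides input_ids it returns
# per-token row and column indices
encoded = tokenizer.batch_process_tables([table])
print(encoded)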
{
'input_ids': tensor([[ 678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]]),
'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
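To make the relationship between `input_ids`, `row_ids`, and `col_ids` concrete, here is a minimal pure-Python sketch that reproduces the layout of the printed output above. It is not the tokenizer's actual implementation: `flatten_table` is a hypothetical helper, and the grouping of token IDs into cells is inferred from the printed `row_ids` and `col_ids`, not taken from the library.

```python
from typing import Dict, List


def flatten_table(cell_tokens: List[List[List[int]]]) -> Dict[str, List[int]]:
    """Flatten per-cell token IDs row by row, tagging each token with its row and column.

    Hypothetical helper for illustration only; it is not part of the tfm package.
    """
    out: Dict[str, List[int]] = {
        "input_ids": [],
        "row_ids": [],
        "col_ids": [],
        "attention_mask": [],
    }
    for r, row in enumerate(cell_tokens):
        for c, tokens in enumerate(row):
            for tok in tokens:
                out["input_ids"].append(tok)
                out["row_ids"].append(r)   # every token of a cell shares the cell's row index
                out["col_ids"].append(c)   # ... and the cell's column index
                out["attention_mask"].append(1)  # no padding in this single-table example
    return out


# Token IDs copied from the printed output; the split into cells is an
# assumption inferred from the printed row_ids and col_ids.
cell_tokens = [
    [[678], [17166], [13020]],                       # "Name", "Age", "City"
    [[41, 287, 3059], [1691], [17198, 526, 52865]],  # "Jingze", "21", "Guangzhou"
]

flat = flatten_table(cell_tokens)
print(flat["row_ids"])
print(flat["col_ids"])
```

Under this reading, a cell that splits into several tokens (such as "Jingze" or "Guangzhou") repeats its row and column index once per token, which is why the second row contributes seven entries even though it has only three cells.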