Fix eos_token init and \n\n tokenization
#3 opened by CISCai
Just setting `eos_token` to `\n\n` will cause transformers to add it to the end of the vocab (index 65530), and tokenization will then use this new token instead of the original token (index 261).
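A minimal sketch of how to observe this, assuming a checkpoint whose tokenizer config does not yet set `eos_token` to `\n\n`; the repo id is a placeholder and the token ids are the ones quoted above:

```python
from transformers import AutoTokenizer

# Hypothetical repo id standing in for this model; all ids below are the ones
# reported in this discussion and depend on the actual vocab.
repo = "fla-hub/some-rwkv-model"

# The base vocab already contains a token for "\n\n" ...
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
print(tok("\n\n", add_special_tokens=False).input_ids)      # reported: [261]

# ... but declaring "\n\n" as eos_token registers it as a *new* added token
# instead of reusing the existing entry, so encoding resolves to the new id.
tok_bad = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, eos_token="\n\n")
print(tok_bad.eos_token_id)                                  # reported: 65530
print(tok_bad("\n\n", add_special_tokens=False).input_ids)   # reported: [65530]
```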
FYI, setting `eos_token` to `\n\n` in the first place breaks tokenization by itself, as special tokens are pretokenized by transformers, causing sequences such as `\n \n\n` to be tokenized to `262 261` instead of `3330 11` as in the original tokenizer!
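For reference, a sketch of how the pretokenization split shows up at encode time (same placeholder repo id as above; whether the custom tokenizer honours `split_special_tokens` depends on its implementation):

```python
from transformers import AutoTokenizer

# Same hypothetical repo id as above; ids in the comments are those quoted
# in this report.
tok = AutoTokenizer.from_pretrained("fla-hub/some-rwkv-model",
                                    trust_remote_code=True, eos_token="\n\n")
text = "\n \n\n"

# "\n\n" is a special token now, so it is matched and split out *before* the
# vocab's greedy longest-match ever sees the full text:
print(tok(text, add_special_tokens=False).input_ids)
# reported: [262, 261]   (original RWKV tokenizer: [3330, 11])

# If the tokenizer honours it, split_special_tokens=True treats "\n\n" as
# plain text again and should recover the original segmentation:
print(tok(text, add_special_tokens=False, split_special_tokens=True).input_ids)
```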
Please contribute to RWKV-LM, since we only transform RWKV to fla's format.
These are your changes to make it run in transformers, are they not? None of this is in the original code.