Fix eos_token init and \n\n tokenization

#3

Just setting `eos_token` to `\n\n` causes transformers to append it to the end of the vocab (index 65530), and tokenization then uses this new token instead of the original one (index 261).
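This can be illustrated with a toy tokenizer built with the `tokenizers` library — this is NOT the real RWKV vocab, and the ids below are made up; the point is only that a special-token string the tokenizer cannot resolve to an existing entry gets appended at the next free index (the analogue of index 65530 here):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy stand-in vocab -- NOT the real RWKV vocab; ids are illustrative.
vocab = {"\n": 0, " ": 1, "\n ": 2, "\n\n": 3}
merges = [("\n", " "), ("\n", "\n")]
tok = Tokenizer(BPE(vocab=vocab, merges=merges))

# "\n\n" already exists in the base vocab, like token 261 in the RWKV vocab.
original_id = tok.token_to_id("\n\n")  # 3

# A special-token string the tokenizer cannot resolve is appended at the
# end of the vocab with a brand-new id -- the analogue of index 65530.
tok.add_special_tokens(["<|eos|>"])
new_id = tok.token_to_id("<|eos|>")  # 4, one past the base vocab
print(original_id, new_id)
```

In the reported bug the string is the same (`\n\n`), but the custom tokenizer fails to resolve it to index 261, so transformers falls back to appending a new entry just as above.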

FYI, setting `eos_token` to `\n\n` breaks tokenization by itself: transformers splits special tokens out during pre-tokenization, so a sequence such as `\n \n\n` is tokenized to `262 261` instead of `3330 11` as in the original tokenizer!
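A minimal sketch of that pre-tokenization effect, again with a toy `tokenizers` BPE vocab (the ids are illustrative, not the real `262 261` vs `3330 11`): once `\n\n` is registered as a special token, it is split out of the raw input before the model runs, so the surrounding text can no longer merge across it.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab in which "\n \n" is its own token, mimicking the greedy
# longest-match behaviour of the original tokenizer (ids are made up).
vocab = {"\n": 0, " ": 1, "\n ": 2, "\n\n": 3, "\n \n": 4}
merges = [("\n", " "), ("\n ", "\n"), ("\n", "\n")]
tok = Tokenizer(BPE(vocab=vocab, merges=merges))

text = "\n \n\n"
before = tok.encode(text).ids  # [4, 0]: "\n \n" + "\n"

# Registering "\n\n" as a special token makes the tokenizer cut it out
# of the raw input first, changing how the rest of the text merges.
tok.add_special_tokens(["\n\n"])
after = tok.encode(text).ids  # [2, 3]: "\n " + "\n\n"
print(before, after)
```

The same mechanism explains the reported ids: with `\n\n` special, `\n \n\n` is forced into `\n ` + `\n\n` (262, 261) instead of the greedy `\n \n` + `\n` (3330, 11).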

fla-hub org

Please contribute to RWKV-LM, since we only transform RWKV to fla's format.

These are changes you made to make it run in transformers, are they not? None of this is in the original code.

