Fix eos_token init and \n\n tokenization
#3 opened by CISCai
Just setting `eos_token` to `\n\n` will cause transformers to add it to the end of the vocab (index 65530), and tokenization will then use this new token instead of the original token (index 261).
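A minimal sketch of how to observe this, assuming a checkpoint whose tokenizer config does not yet set `eos_token` to `\n\n`; the repo id is a placeholder and the token ids are the ones quoted above:

```python
from transformers import AutoTokenizer

# Hypothetical repo id standing in for this model; all ids below are the ones
# reported in this discussion and depend on the actual vocab.
repo = "fla-hub/some-rwkv-model"

# The base vocab already contains a token for "\n\n" ...
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
print(tok("\n\n", add_special_tokens=False).input_ids)      # reported: [261]

# ... but declaring "\n\n" as eos_token registers it as a *new* added token
# instead of reusing the existing entry, so encoding resolves to the new id.
tok_bad = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, eos_token="\n\n")
print(tok_bad.eos_token_id)                                  # reported: 65530
print(tok_bad("\n\n", add_special_tokens=False).input_ids)   # reported: [65530]
```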
FYI, setting `eos_token` to `\n\n` in the first place breaks tokenization by itself, as special tokens are pretokenized by transformers, causing sequences such as `\n \n\n` to be tokenized to `262 261` instead of `3330 11` as in the original tokenizer!
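For reference, a sketch of how the pretokenization split shows up at encode time (same placeholder repo id as above; whether the custom tokenizer honours `split_special_tokens` depends on its implementation):

```python
from transformers import AutoTokenizer

# Same hypothetical repo id as above; ids in the comments are those quoted
# in this report.
tok = AutoTokenizer.from_pretrained("fla-hub/some-rwkv-model",
                                    trust_remote_code=True, eos_token="\n\n")
text = "\n \n\n"

# "\n\n" is a special token now, so it is matched and split out *before* the
# vocab's greedy longest-match ever sees the full text:
print(tok(text, add_special_tokens=False).input_ids)
# reported: [262, 261]   (original RWKV tokenizer: [3330, 11])

# If the tokenizer honours it, split_special_tokens=True treats "\n\n" as
# plain text again and should recover the original segmentation:
print(tok(text, add_special_tokens=False, split_special_tokens=True).input_ids)
```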
Please contribute to RWKV-LM, since we only transform RWKV to fla's format.
These are your changes to make it run in transformers, are they not? None of this is in the original code.