model_max_length in tokenizer_config.json is not correct and causes a bug in the tokenization step.

#46
by ErfanMP - opened

The tokenizer loads the value for "model_max_length" from tokenizer_config.json (on line 2082 of transformers/tokenization_utils_base.py).
Currently the value is set to 1024, which is inconsistent with "max_length" and "max_target_positions", both set in the model's config.json, and it causes an error whenever truncation relies on it.
This PR simply fixes the value, setting it to 448, the correct value for this model.
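
A quick way to see the inconsistency after loading the files (a minimal sketch; "openai/whisper-small" is a placeholder id, substitute the checkpoint this repo hosts):

```python
# Compare the tokenizer's limit with the decoder's real limit.
# "openai/whisper-small" is a placeholder checkpoint, not necessarily this repo.
from transformers import AutoTokenizer, WhisperConfig

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
config = WhisperConfig.from_pretrained("openai/whisper-small")

# For the checkpoint this PR targets, these print 1024 and 448 respectively.
print(tokenizer.model_max_length)
print(config.max_target_positions)
```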

Temporary fixes:

  • Pass max_length=448 when tokenizing.
  • Set the correct value after loading the tokenizer: processor.tokenizer.model_max_length = 448 or processor.tokenizer.model_max_length = model.max_target_positions (see the sketch after this list).
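
A minimal sketch of both workarounds (again, the checkpoint name is a placeholder for the model this discussion is about):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder id
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Workaround 1: pass the limit explicitly at tokenization time.
labels = processor.tokenizer(
    "some transcript ...", max_length=448, truncation=True
).input_ids

# Workaround 2: patch the loaded tokenizer once, so later calls truncate correctly.
processor.tokenizer.model_max_length = model.max_target_positions  # 448
```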

Left unaddressed, this bug causes an error similar to the following:
ValueError: Labels' sequence length 499 cannot exceed the maximum allowed length of 448 tokens.
due to this check in WhisperForConditionalGeneration.forward (line 1685):
if labels.shape[1] > self.max_target_positions:
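
For reference, a hedged sketch of how the error arises when the stale value is left in place (checkpoint id is again a placeholder):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder id
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# With model_max_length still at 1024, truncation clamps at 1024, not 448,
# so a long transcript yields labels longer than the decoder allows.
long_text = "word " * 600
labels = processor.tokenizer(long_text, truncation=True, return_tensors="pt").input_ids
assert labels.shape[1] > 448

# Dummy 1-second audio clip just to drive the forward pass.
features = processor.feature_extractor(
    np.zeros(16000, dtype=np.float32), sampling_rate=16000, return_tensors="pt"
).input_features
model(input_features=features, labels=labels)  # raises the ValueError quoted above
```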

ErfanMP changed pull request status to open
