`model_max_length` in `tokenizer_config.json` is not correct and causes a bug in the tokenization step.
The tokenizer loads the value for `model_max_length` from `tokenizer_config.json` (line 2082 of `transformers/tokenization_utils_base.py`).
Currently `model_max_length` is set to 1024, which is inconsistent with `max_length` and `max_target_positions`, both set in the model's `config.json`, and it causes an error when the truncation process relies on it.
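The mismatch is easy to inspect; a minimal sketch, assuming `openai/whisper-small` as an example of an affected checkpoint (the checkpoint name is not from the issue):

```python
from transformers import WhisperConfig, WhisperTokenizer

checkpoint = "openai/whisper-small"  # example checkpoint, assumed affected

tokenizer = WhisperTokenizer.from_pretrained(checkpoint)
config = WhisperConfig.from_pretrained(checkpoint)

print(tokenizer.model_max_length)    # 1024, loaded from tokenizer_config.json
print(config.max_target_positions)   # 448, loaded from config.json
```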
The suggested PR simply fixes this value by setting it to 448, which is the correct value for this model.
Temporary fixes:
- Pass the `max_length=448` parameter when tokenizing.
- Set the correct value after loading the tokenizer (see the sketch after this list): `processor.tokenizer.model_max_length = 448` or `processor.tokenizer.model_max_length = model.max_target_positions`
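A minimal sketch of the second workaround, under the same `openai/whisper-small` checkpoint assumption:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "openai/whisper-small"  # example checkpoint, assumed affected

processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Align the tokenizer with the decoder's real limit instead of the
# stale 1024 read from tokenizer_config.json.
processor.tokenizer.model_max_length = model.config.max_target_positions  # 448

# Truncation now caps labels at 448 tokens.
labels = processor.tokenizer("a very long transcript ...", truncation=True).input_ids
assert len(labels) <= model.config.max_target_positions
```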
Not addressing the above bug causes an error similar to the following:

```
ValueError: Labels' sequence length 499 cannot exceed the maximum allowed length of 448 tokens.
```

This is due to a check in `WhisperForConditionalGeneration.forward` (line 1685 of `transformers/models/whisper/modeling_whisper.py`):

```python
if labels.shape[1] > self.max_target_positions:
```
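For reference, a minimal sketch that triggers this check, under the same assumption of an affected `openai/whisper-small` checkpoint; the zero-filled `input_features` are a stand-in for real log-mel features, since the labels check fires before the audio is used:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "openai/whisper-small"  # example checkpoint, assumed affected

processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# With model_max_length stuck at 1024, truncation does not cut labels
# down to 448, so a long transcript slips past the tokenizer.
long_transcript = "word " * 600
labels = processor.tokenizer(
    long_transcript, truncation=True, return_tensors="pt"
).input_ids  # roughly 600 tokens, i.e. > 448

dummy_features = torch.zeros(1, model.config.num_mel_bins, 3000)
model(input_features=dummy_features, labels=labels)  # raises the ValueError above
```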