model_max_length in tokenizer_config.json is not correct and causes a bug in the tokenization step.

#46
by ErfanMP - opened

The tokenizer loads the value for "model_max_length" from tokenizer_config.json (on line 2082 of transformers/tokenization_utils_base.py).
Currently the value is set to 1024, which is inconsistent with "max_length" and "max_target_positions", both set in the model's config.json, and it causes an error whenever truncation relies on it.
This PR simply fixes the value, setting it to 448, the correct value for this model.
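
A quick way to see the inconsistency after loading the files (a minimal sketch; "openai/whisper-small" is a placeholder id, substitute the checkpoint this repo hosts):

```python
# Compare the tokenizer's limit with the decoder's real limit.
# "openai/whisper-small" is a placeholder checkpoint, not necessarily this repo.
from transformers import AutoTokenizer, WhisperConfig

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
config = WhisperConfig.from_pretrained("openai/whisper-small")

# For the checkpoint this PR targets, these print 1024 and 448 respectively.
print(tokenizer.model_max_length)
print(config.max_target_positions)
```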

Temporary fixes:

  • Pass max_length=448 when tokenizing.
  • Set the correct value after loading the tokenizer: processor.tokenizer.model_max_length = 448 or processor.tokenizer.model_max_length = model.max_target_positions (see the sketch after this list).
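
A minimal sketch of both workarounds (again, the checkpoint name is a placeholder for the model this discussion is about):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder id
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Workaround 1: pass the limit explicitly at tokenization time.
labels = processor.tokenizer(
    "some transcript ...", max_length=448, truncation=True
).input_ids

# Workaround 2: patch the loaded tokenizer once, so later calls truncate correctly.
processor.tokenizer.model_max_length = model.max_target_positions  # 448
```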

Left unaddressed, this bug causes an error similar to the following:
ValueError: Labels' sequence length 499 cannot exceed the maximum allowed length of 448 tokens.
due to this check in WhisperForConditionalGeneration.forward (line 1685):
if labels.shape[1] > self.max_target_positions:
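
For reference, a hedged sketch of how the error arises when the stale value is left in place (checkpoint id is again a placeholder):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder id
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# With model_max_length still at 1024, truncation clamps at 1024, not 448,
# so a long transcript yields labels longer than the decoder allows.
long_text = "word " * 600
labels = processor.tokenizer(long_text, truncation=True, return_tensors="pt").input_ids
assert labels.shape[1] > 448

# Dummy 1-second audio clip just to drive the forward pass.
features = processor.feature_extractor(
    np.zeros(16000, dtype=np.float32), sampling_rate=16000, return_tensors="pt"
).input_features
model(input_features=features, labels=labels)  # raises the ValueError quoted above
```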

ErfanMP changed pull request status to open
