Parsing Different Layouts Under the Same Prompt
Hello, I’m currently building a document parsing system. Right now I’m focusing on parsing Indian passports, and the model performs well overall, except for a few issues, such as capital letters sometimes being extracted as lowercase.
My main concern is that the model is noticeably less accurate on passports from other countries. The primary reason is that it has not been trained on documents from countries such as Congo. I’m now preparing a Congo dataset, but I’m unsure whether the model will perform well after training.
Could you please advise how much data would be sufficient for reliable performance? At the moment, I’m considering collecting around 1,000 documents per country.
Thank you!
Hi @SauravCh11
The primary cause is likely the font used in the document images, which can make certain characters visually similar—especially in low- or medium-resolution scans. Common confusions include:
- `l` ↔ `1`
- `0` ↔ `o`, `O`
- `x` ↔ `X`
- `k` ↔ `K`
Training on high-resolution images and for a greater number of steps typically resolves most of these ambiguities.
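As a stopgap for the lowercase issue specifically, you can also normalize known-uppercase fields in post-processing. Here is a minimal sketch; the `UPPERCASE_FIELDS` names and the shape of `parsed` are hypothetical placeholders for whatever your parser actually returns:

# Hypothetical field names; substitute the keys your parser emits
UPPERCASE_FIELDS = {"surname", "given_names", "passport_number"}

def normalize_case(parsed: dict) -> dict:
    # Force fields that are always uppercase on passports back to uppercase,
    # undoing the model's occasional lowercase extractions
    return {
        key: value.upper() if key in UPPERCASE_FIELDS and isinstance(value, str) else value
        for key, value in parsed.items()
    }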
Training on different languages
Your worry about other languages is valid: this model isn’t properly multilingual. The base model is best suited to Latin-script languages.
For a quick test on a different language:
- Apply the `json2str` function to your `ground-truth` column to obtain clean string representations.
- Encode the resulting strings with the tokenizer and count the unknown tokens (`<unk>`), as in the snippet below.
token_ids = tokenizer.encode(json_str)
unk_token_id = tokenizer.unk_token_id
# Count tokens the tokenizer could not represent
unk_count = sum(1 for token_id in token_ids if token_id == unk_token_id)
unk_count
Decode `token_ids` back to text to check whether these `<unk>` tokens cause a loss of information. If they do, don’t fine-tune for that language.
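To make that check concrete, here is a minimal sketch, assuming the same `tokenizer` and `json_str` as above:

decoded = tokenizer.decode(token_ids)
# Wherever the decoded text shows <unk> in place of the original characters,
# those characters fall outside the tokenizer's vocabulary and are lost
print(json_str)
print(decoded)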