Parsing Different Layouts Under the Same Prompt
Hello, I’m currently building a document parsing system. Right now I’m focusing on parsing Indian passports, and the model performs well overall, except for a few issues, such as capital letters sometimes being extracted as lowercase.
My main concern is that the model is noticeably less accurate on passports from other countries. The primary reason is that it has not been trained on documents from countries such as Congo. I’m now preparing a Congo dataset, but I’m unsure whether the model will perform well after training.
Could you please advise how much data would be sufficient for reliable performance? At the moment, I’m considering collecting around 1,000 documents per country.
Thank you!
Hi @SauravCh11
The primary cause is likely the font used in the document images, which can make certain characters visually similar—especially in low- or medium-resolution scans. Common confusions include:
- `l` ↔ `1`
- `0` ↔ `o`, `O`
- `x` ↔ `X`
- `k` ↔ `K`
Training on high-resolution images and for a greater number of steps typically resolves most of these ambiguities.
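As a stopgap for the lowercase issue specifically, you can also normalize known-uppercase fields in post-processing. Here is a minimal sketch; the `UPPERCASE_FIELDS` names and the shape of `parsed` are hypothetical placeholders for whatever your parser actually returns:

# Hypothetical field names; substitute the keys your parser emits
UPPERCASE_FIELDS = {"surname", "given_names", "passport_number"}

def normalize_case(parsed: dict) -> dict:
    # Force fields that are always uppercase on passports back to uppercase,
    # undoing the model's occasional lowercase extractions
    return {
        key: value.upper() if key in UPPERCASE_FIELDS and isinstance(value, str) else value
        for key, value in parsed.items()
    }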
Training on different languages
Your worry about other languages is valid: this model isn’t properly multilingual. The base model is best suited to Latin-script languages.
For a quick test on a different language:
- Apply the `json2str` function to your `ground-truth` column to obtain clean string representations.
- Encode the resulting strings with the tokenizer and count the unknown tokens (`<unk>`), as in the snippet below.
token_ids = tokenizer.encode(json_str)
unk_token_id = tokenizer.unk_token_id
# Count tokens the tokenizer could not represent
unk_count = sum(1 for token_id in token_ids if token_id == unk_token_id)
unk_count
Decode `token_ids` back to text to check whether these `<unk>` tokens cause a loss of information. If they do, don’t fine-tune for that language.
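To make that check concrete, here is a minimal sketch, assuming the same `tokenizer` and `json_str` as above:

decoded = tokenizer.decode(token_ids)
# Wherever the decoded text shows <unk> in place of the original characters,
# those characters fall outside the tokenizer's vocabulary and are lost
print(json_str)
print(decoded)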