# DistilBERT-document-classifier

## Summary
This model is a fine-tuned version of distilbert/distilbert-base-uncased,
trained on approximately 8,000 synthetic, randomized document samples.
Randomization was introduced through token shuffling and the Python Faker
library.
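A minimal fine-tuning sketch is shown below, assuming a local CSV with `text` and `label` columns; the file name, split ratio, and hyperparameters are illustrative stand-ins, not the actual training configuration used for this model.

```python
# Sketch only: "documents.csv" and the hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "distilbert/distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

# Load the labeled documents and hold out 10% for evaluation.
dataset = load_dataset("csv", data_files="documents.csv")["train"].train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```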
## Dataset
The dataset was generated using LangChain's wrapper around GPT-4o-mini, with additional randomization performed by GPT-4.5. The goal was a dataset that is 90% clean, with the remaining 10% of samples intentionally corrupted with OCR-like noise and artifacts (a sketch of this noise injection follows the list below). These imperfections are characterized by:
- Excessive spacing between words (e.g., three or more spaces instead of one),
- Erratic line breaks,
- Common OCR misreads (e.g., the number "1" in place of a capital "I", or "3" in place of "E").
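The following is a hedged sketch of how such OCR-style corruption could be applied; the per-character probabilities and sample texts are assumptions, not the values actually used to build the dataset.

```python
# Illustrative noise injection: probabilities and samples are assumed, not
# taken from the actual dataset-generation pipeline.
import random

OCR_SUBSTITUTIONS = {"I": "1", "E": "3"}  # common OCR misreads

def add_ocr_noise(text: str, p: float = 0.05) -> str:
    """Corrupt a string with OCR-like artifacts at a per-character rate p."""
    out = []
    for ch in text:
        r = random.random()
        if ch == " " and r < p:
            out.append(" " * random.randint(3, 5))  # excessive spacing
        elif ch == " " and r < 2 * p:
            out.append("\n")                        # erratic line break
        elif ch in OCR_SUBSTITUTIONS and r < p:
            out.append(OCR_SUBSTITUTIONS[ch])       # 1-for-I, 3-for-E misreads
        else:
            out.append(ch)
    return "".join(out)

# Corrupt roughly 10% of samples, leaving the other 90% clean.
samples = ["INVOICE No. 0042 issued to ACME Ltd.", "This Employment Contract is made between..."]
noisy = [add_ocr_noise(s) if random.random() < 0.10 else s for s in samples]
print(noisy)
```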
## Supported Document Types
This model is designed to work exclusively with the following document types:
- Invoices
- UK Driving Licenses
- US Driving Licenses
- Contracts
- Passports (all nationalities)
## Supported Languages
Currently, the model supports documents written in English only.
## Prediction Output
The model's prediction is a numeric label, which you can map to its string equivalent by adding the following mapping to your code:
```python
{
    0: 'invoice',
    1: 'driving_license',
    2: 'contract',
    3: 'passport'
}
```
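For example, an inference call could look like the sketch below, which loads the model from the Hub and applies the mapping above; the sample text is a made-up invoice snippet, not data from the training set.

```python
# Inference sketch: the input text is an invented example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "kris-szczepaniak/DistilBERT-document-classifier"

ID2LABEL = {0: "invoice", 1: "driving_license", 2: "contract", 3: "passport"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

text = "INVOICE No. 2024-001\nBill to: ACME Ltd.\nTotal due: 1,200.00 GBP"
inputs = tokenizer(text, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(ID2LABEL[pred])  # e.g. "invoice"
```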