DistilBERT-document-classifier

Summary

This model is a fine-tuned version of distilbert/distilbert-base-uncased, trained on a dataset of approximately 8,000 synthetic, randomized document samples. Randomization was introduced through token shuffling and the Python Faker library.
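As an illustration, the randomization step could look like the Python sketch below. The function names, the 10% shuffle ratio, and the specific Faker fields are assumptions for demonstration; the card only states that token shuffling and Faker were used.

```python
import random

from faker import Faker  # pip install Faker

fake = Faker()


def shuffle_tokens(text: str, ratio: float = 0.1) -> str:
    """Randomly swap a fraction of whitespace-delimited tokens.

    The 10% ratio is an assumption; the card does not state one.
    """
    tokens = text.split()
    if len(tokens) < 2:
        return text
    for _ in range(max(1, int(len(tokens) * ratio))):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


def fill_invoice_template(template: str) -> str:
    """Populate a hypothetical document template with Faker values."""
    return template.format(
        name=fake.name(),
        address=fake.address().replace("\n", ", "),
        date=fake.date(),
        amount=fake.pricetag(),
    )


print(shuffle_tokens(fill_invoice_template(
    "Invoice for {name}, {address}. Issued {date}. Total due: {amount}."
)))
```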

Dataset

The dataset was generated using LangChain's wrapper around GPT-4o-mini, with additional randomization performed by GPT-4.5. The goal was a dataset that is 90% clean, with OCR-like noise and artifacts intentionally introduced into the remaining 10% of samples (a noise-injection sketch follows the list). These imperfections are characterized by:

  • Excessive spacing between words (e.g., three or more spaces instead of one),
  • Erratic line breaks,
  • Common OCR misreads (e.g., the digit "1" in place of a capital "I", or "3" in place of "E").
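A minimal sketch of how such noise could be injected into the clean samples is shown below. The 5% per-token corruption rate and the exact artifact mix are assumptions; the card only describes the artifact types.

```python
import random

# Character confusions named in the card: "1" for "I", "3" for "E".
OCR_MISREADS = {"I": "1", "E": "3"}


def add_ocr_noise(text: str, p: float = 0.05) -> str:
    """Corrupt clean text with OCR-like artifacts.

    Per character, apply a misread with probability p; per token,
    occasionally widen spacing or insert an erratic line break.
    The 5% rate is an assumption, not stated in the card.
    """
    noisy_tokens = []
    for token in text.split():
        token = "".join(
            OCR_MISREADS.get(ch, ch) if random.random() < p else ch
            for ch in token
        )
        noisy_tokens.append(token)
        roll = random.random()
        if roll < p:
            noisy_tokens.append("  ")   # joins as 3+ spaces between words
        elif roll < 2 * p:
            noisy_tokens.append("\n")   # erratic line break
    return " ".join(noisy_tokens)


print(add_ocr_noise("PASSPORT Issued by the United Kingdom"))
```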

Supported Document Types

This model is designed to work exclusively with the following document types:

  • Invoices
  • UK Driving Licenses
  • US Driving Licenses
  • Contracts
  • Passports (all nationalities)

Supported Languages

Currently, the model supports documents written in English only.

Prediction Output

The model's prediction is a numeric label, which you can map to its string equivalent by adding the following dictionary to your code:

  {
    0: 'invoice',
    1: 'driving_license',
    2: 'contract',
    3: 'passport'
  }
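
A sketch of end-to-end inference with the Hugging Face transformers pipeline, applying that mapping (the sample text is illustrative):

```python
from transformers import pipeline

ID2LABEL = {
    0: "invoice",
    1: "driving_license",
    2: "contract",
    3: "passport",
}

classifier = pipeline(
    "text-classification",
    model="kris-szczepaniak/DistilBERT-document-classifier",
)

result = classifier("INVOICE No. 2024-0042   Total due: $1,250.00")[0]

# If the config does not define human-readable names, the pipeline
# returns generic labels such as "LABEL_0"; map them back here.
label = result["label"]
if label.startswith("LABEL_"):
    label = ID2LABEL[int(label.split("_")[-1])]

print(label, round(result["score"], 4))
```

If the repository's config.json already carries an id2label mapping, the pipeline returns the string names directly and the manual mapping step is unnecessary.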

Model Details

  • Format: Safetensors
  • Model size: 67M params
  • Tensor type: F32
