# DistilBERT-document-classifier

## Summary
This model is a fine-tuned version of distilbert/distilbert-base-uncased,
trained on approximately 8,000 synthetic, randomized document samples.
Randomization was introduced through token shuffling and the Python Faker
library.
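A minimal fine-tuning sketch is shown below, assuming a local CSV with `text` and `label` columns; the file name, split ratio, and hyperparameters are illustrative stand-ins, not the actual training configuration used for this model.

```python
# Sketch only: "documents.csv" and the hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "distilbert/distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

# Load the labeled documents and hold out 10% for evaluation.
dataset = load_dataset("csv", data_files="documents.csv")["train"].train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```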
## Dataset
The dataset was generated using LangChain's wrapper around GPT-4o-mini, with additional randomization performed by GPT-4.5. The goal was a dataset that is 90% clean, with the remaining 10% of samples intentionally corrupted with OCR-like noise and artifacts (a sketch of this noise injection follows the list below). These imperfections are characterized by:
- Excessive spacing between words (e.g., three or more spaces instead of one),
- Erratic line breaks,
- Common OCR misreads (e.g., the number "1" in place of a capital "I", or "3" in place of "E").
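The following is a hedged sketch of how such OCR-style corruption could be applied; the per-character probabilities and sample texts are assumptions, not the values actually used to build the dataset.

```python
# Illustrative noise injection: probabilities and samples are assumed, not
# taken from the actual dataset-generation pipeline.
import random

OCR_SUBSTITUTIONS = {"I": "1", "E": "3"}  # common OCR misreads

def add_ocr_noise(text: str, p: float = 0.05) -> str:
    """Corrupt a string with OCR-like artifacts at a per-character rate p."""
    out = []
    for ch in text:
        r = random.random()
        if ch == " " and r < p:
            out.append(" " * random.randint(3, 5))  # excessive spacing
        elif ch == " " and r < 2 * p:
            out.append("\n")                        # erratic line break
        elif ch in OCR_SUBSTITUTIONS and r < p:
            out.append(OCR_SUBSTITUTIONS[ch])       # 1-for-I, 3-for-E misreads
        else:
            out.append(ch)
    return "".join(out)

# Corrupt roughly 10% of samples, leaving the other 90% clean.
samples = ["INVOICE No. 0042 issued to ACME Ltd.", "This Employment Contract is made between..."]
noisy = [add_ocr_noise(s) if random.random() < 0.10 else s for s in samples]
print(noisy)
```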
## Supported Document Types
This model is designed to work exclusively with the following document types:
- Invoices
- UK Driving Licenses
- US Driving Licenses
- Contracts
- Passports (all nationalities)
## Supported Languages
Currently, the model supports documents written in English only.
## Prediction Output
The model's prediction is a numeric label, which you can map to its string equivalent by adding the following mapping to your code:
```python
{
    0: 'invoice',
    1: 'driving_license',
    2: 'contract',
    3: 'passport'
}
```
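For example, an inference call could look like the sketch below, which loads the model from the Hub and applies the mapping above; the sample text is a made-up invoice snippet, not data from the training set.

```python
# Inference sketch: the input text is an invented example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "kris-szczepaniak/DistilBERT-document-classifier"

ID2LABEL = {0: "invoice", 1: "driving_license", 2: "contract", 3: "passport"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

text = "INVOICE No. 2024-001\nBill to: ACME Ltd.\nTotal due: 1,200.00 GBP"
inputs = tokenizer(text, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(ID2LABEL[pred])  # e.g. "invoice"
```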