DiaLLM Dialect Classifier

A fine-tuned DeBERTa-v3-base model for classifying English text into three dialect varieties: en-AU (Australian), en-IN (Indian), and en-UK (British).

Trained as part of the DiaLLM project — a study of dialect-adapted language models using CPT, SFT, DPO, GRPO, and GSPO across Gemma, Llama, and Qwen model families. Used as an independent evaluation metric to assess whether generated text exhibits target-dialect characteristics.

Usage

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="jordanpainter/diallm-dialect-classifier",
)

classifier("I reckon it's a ripper idea, mate.")
# [{'label': 'en-AU', 'score': 0.87}]

Labels: en-AU, en-IN, en-UK.

Training Data

Fine-tuned on BESSTIE-CW-26, a dataset of 6,243 naturally occurring English sentences annotated for dialect variety. All splits were pooled and re-split 80/10/10 with stratification to ensure balanced dialect representation in dev and test.
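The pooled 80/10/10 stratified re-split can be sketched as two chained calls to scikit-learn's `train_test_split` — shown here on placeholder data, since loading BESSTIE-CW-26 itself is dataset-specific:

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for the pooled BESSTIE-CW-26 sentences.
texts = [f"sentence {i}" for i in range(100)]
labels = (["en-AU"] * 30) + (["en-IN"] * 40) + (["en-UK"] * 30)

# First carve off 20% for dev+test, stratified by dialect label...
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# ...then split that 20% evenly into 10% dev and 10% test,
# again stratified so each dialect keeps its proportion.
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
```

The `random_state` values are illustrative; the card does not specify the seed used.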

Split    en-AU     en-IN     en-UK     Total
Train    ~1,619    ~1,973    ~1,693    ~5,285
Val      ~202      ~246      ~211      ~659
Test     192       234       201       627

Training Details

Hyperparameter    Value
Base model        microsoft/deberta-v3-base
Epochs            5 (early stopping, patience 2)
Batch size        16
Learning rate     2e-5
Warmup ratio      0.1
Weight decay      0.01
Max length        512
Hardware          1× NVIDIA A100
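The hyperparameters above map onto a `TrainingArguments` configuration roughly as follows. This is a sketch, not the actual training script: the `output_dir`, per-epoch evaluation/save cadence, and `metric_for_best_model` are assumptions not stated in the card.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hedged reconstruction of the configuration in the table above.
args = TrainingArguments(
    output_dir="diallm-dialect-classifier",  # assumption: not in the card
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    eval_strategy="epoch",           # assumption: eval cadence unstated
    save_strategy="epoch",
    load_best_model_at_end=True,     # needed for early stopping
    metric_for_best_model="eval_loss",  # assumption: metric unstated
)

# Patience of 2 evaluation rounds, matching the table.
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```

The max length of 512 would be applied at tokenization time (`truncation=True, max_length=512`), not in `TrainingArguments`.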

Evaluation

Test-set results (627 examples, stratified):

Dialect      Precision    Recall    F1
en-AU        0.6808       0.7552    0.7160
en-IN        0.8982       0.8675    0.8826
en-UK        0.7234       0.6766    0.6992
macro avg    0.7675       0.7664    0.7660

Overall accuracy: 0.7719
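The per-dialect and macro scores above follow the standard scikit-learn computation, illustrated here on a tiny made-up label set (not the actual test outputs):

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

labels = ["en-AU", "en-IN", "en-UK"]
# Toy gold labels and predictions, purely for illustration.
y_true = ["en-AU", "en-AU", "en-IN", "en-IN", "en-UK", "en-UK"]
y_pred = ["en-AU", "en-UK", "en-IN", "en-IN", "en-UK", "en-AU"]

# Per-class precision/recall/F1, in the order given by `labels`.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)
macro_f1 = f1.mean()          # unweighted mean over the three dialects
acc = accuracy_score(y_true, y_pred)
```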

Indian English is the most separable class; Australian and British English share substantial lexical overlap, which leads to frequent confusion between en-AU and en-UK.

Limitations

  • Trained on BESSTIE-CW-26, which contains shorter, naturally occurring sentences — performance may vary on longer generated text.
  • Confusion between en-AU and en-UK is expected given their shared orthographic conventions.
  • Not intended for high-stakes dialect identification; best used as a soft signal in aggregate across many examples.
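Using the classifier as a soft signal in aggregate might look like the following sketch: collect per-sentence predictions over a batch of generations and report the dialect distribution rather than trusting any single label. The predictions below are illustrative placeholders, not real model output.

```python
from collections import Counter

# Hypothetical per-sentence labels from classifying many generations.
predictions = ["en-AU", "en-AU", "en-UK", "en-AU", "en-IN", "en-AU"]

counts = Counter(predictions)
total = sum(counts.values())
# Fraction of generations assigned to each dialect.
distribution = {label: n / total for label, n in counts.items()}
# Majority label across the batch — the aggregate "soft" verdict.
majority = counts.most_common(1)[0][0]
```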