# DiaLLM Dialect Classifier
A fine-tuned DeBERTa-v3-base model for classifying English text into three dialect varieties: en-AU (Australian), en-IN (Indian), and en-UK (British).
Trained as part of the DiaLLM project — a study of dialect-adapted language models using continued pre-training (CPT), supervised fine-tuning (SFT), DPO, GRPO, and GSPO across the Gemma, Llama, and Qwen model families. The classifier serves as an independent evaluation metric for assessing whether generated text exhibits target-dialect characteristics.
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="jordanpainter/diallm-dialect-classifier",
)

classifier("I reckon it's a ripper idea, mate.")
# [{'label': 'en-AU', 'score': 0.87}]
```
Labels: `en-AU`, `en-IN`, `en-UK`.
## Training Data
Fine-tuned on BESSTIE-CW-26, a dataset of 6,243 naturally occurring English sentences annotated for dialect variety. All splits were pooled and re-split 80/10/10 with stratification to ensure balanced dialect representation in dev and test.
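The pool-and-re-split step described above can be sketched in plain Python (a minimal illustration; the function name and seed are ours, not from the training code):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Pool all examples, then split each dialect 80/10/10 so every
    split keeps roughly the overall dialect proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)

    train, val, test = [], [], []
    for lab in sorted(by_label):          # deterministic order across runs
        items = by_label[lab]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Splitting per label before concatenating is what guarantees the balanced dialect representation in dev and test.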
| Split | en-AU | en-IN | en-UK | Total |
|---|---|---|---|---|
| Train | ~1,619 | ~1,973 | ~1,693 | ~5,285 |
| Val | ~202 | ~246 | ~211 | ~659 |
| Test | 192 | 234 | 201 | 627 |
## Training Details
| Hyperparameter | Value |
|---|---|
| Base model | microsoft/deberta-v3-base |
| Epochs | 5 (early stopping, patience 2) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Max length | 512 |
| Hardware | 1× NVIDIA A100 |
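The hyperparameters above map onto a Hugging Face `Trainer` setup roughly as follows (a sketch, not the project's actual training script; the output path is illustrative, and `eval_strategy` assumes a recent `transformers` version):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

args = TrainingArguments(
    output_dir="diallm-dialect-classifier",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=..., eval_dataset=...  (tokenized with max_length=512)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```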
## Evaluation
Test-set results (627 examples, stratified):
| Dialect | Precision | Recall | F1 |
|---|---|---|---|
| en-AU | 0.6808 | 0.7552 | 0.7160 |
| en-IN | 0.8982 | 0.8675 | 0.8826 |
| en-UK | 0.7234 | 0.6766 | 0.6992 |
| macro avg | 0.7675 | 0.7664 | 0.7660 |
| accuracy | | | 0.7719 |
Indian English is the most separable class; Australian and British English share substantial lexical overlap, which accounts for most of the confusion between those two classes.
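The per-class figures in the table are standard precision/recall/F1 computed from the test-set confusion matrix, which can be reproduced with a few lines of plain Python (a sketch; the helper name is ours):

```python
def per_class_metrics(confusion, labels):
    """Precision, recall, and F1 per class from a confusion matrix
    where rows are true labels and columns are predicted labels."""
    k = len(labels)
    metrics = {}
    for i, lab in enumerate(labels):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(k)) - tp  # predicted i, wrong
        fn = sum(confusion[i]) - tp                       # true i, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[lab] = {"precision": p, "recall": r, "f1": f1}
    return metrics
```

Inspecting the off-diagonal cells for en-AU and en-UK is how the inter-class confusion noted above shows up concretely.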
## Limitations
- Trained on BESSTIE-CW-26, which contains shorter, naturally occurring sentences — performance may vary on longer generated text.
- Confusion between en-AU and en-UK is expected given their shared orthographic conventions.
- Not intended for high-stakes dialect identification; best used as a soft signal in aggregate across many examples.
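The aggregate "soft signal" use suggested above can be as simple as counting top-label hits over a batch of generations (a minimal sketch; the function name is ours):

```python
def dialect_hit_rate(predictions, target):
    """Fraction of classified generations whose top label matches the
    target dialect -- an aggregate signal, not a per-example verdict."""
    hits = sum(1 for p in predictions if p["label"] == target)
    return hits / len(predictions)
```

In practice, `predictions` would be the classifier's output over many generated samples, and the hit rate for the intended dialect is compared across model variants.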