FineCat-NLI: Pushing NLI Encoder Performance

Community Article · Published October 31, 2025


The Amazing NLI Encoder

Did you know that transformer encoder architectures can be used for zero-shot tasks? If you've never tried a tasksource model, you might be surprised how powerful they can be! Underpinning this technology is a classic NLP framework called natural language inference.

NLI models evaluate pairs of texts within a simple three-way classification: does the hypothesis contradict, remain neutral to, or follow as an entailment from the premise? This structure turns out to be remarkably flexible. By treating your text as the premise and framing a classification label as a hypothesis ("This text is a customer complaint"), you get zero-shot classification without training.
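For example, the transformers zero-shot classification pipeline wraps exactly this premise/hypothesis trick. A minimal sketch (the model, input text, and labels here are illustrative choices; any NLI-trained encoder from this post could be swapped in):

```python
from transformers import pipeline

# Any NLI-trained encoder works here; this one is discussed later in the post.
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
)

text = "I was charged twice for my subscription and nobody has responded to my emails."
labels = ["customer complaint", "feature request", "general inquiry"]

# Each label is wrapped into a hypothesis and scored against the text (the premise).
result = classifier(text, candidate_labels=labels, hypothesis_template="This text is a {}.")
print(result["labels"][0], result["scores"][0])
```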

This works because NLI models learn transferable task knowledge during training. Unlike standard classifiers that must learn each new task from scratch, NLI models understand the pattern of evaluating whether statements follow from evidence.

Encoders handle thousands of inferences per second on modest hardware. They're great for real-time classification, RAG hallucination detection, content moderation, context engineering, and other lightweight tasks. NLI encoders have emerged as a flexible component in AI workflows where quick decision-making is needed.

(For practical applications, see my previous post on 6 ways to use NLI cross-encoders.)
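Building on the RAG hallucination-detection use case above, here is a rough sketch of a grounding check: score the retrieved context (premise) against a claim from the generated answer (hypothesis) and read off the three-way prediction. The model choice and texts are illustrative, and the label names are read from the model config rather than assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any of the NLI models benchmarked later in this post could be swapped in here.
model_name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

# Premise: retrieved context; hypothesis: a claim from the generated answer.
premise = "The Eiffel Tower was completed in 1889 and stands 330 metres tall."
hypothesis = "The Eiffel Tower was finished in 1889."

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# Read label names from the model config instead of assuming an order.
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```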

To build better zero-shot encoders, we need better NLI training data.


The Data Quality Bottleneck

Laurer et al. demonstrated that NLI acts as a universal task: models trained on high-quality NLI data match the performance of task-specific classifiers while using roughly 10x less training data [1]. A key finding was that hard, informative examples are crucial for robust zero-shot performance. When models spend their training budget on trivial patterns, they fail to develop the nuanced reasoning needed for real-world tasks.

This insight led to the creation of top-performing models like DeBERTa-v3-large-mnli-fever-anli-ling-wanli, which excluded SNLI due to quality issues. Its strong performance on benchmarks like ANLI shows that more data does not necessarily make for a better NLI model.

FineCat-NLI applies this principle systematically: compile six major NLI sources, screen for quality, and reduce the easy samples that don't teach robust reasoning. The goal is to start from the primary NLI sources, pare the rows down to the most useful training examples, and end up with an open dataset capable of producing top-quality NLI models.


FineCat-NLI: A Fine Concatenation of NLI Data

FineCat-NLI compiles six NLI sources: MNLI, SNLI, ANLI (rounds 1-3), WANLI, LingNLI, and NLI-FEVER. I first concatenated these datasets (aka BigCat-NLI), then applied systematic quality screening and removed easy samples from the training split.

The curation process started with a training pilot, which I used to score the entire dataset. Binning the scores into a histogram makes it clear that the majority of the data is concentrated in the high-score bins (>0.9); these easy examples contribute little to training and dilute the important data.

I downsampled these bins, removing about 60% of the concatenated dataset and reducing it from ~2.6M to ~1M rows (a rough sketch of this step follows below). These cuts came in higher proportions from SNLI and MNLI, in line with the quality concerns noted above. Examples concentrated in the lower bins also pointed to label/quality issues, so I screened those with deepseek-ai/DeepSeek-V3.2-Exp.
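To make the downsampling step concrete, here is a rough sketch using the datasets library. The pilot_score column name, threshold, and keep fraction are illustrative stand-ins for the actual pilot-model scores and cut ratios, and the toy data stands in for the full BigCat-NLI concatenation:

```python
import random
from datasets import Dataset, concatenate_datasets

# Toy stand-in for the concatenated corpus; in practice this would be the
# ~2.6M-row BigCat-NLI concatenation with a pilot-model score attached per row.
bigcat = Dataset.from_dict({
    "premise": [f"premise {i}" for i in range(1000)],
    "hypothesis": [f"hypothesis {i}" for i in range(1000)],
    "label": [i % 3 for i in range(1000)],
    "pilot_score": [random.random() for _ in range(1000)],  # assumed column name
})

EASY_THRESHOLD = 0.9   # examples the pilot model already handles with high confidence
KEEP_FRACTION = 0.25   # illustrative; the post removes ~60% of the full corpus overall

easy = bigcat.filter(lambda x: x["pilot_score"] > EASY_THRESHOLD)
hard = bigcat.filter(lambda x: x["pilot_score"] <= EASY_THRESHOLD)

# Keep every hard example, subsample the easy bin.
easy_kept = easy.shuffle(seed=42).select(range(int(len(easy) * KEEP_FRACTION)))
curated = concatenate_datasets([hard, easy_kept])
print(len(bigcat), "->", len(curated))
```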

I also incorporated knowledge distillation by adding teacher logits from MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli, complementing the refined data with excellent ANLI soft labels. This combination forms a well-rounded dataset for NLI.
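A sketch of how teacher logits could be precomputed and attached to each row; the batch size and column names are illustrative, and `curated` is the toy dataset from the previous sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = teacher.to(device)

def add_teacher_logits(batch):
    """Attach teacher logits to each premise/hypothesis pair."""
    enc = tokenizer(batch["premise"], batch["hypothesis"],
                    padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = teacher(**enc).logits
    batch["teacher_logits"] = logits.cpu().tolist()
    return batch

# `curated` is the dataset produced in the previous sketch.
curated = curated.map(add_teacher_logits, batched=True, batch_size=64)
```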


Training with Distillation

To train a model, we use a two-component loss function:

$$
\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{CE}}\left(z^{(s)}, y\right) + \beta \cdot \mathcal{L}_{\text{MSE}}\left(z^{(s)}, z^{(t)}\right)
$$

where $z^{(s)}$ and $z^{(t)}$ are the student and teacher logits, $y$ are the ground-truth labels, and $\alpha$ and $\beta$ are equally weighted at 0.5.
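In PyTorch, the combined objective might look like this minimal sketch (the function name and dummy tensors are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, beta=0.5):
    """Two-component loss: cross-entropy on gold labels + MSE against teacher logits."""
    ce = F.cross_entropy(student_logits, labels)
    mse = F.mse_loss(student_logits, teacher_logits)
    return alpha * ce + beta * mse

# Example shapes: batch of 8 examples, 3 NLI classes.
student_logits = torch.randn(8, 3, requires_grad=True)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```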

This approach provides two sources of supervision: soft targets from the teacher model (leveraging its ANLI performance) and the ground-truth labels. Combining the efficient ModernBERT architecture with knowledge distilled from DeBERTa-v3 and high-quality training data, dleemiller/finecat-nli-l achieves performance competitive with its teacher while being roughly 20% faster and using over 40% less memory.


NLI Evaluation Results

F1-micro scores (equivalent to accuracy) for each dataset. Performance was measured at batch size 32 on an NVIDIA Blackwell PRO 6000 Max-Q.

| Model | finecat | mnli | mnli_mismatched | snli | anli_r1 | anli_r2 | anli_r3 | wanli | lingnli | Throughput (samples/s) | Peak GPU Mem (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli | 0.8233 | 0.9121 | 0.9079 | 0.8898 | 0.7960 | 0.6830 | 0.6400 | 0.7700 | 0.8821 | 454.96 | 3250.44 |
| dleemiller/finecat-nli-l | 0.8227 | 0.9152 | 0.9265 | 0.9162 | 0.7480 | 0.5700 | 0.5433 | 0.7706 | 0.8742 | 539.04 | 1838.06 |
| tasksource/ModernBERT-large-nli | 0.7959 | 0.8983 | 0.9229 | 0.9188 | 0.7260 | 0.5110 | 0.4925 | 0.6978 | 0.8504 | 543.44 | 1838.06 |
| dleemiller/ModernCE-large-nli | 0.7811 | 0.9088 | 0.9205 | 0.9273 | 0.6630 | 0.4860 | 0.4408 | 0.6576 | 0.8566 | 540.74 | 1838.06 |
| cross-encoder/nli-deberta-v3-large | 0.7618 | 0.9019 | 0.9049 | 0.9220 | 0.5300 | 0.4170 | 0.3758 | 0.6548 | 0.8466 | 448.35 | 3250.44 |

The dleemiller/finecat-nli-l model posts the top scores on the MNLI benchmarks and is strong across all of the major NLI benchmarks. It also obtains elevated performance on the adversarial ANLI suite relative to the other ModernBERT-based models, inheriting the teacher's robustness through distillation.
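For reference, here is a hedged sketch of how one of these numbers could be reproduced, using MNLI matched validation as an example. It assumes the model exposes entailment/neutral/contradiction label names in its config and that the standard MNLI label order (0=entailment, 1=neutral, 2=contradiction) applies:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dleemiller/finecat-nli-l"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

ds = load_dataset("nyu-mll/multi_nli", split="validation_matched")

# Remap the dataset's label ids to the model's own label ids (assumes the config
# uses entailment/neutral/contradiction names).
label2id = {name.lower(): idx for idx, name in model.config.id2label.items()}
remap = {0: label2id["entailment"], 1: label2id["neutral"], 2: label2id["contradiction"]}

correct = total = 0
for i in range(0, len(ds), 32):
    batch = ds[i : i + 32]
    enc = tokenizer(batch["premise"], batch["hypothesis"], padding=True,
                    truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        preds = model(**enc).logits.argmax(dim=-1).cpu().tolist()
    gold = [remap[label] for label in batch["label"]]
    correct += sum(p == g for p, g in zip(preds, gold))
    total += len(gold)

# For single-label multiclass classification, micro F1 equals accuracy.
print(f"F1-micro / accuracy: {correct / total:.4f}")
```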

Conclusion

Thanks for reading, and stay tuned as I continue with more work on NLI in upcoming posts!

References

[1] Laurer et al., 2022. "Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI."
