Abstract
BitNet Distillation fine-tunes full-precision large language models down to 1.58-bit precision using a SubLN module, multi-head attention distillation, and continual pre-training, achieving performance comparable to full-precision fine-tuning with substantial memory savings and faster CPU inference.
In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue, namely the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to its full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
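To make the two core ingredients concrete, below is a minimal sketch, not the paper's implementation, assuming BitDistill follows BitNet b1.58-style absmean ternary weight quantization with a straight-through estimator and a MiniLM-style KL term over attention distributions; the names `BitLinear`, `absmean_ternary_quant`, and `attention_distill_loss` are illustrative, and the exact quantizer and distillation objective in the paper may differ.

```python
# Hypothetical sketch of 1.58-bit (ternary) weights + attention distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_ternary_quant(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, 1} with a per-tensor absmean scale
    (BitNet b1.58-style; assumed here, not confirmed by the abstract)."""
    scale = w.abs().mean().clamp(min=eps)      # gamma = mean(|W|)
    w_q = (w / scale).round().clamp(-1, 1)     # ternary values in {-1, 0, 1}
    return w_q, scale


class BitLinear(nn.Module):
    """Linear layer with ternary weights. Full-precision master weights are
    kept for training; a straight-through estimator passes gradients."""

    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, scale = absmean_ternary_quant(self.weight)
        # Forward uses the quantized weights; backward flows through the
        # full-precision master weights (straight-through estimator).
        w = self.weight + (w_q * scale - self.weight).detach()
        return F.linear(x, w, self.bias)


def attention_distill_loss(student_attn: torch.Tensor,
                           teacher_attn: torch.Tensor) -> torch.Tensor:
    """MiniLM-style distillation: KL(teacher || student) over post-softmax
    attention distributions of shape (batch, heads, queries, keys),
    averaged over batch, heads, and query positions."""
    log_t = teacher_attn.clamp_min(1e-9).log()
    log_s = student_attn.clamp_min(1e-9).log()
    return (teacher_attn * (log_t - log_s)).sum(dim=-1).mean()
```

In this sketch, the SubLN module of the full pipeline would correspond to inserting an extra normalization before the quantized projections; the continual pre-training warm-up is a training-schedule choice and needs no code change to the layer itself.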
Community
Is there any way to convert those distilled 1.58-bit weights to GGUF afterwards? Would love to experiment with this, but I need to deploy on llama.cpp :o
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning (2025)
- End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost (2025)
- Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer (2025)
- XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression (2025)
- PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models (2025)
- Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing (2025)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`