	---
	license: apache-2.0
	tags:
	- audio
	- speech
	- audio-to-audio
	- speech-language-models
	datasets:
	- amphion/Emilia-Dataset
	- facebook/multilingual_librispeech
	- CSTR-Edinburgh/vctk
	- google/fleurs
	- mozilla-foundation/common_voice_13_0
	- mythicinfinity/libritts_r
	---

	# Model Details

	Distill-NeuCodec is a version of NeuCodec with a compatible, distilled encoder.

	The distilled encoder has roughly 10x fewer parameters and uses ~7.5x fewer MACs (multiply-accumulate operations) at inference time.

	The distilled model makes the following changes to the encoder:
	* Swap the notoriously slow [BigCodec](https://arxiv.org/abs/2409.05377) acoustic encoder for the [SQCodec](https://arxiv.org/abs/2504.04949) acoustic encoder (70M → 36M parameters)
	* Swap the [w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) semantic encoder for [DistilHuBERT](https://huggingface.co/ntu-spml/distilhubert) (600M → 21M parameters)
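As a quick sanity check on the size reduction, the snippet below adds up the parameter counts quoted in the list above. It is only a back-of-the-envelope sketch: it assumes the two swapped components account for essentially all encoder parameters, which this card does not state explicitly.

```python
# Parameter counts in millions, taken from the list above.
# Assumption: these two components dominate the encoder's parameter count.
original = {"BigCodec acoustic encoder": 70, "w2v-bert-2.0 semantic encoder": 600}
distilled = {"SQCodec acoustic encoder": 36, "DistilHuBERT semantic encoder": 21}

original_total = sum(original.values())
distilled_total = sum(distilled.values())
reduction = original_total / distilled_total

print(f"{original_total}M -> {distilled_total}M ({reduction:.1f}x smaller)")
```

Under that assumption the ratio comes out slightly above 10x, in line with the headline figure.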

	Our work largely extends [X-Codec2.0](https://huggingface.co/HKUSTAudio/xcodec2) and [SQCodec](https://arxiv.org/abs/2504.04949).

	- Developed by: Neuphonic
	- Model type: Neural Audio Codec
	- License: apache-2.0
	- Repository: https://github.com/neuphonic/neucodec
	- Paper: [arXiv](https://arxiv.org/abs/2509.09550)
	- Pre-encoded Datasets:
	  - [Emilia-YODAS-EN](https://huggingface.co/datasets/neuphonic/emilia-yodas-english-neucodec)
	  - More coming soon!


	## Get Started

	Use the code below to get started with the model.

	To install from PyPI in a dedicated environment, using Python 3.10 or above:

	```bash
	conda create -n neucodec python=3.10
	conda activate neucodec
	pip install neucodec
	```
	Then, to use it in Python:

	```python
	import librosa
	import torch
	import torchaudio
	from torchaudio import transforms as T
	from neucodec import DistillNeuCodec

	model = DistillNeuCodec.from_pretrained("neuphonic/distill-neucodec")
	model.eval().cuda()

	# Load the example clip and resample to 16 kHz only when needed.
	y, sr = torchaudio.load(librosa.ex("libri1"))
	if sr != 16_000:
	    y = T.Resample(sr, 16_000)(y)
	y = y[None, ...]  # always add the batch dim: (B, 1, T_16)

	with torch.no_grad():
	    fsq_codes = model.encode_code(y)
	    # fsq_codes = model.encode_code(librosa.ex("libri1")) # or directly pass your filepath!
	    print(f"Codes shape: {fsq_codes.shape}")
	    recon = model.decode_code(fsq_codes).cpu() # (B, 1, T_24)

	# Save the first item in the batch at 24 kHz.
	torchaudio.save("reconstructed.wav", recon[0, :, :], 24_000)
	```

	## Training Details

	The model was trained on the same data as the full model, with an additional distillation loss (MSE between the distilled and original encoder outputs).
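To make the distillation term concrete, here is a minimal plain-Python sketch of an MSE loss between student (distilled) and teacher (original) encoder outputs. It is not the actual training code: the function names, the flat-list representation of encoder outputs, and the `distill_weight` hyperparameter are all hypothetical. In a PyTorch training loop the same quantity would typically be computed with `torch.nn.functional.mse_loss`.

```python
def mse(student_out, teacher_out):
    """Mean squared error between two equal-length sequences of floats."""
    if len(student_out) != len(teacher_out):
        raise ValueError("encoder outputs must have the same length")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(squared_errors) / len(squared_errors)


def total_loss(codec_loss, student_out, teacher_out, distill_weight=1.0):
    """Add the distillation term to the usual codec training loss.

    distill_weight is a hypothetical hyperparameter; this card does not
    say how the two terms were weighted.
    """
    return codec_loss + distill_weight * mse(student_out, teacher_out)


# Toy example: flattened encoder outputs for a single frame.
teacher_out = [0.5, -1.0, 2.0]
student_out = [0.0, -1.0, 1.0]
print(total_loss(2.0, student_out, teacher_out))
```

The distillation term goes to zero as the distilled encoder's outputs approach the original encoder's, which is what lets the distilled encoder stay compatible with the rest of the model.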