CEAEval-Model (CEAEval-M)

CEAEval-M is the speech-LLM judge released together with our ACL paper "Evaluating the Expressive Appropriateness of Speech in Rich Contexts".

Given a Mandarin speech segment and an ideal expressive plan inferred from its surrounding narrative context, CEAEval-M produces:

<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>     # overall expressive appropriateness ∈ [0.0, 5.0]
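A minimal sketch of consuming this output format. The helper name and the example text are hypothetical; only the tag names and the [0.0, 5.0] score range come from this card.

```python
import re

def parse_judge_output(text: str):
    """Extract the overall score and <focus_audio> spans from raw
    CEAEval-M output text. Hypothetical helper, not part of the release."""
    score_match = re.search(r"<score>\s*([0-9]+(?:\.[0-9]+)?)\s*</score>", text)
    if score_match is None:
        raise ValueError("no <score> tag found in judge output")
    score = float(score_match.group(1))
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} outside [0.0, 5.0]")
    # Non-greedy match so multiple spans are captured separately.
    spans = re.findall(r"<focus_audio>(.*?)</focus_audio>", text, flags=re.S)
    return score, spans

example = (
    "<think>The plan calls for a calm tone, but the "
    "<focus_audio>rising intonation on the final syllable</focus_audio> "
    "sounds agitated.</think>\n<score>3.5</score>"
)
score, spans = parse_judge_output(example)
# score == 3.5; spans == ["rising intonation on the final syllable"]
```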

This is the judge half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner (Qwen3-8B) that first summarizes long narrative context into a four-tuple {emotion, rhythm, intonation, recording_condition} via multi-context voting.
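The voting step can be sketched as a per-dimension majority vote over plans inferred from several context windows. The function name and the plan representation below are illustrative assumptions; they are not the paper's implementation.

```python
from collections import Counter

DIMENSIONS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans):
    """Majority-vote each dimension independently across the plans the
    planner inferred from different context windows (hypothetical sketch)."""
    plan = {}
    for dim in DIMENSIONS:
        votes = Counter(p[dim] for p in candidate_plans)
        plan[dim] = votes.most_common(1)[0][0]
    return plan

plans = [
    {"emotion": "sad",  "rhythm": "slow", "intonation": "falling", "recording_condition": "studio"},
    {"emotion": "sad",  "rhythm": "slow", "intonation": "flat",    "recording_condition": "studio"},
    {"emotion": "calm", "rhythm": "slow", "intonation": "falling", "recording_condition": "studio"},
]
ideal_plan = vote_plan(plans)
# {'emotion': 'sad', 'rhythm': 'slow', 'intonation': 'falling',
#  'recording_condition': 'studio'}
```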

What's released here

  • Model weights in safetensors (4 shards, ~9B parameters, BF16) plus config.json, generation_config.json, tokenizer, preprocessor, and chat template.
  • Six extra special tokens the judge uses during training and inference (<think>, </think>, <score>, </score>, <focus_audio>, </focus_audio>), already merged into the tokenizer and embedding matrix.
  • A patched modeling_* path that implements the adaptive audio attention bias mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over system-prompt / audio / CoT regions).
  • test_datas/ with anonymised sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.
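To make the region-wise bias idea concrete: before the softmax, each attention logit can receive an additive bias that depends on which region its key token belongs to (system prompt, audio, or CoT). The snippet below is an illustration with fixed scalar biases only; the adaptive mechanism described in Sec. 3.3.4 / Appendix F is more involved, and the region labels and values here are assumptions.

```python
import math

# Illustrative region ids: 0 = system prompt, 1 = audio, 2 = CoT.
def apply_region_bias(attn_logits, region_ids, biases):
    """Add a per-region scalar bias to one query's raw attention logits
    (toy sketch; the released model learns its biases adaptively)."""
    return [logit + biases[r] for logit, r in zip(attn_logits, region_ids)]

logits = [0.0] * 6                     # one query attending to six keys
region_ids = [0, 0, 1, 1, 1, 2]        # prompt, prompt, audio x3, CoT
biases = [0.0, 1.5, -0.5]              # e.g. up-weight audio-region keys
biased = apply_region_bias(logits, region_ids, biases)

# Softmax over the biased logits: audio keys now receive more attention.
exp = [math.exp(x) for x in biased]
weights = [e / sum(exp) for e in exp]
```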

The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository; see Related resources.

Intended use and limitations

  • Intended as a research benchmark and diagnostic tool for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
  • Trained and evaluated on Mandarin audiobook speech. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.

Related resources

This model is one of three companion releases for the paper. Please use them together:

| Resource | Link |
| --- | --- |
| 📄 Paper | Evaluating the Expressive Appropriateness of Speech in Rich Contexts (ACL) |
| 💻 Code | https://github.com/wangtianrui/CEAEval |
| 🤖 Model (this repo) | https://huggingface.co/TianRW/CEAEval-Model |
| 📚 Dataset (CEAEval-D) | https://huggingface.co/datasets/TianRW/CEAEval-Data |
| 🌐 Project page / demo | https://wangtianrui.github.io/ceaeval/ |

License

Released under CC BY-NC 4.0 (non-commercial academic research use only). The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.
