# CEAEval-Model (CEAEval-M)
CEAEval-M is the speech-LLM judge released together with our ACL paper "Evaluating the Expressive Appropriateness of Speech in Rich Contexts".
Given a Mandarin speech segment together with an ideal expressive plan inferred from its surrounding narrative context, CEAEval-M produces:

```
<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>   # overall expressive appropriateness ∈ [0.0, 5.0]
```
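The tag layout above is all a downstream consumer needs. Purely as an illustration (not part of the release), the sketch below shows one way to pull the score, the reasoning, and the `<focus_audio>` spans out of a raw completion; the helper name is hypothetical.

```python
import re

def parse_judge_output(text: str):
    """Extract the overall score, the reasoning, and any <focus_audio> spans
    from a raw CEAEval-M completion (illustrative helper, not shipped code)."""
    score_match = re.search(r"<score>\s*([0-9]+(?:\.[0-9]+)?)\s*</score>", text)
    score = float(score_match.group(1)) if score_match else None

    think_match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = think_match.group(1).strip() if think_match else ""

    focus_spans = re.findall(r"<focus_audio>(.*?)</focus_audio>", reasoning, flags=re.DOTALL)
    return {"score": score, "reasoning": reasoning, "focus_audio": focus_spans}

# Toy example:
# parse_judge_output("<think>tense <focus_audio>rising pitch</focus_audio> delivery</think><score>3.5</score>")
# -> {"score": 3.5, "reasoning": "...", "focus_audio": ["rising pitch"]}
```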
This is the judge half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner (Qwen3-8B) that first summarizes long narrative context into a four-tuple {emotion, rhythm, intonation, recording_condition} via multi-context voting.
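The precise voting rule is defined in the paper and the code repository. Purely to illustrate the idea, the sketch below combines several per-window planner outputs by a per-field majority vote; the helper name, tie-breaking, and example labels are illustrative, not the released implementation.

```python
from collections import Counter

FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_expressive_plan(candidate_plans):
    """Aggregate planner outputs (one dict per context window) into a single
    four-tuple by per-field majority vote. Ties fall back to the value seen first."""
    final = {}
    for field in FIELDS:
        values = [plan[field] for plan in candidate_plans if field in plan]
        final[field] = Counter(values).most_common(1)[0][0] if values else None
    return final

# e.g. vote_expressive_plan([
#     {"emotion": "sad",  "rhythm": "slow", "intonation": "falling", "recording_condition": "clean"},
#     {"emotion": "sad",  "rhythm": "slow", "intonation": "flat",    "recording_condition": "clean"},
#     {"emotion": "calm", "rhythm": "slow", "intonation": "falling", "recording_condition": "clean"},
# ])
# -> {"emotion": "sad", "rhythm": "slow", "intonation": "falling", "recording_condition": "clean"}
```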
## What's released here
- Model weights in safetensors (4 shards), plus `config.json`, `generation_config.json`, tokenizer, preprocessor, and chat template.
- Six extra special tokens the judge uses during training and inference (`<think>`, `</think>`, `<score>`, `</score>`, `<focus_audio>`, `</focus_audio>`), already merged into the tokenizer and embedding matrix (checked in the sketch after this list).
- A patched `modeling_*` file that implements the adaptive audio attention bias mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over the system-prompt / audio / CoT regions).
- `test_datas/` with anonymised sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.
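As a quick sanity check, the sketch below loads the tokenizer and weights and verifies that the six special tokens are already in the vocabulary. It assumes the standard `Auto*` classes resolve through the patched remote code with `trust_remote_code=True`; the driver in the code repository is the authoritative way to run the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TianRW/CEAEval-Model"

# trust_remote_code is required because the repo ships a patched modeling_* file
# (adaptive audio attention bias); review the remote code before enabling it.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# The six judge-specific tokens are already merged into the tokenizer and
# embedding matrix, so no resize should be needed.
special = ["<think>", "</think>", "<score>", "</score>", "<focus_audio>", "</focus_audio>"]
assert all(tok in tokenizer.get_vocab() for tok in special)
```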
The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository; see Related resources below.
## Intended use and limitations
- Intended as a research benchmark and diagnostic tool for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
- Trained and evaluated on Mandarin audiobook speech. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.
## Related resources
This model is one of three companion releases for the paper. Please use them together:
| Resource | Link |
|---|---|
| Paper | Evaluating the Expressive Appropriateness of Speech in Rich Contexts (ACL) |
| Code | https://github.com/wangtianrui/CEAEval |
| Model (this repo) | https://huggingface.co/TianRW/CEAEval-Model |
| Dataset (CEAEval-D) | https://huggingface.co/datasets/TianRW/CEAEval-Data |
| Project page / demo | https://wangtianrui.github.io/ceaeval/ |
## License
Released under CC BY-NC 4.0 (non-commercial academic research use only). The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.