CEAEval-Model (CEAEval-M)

CEAEval-M is the speech-LLM judge released together with our ACL paper "Evaluating the Expressive Appropriateness of Speech in Rich Contexts".

Given a Mandarin speech segment and an ideal expressive plan inferred from its surrounding narrative context, CEAEval-M produces:

<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>     # overall expressive appropriateness ∈ [0.0, 5.0]
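A minimal sketch of consuming this output format. The helper name and the example text are hypothetical; only the tag names and the [0.0, 5.0] score range come from this card.

```python
import re

def parse_judge_output(text: str):
    """Extract the overall score and <focus_audio> spans from raw
    CEAEval-M output text. Hypothetical helper, not part of the release."""
    score_match = re.search(r"<score>\s*([0-9]+(?:\.[0-9]+)?)\s*</score>", text)
    if score_match is None:
        raise ValueError("no <score> tag found in judge output")
    score = float(score_match.group(1))
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} outside [0.0, 5.0]")
    # Non-greedy match so multiple spans are captured separately.
    spans = re.findall(r"<focus_audio>(.*?)</focus_audio>", text, flags=re.S)
    return score, spans

example = (
    "<think>The plan calls for a calm tone, but the "
    "<focus_audio>rising intonation on the final syllable</focus_audio> "
    "sounds agitated.</think>\n<score>3.5</score>"
)
score, spans = parse_judge_output(example)
# score == 3.5; spans == ["rising intonation on the final syllable"]
```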

This is the judge half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner (Qwen3-8B) that first summarizes long narrative context into a four-tuple {emotion, rhythm, intonation, recording_condition} via multi-context voting.
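The voting step can be sketched as a per-dimension majority vote over plans inferred from several context windows. The function name and the plan representation below are illustrative assumptions; they are not the paper's implementation.

```python
from collections import Counter

DIMENSIONS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans):
    """Majority-vote each dimension independently across the plans the
    planner inferred from different context windows (hypothetical sketch)."""
    plan = {}
    for dim in DIMENSIONS:
        votes = Counter(p[dim] for p in candidate_plans)
        plan[dim] = votes.most_common(1)[0][0]
    return plan

plans = [
    {"emotion": "sad",  "rhythm": "slow", "intonation": "falling", "recording_condition": "studio"},
    {"emotion": "sad",  "rhythm": "slow", "intonation": "flat",    "recording_condition": "studio"},
    {"emotion": "calm", "rhythm": "slow", "intonation": "falling", "recording_condition": "studio"},
]
ideal_plan = vote_plan(plans)
# {'emotion': 'sad', 'rhythm': 'slow', 'intonation': 'falling',
#  'recording_condition': 'studio'}
```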

What's released here

  • Model weights in safetensors (4 shards, ~9B parameters, BF16) plus config.json, generation_config.json, tokenizer, preprocessor, and chat template.
  • Six extra special tokens the judge uses during training and inference (<think>, </think>, <score>, </score>, <focus_audio>, </focus_audio>), already merged into the tokenizer and embedding matrix.
  • A patched modeling_* path that implements the adaptive audio attention bias mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over system-prompt / audio / CoT regions).
  • test_datas/ with anonymised sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.
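To make the region-wise bias idea concrete: before the softmax, each attention logit can receive an additive bias that depends on which region its key token belongs to (system prompt, audio, or CoT). The snippet below is an illustration with fixed scalar biases only; the adaptive mechanism described in Sec. 3.3.4 / Appendix F is more involved, and the region labels and values here are assumptions.

```python
import math

# Illustrative region ids: 0 = system prompt, 1 = audio, 2 = CoT.
def apply_region_bias(attn_logits, region_ids, biases):
    """Add a per-region scalar bias to one query's raw attention logits
    (toy sketch; the released model learns its biases adaptively)."""
    return [logit + biases[r] for logit, r in zip(attn_logits, region_ids)]

logits = [0.0] * 6                     # one query attending to six keys
region_ids = [0, 0, 1, 1, 1, 2]        # prompt, prompt, audio x3, CoT
biases = [0.0, 1.5, -0.5]              # e.g. up-weight audio-region keys
biased = apply_region_bias(logits, region_ids, biases)

# Softmax over the biased logits: audio keys now receive more attention.
exp = [math.exp(x) for x in biased]
weights = [e / sum(exp) for e in exp]
```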

The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository; see Related resources.

Intended use and limitations

  • Intended as a research benchmark and diagnostic tool for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
  • Trained and evaluated on Mandarin audiobook speech. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.

Related resources

This model is one of three companion releases for the paper. Please use them together:

| Resource | Link |
| --- | --- |
| 📄 Paper | Evaluating the Expressive Appropriateness of Speech in Rich Contexts (ACL) |
| 💻 Code | https://github.com/wangtianrui/CEAEval |
| 🤖 Model (this repo) | https://huggingface.co/TianRW/CEAEval-Model |
| 📚 Dataset (CEAEval-D) | https://huggingface.co/datasets/TianRW/CEAEval-Data |
| 🌐 Project page / demo | https://wangtianrui.github.io/ceaeval/ |

License

Released under CC BY-NC 4.0 (non-commercial academic research use only). The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.
