Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions
Abstract
Embracing minority ratings, multiple annotators, and multi-emotion predictions in speech emotion recognition improves system robustness and alignment with human perception.
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements among raters are common. Conventional methods treat these disagreements as noise, aggregating labels into a single consensus target. While this simplifies SER as a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges these assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them with soft-label distributions. Models trained on individual annotators' ratings and jointly optimized with standard SER systems improve performance on consensus-labeled test sets. (2) Redefine SER evaluation by including all emotional data and allowing co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show superior performance over majority and plurality labeling. (3) Construct a penalization matrix to discourage unlikely emotion combinations during training. Integrating it into the loss function further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
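The three proposals lend themselves to a short illustration. Below is a minimal sketch, not the dissertation's implementation: the emotion set, the annotator ratings, the penalty matrix, and all helper names (`soft_label`, `all_inclusive_label`, `penalized_bce`) are hypothetical, chosen only to show how per-annotator ratings can become soft-label and all-inclusive multi-label targets, and how a penalization matrix could enter a loss.

```python
# Sketch of the abstract's three ideas on toy data (all values hypothetical):
# (1) soft-label distributions built from every rating, including minority ones;
# (2) an "all-inclusive rule" keeping each emotion any rater chose as a target;
# (3) a penalization term discouraging unlikely emotion co-occurrences.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set

# Ratings from five annotators for one utterance; a rater may pick >1 emotion.
ratings = [["sad"], ["sad", "angry"], ["sad"], ["neutral"], ["angry"]]

def soft_label(ratings, emotions=EMOTIONS):
    """Distribution over emotions; each annotator contributes one vote,
    split evenly across the emotions they selected (one design choice
    among several possible weightings)."""
    counts = np.zeros(len(emotions))
    for r in ratings:
        for e in r:
            counts[emotions.index(e)] += 1.0 / len(r)
    return counts / counts.sum()

def all_inclusive_label(ratings, emotions=EMOTIONS):
    """Multi-hot target: any emotion chosen by at least one rater is kept,
    rather than collapsing to a majority or plurality label."""
    target = np.zeros(len(emotions))
    for r in ratings:
        for e in r:
            target[emotions.index(e)] = 1.0
    return target

def penalized_bce(probs, target, penalty, lam=0.1):
    """Binary cross-entropy plus a penalty on unlikely emotion pairs:
    penalty[i, j] is large when emotions i and j rarely co-occur, so
    jointly predicting both is discouraged."""
    eps = 1e-8
    bce = -(target * np.log(probs + eps)
            + (1 - target) * np.log(1 - probs + eps)).mean()
    co_prediction = np.outer(probs, probs)  # soft co-prediction strengths
    return bce + lam * (penalty * co_prediction).sum()

y_soft = soft_label(ratings)             # -> [0.3, 0.0, 0.2, 0.5]
y_multi = all_inclusive_label(ratings)   # -> [1.0, 0.0, 1.0, 1.0]

# Hypothetical penalty: "happy" rarely co-occurs with "sad" or "angry".
P = np.zeros((4, 4))
P[1, 3] = P[3, 1] = 1.0  # happy <-> sad
P[1, 0] = P[0, 1] = 1.0  # happy <-> angry

probs = np.array([0.6, 0.4, 0.3, 0.7])  # illustrative model outputs
print(y_soft, y_multi, penalized_bce(probs, y_multi, P))
```

Note how the all-inclusive target retains "neutral" and "angry", which a majority or plurality rule would discard; splitting each vote in `soft_label` keeps every annotator's influence equal regardless of how many emotions they selected.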
Community
Huang-Cheng Chou's PhD thesis; ACLCLP Doctoral Dissertation Award (Honorable Mention)
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition (2025)
- From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics (2025)
- EmoQ: Speech Emotion Recognition via Speech-Aware Q-Former and Large Language Model (2025)
- Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model (2025)
- Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles (2025)
- More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition (2025)
- MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition (2025)