Differential Impacts of Monologue and Conversation on Speech Emotion Recognition

Cited: 0
Authors
Chien, Woan-Shiuan [1 ]
Upadhyay, Shreya G. [1 ]
Lin, Wei-Cheng [2 ]
Busso, Carlos [3 ]
Lee, Chi-Chun [1 ]
Affiliations
[1] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30013, Taiwan
[2] Univ Texas Dallas, Erik Jonsson Sch Engn & Computer Sci, Richardson, TX 75080 USA
[3] Carnegie Mellon Univ, Language Technol Inst, Sch Comp Sci, Pittsburgh, PA 15213 USA
Keywords
Emotion recognition; Acoustics; Databases; Oral communication; Training; Speech recognition; Affective computing; Data collection; Recording; Psychology; Monologue; conversation; speech emotion recognition; emotion perception; acoustic variability; VOCAL EXPRESSION; FEATURES; CORPUS; COMMUNICATION; PERFORMANCE; ONLINE
DOI
10.1109/TAFFC.2024.3509138
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The advancement of Speech Emotion Recognition (SER) depends significantly on the quality of the emotional speech corpora used for model training. Researchers in the field have developed various corpora, adjusting design parameters to enhance the reliability of the training source. In this study, we focus on the communication mode of data collection, specifically analyzing spontaneous emotional speech gathered during conversation or monologue. While conversations are acknowledged as effective for eliciting authentic emotional expressions, systematic analyses are needed to confirm their reliability as a better source of emotional speech data. We investigate this question through the perceptual differences and acoustic variability present in both types of emotional speech. Our analyses of multilingual corpora show, first, that raters exhibit higher consistency on conversation recordings when evaluating categorical emotions, and second, that the perceptions and acoustic patterns observed in conversational samples align more closely with the trends reported in the relevant emotion literature. We further examine the impact of these differences on SER modeling and show that a more robust and stable SER model can be trained using conversation data. This work provides comprehensive evidence that conversation may offer a better source than monologue for developing SER models.
Pages: 485-498
Page count: 14