Factors in Emotion Recognition With Deep Learning Models Using Speech and Text on Multiple Corpora

Cited by: 16
Authors
Braunschweiler, Norbert [1 ]
Doddipatla, Rama [1 ]
Keizer, Simon [1 ]
Stoyanchev, Svetlana [1 ]
Affiliations
[1] Toshiba Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England
Keywords
Emotion recognition; Speech recognition; Data models; Bit error rate; Deep learning; Acoustics; Training; Speech processing; Natural language processing; Multimodal
DOI
10.1109/LSP.2022.3151551
CLC Classification Codes
TM [Electrical technology]; TN [Electronic technology, communication technology]
Discipline Classification Codes
0808; 0809
Abstract
Emotion recognition performance of deep learning models is influenced by multiple factors, such as acoustic conditions, textual content, and the style of emotion expression (e.g. acted or natural). In this paper, multiple factors are analysed by training and evaluating state-of-the-art deep learning models using the input modalities speech, text, and their combination across six emotional speech corpora. A novel deep learning model architecture is presented that further improves the state of the art in multimodal emotion recognition with speech and text on the IEMOCAP corpus. Results from models trained on individual corpora show that combining speech and text improves performance only on corpora in which the text of utterances varies across emotions; on corpora with fixed text expressed in different emotions, the combination reduces performance and speech-only models perform better. Furthermore, cross-corpus investigations are presented to assess robustness to changes in acoustic and textual content. Results show that models perform significantly better in matched conditions; in particular, single-corpus models outperform multi-corpus models, although the latter tend to be more robust to acoustic variation, and performance still depends on the characteristics of both the training corpora and the test corpus.
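To make the modelling setup described in the abstract concrete, the following is a minimal late-fusion sketch in PyTorch that combines an utterance-level acoustic feature vector with a pooled text embedding before classification. The feature dimensions, module names, and concatenation-based fusion are illustrative assumptions and do not reproduce the architecture proposed in the paper.

# Minimal illustrative sketch of speech+text late fusion for emotion recognition
# (assumed design for illustration, not the paper's proposed architecture).
import torch
import torch.nn as nn

class SpeechTextEmotionClassifier(nn.Module):
    def __init__(self, acoustic_dim=988, text_dim=768, hidden_dim=256, num_emotions=4):
        super().__init__()
        # Encoder for an utterance-level acoustic feature vector
        # (e.g. an openSMILE-style functional feature set; 988 dims is an assumption).
        self.speech_encoder = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3))
        # Encoder for a pooled text embedding (e.g. from a pretrained sentence encoder).
        self.text_encoder = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3))
        # Fusion: concatenate the two modality embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions))

    def forward(self, acoustic_feats, text_embeds):
        h_speech = self.speech_encoder(acoustic_feats)
        h_text = self.text_encoder(text_embeds)
        fused = torch.cat([h_speech, h_text], dim=-1)
        return self.classifier(fused)

# Example forward pass with random features for a batch of 8 utterances.
model = SpeechTextEmotionClassifier()
logits = model(torch.randn(8, 988), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4]) -> scores for 4 emotion classes

Dropping either encoder yields the corresponding speech-only or text-only baseline, mirroring the single-modality versus combined-modality comparison reported in the abstract.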
Pages: 722-726
Number of pages: 5
Related References
32 references in total
[11] Jalal M. A., Milner R., Hain T., "Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition," in Proc. INTERSPEECH 2020, 2020, pp. 4113-4117.
[12] Kim J. C., in Proc. Int. Conf. on Ubiquitous Robots and Ambient Intelligence (URAI), 2017, p. 39.
[13] Lian Z., Tao J., Liu B., Huang J., "Conversational Emotion Analysis via Attention Mechanisms," in Proc. INTERSPEECH 2019, 2019, pp. 1936-1940.
[14] Lian Z., Liu B., Tao J., "CTNet: Conversational Transformer Network for Emotion Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 985-1000, 2021.
[15] Livingstone S. R., Russo F. A., "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLOS ONE, vol. 13, no. 5, 2018.
[16] Ludwig: a toolbox that allows users to train and test deep learning models, 2019.
[17] Ludwig audio features: audio file transformations and filters, 2019.
[18] Milner R., in Proc. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, p. 304, doi: 10.1109/ASRU46091.2019.9003838.
[19] Pepino L., in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, p. 6484, doi: 10.1109/ICASSP40776.2020.9054709.
[20] Priya K., in Proc. Int. Conf. on Advanced Computing and Communication Systems (ICACCS), 2019, p. 1049, doi: 10.1109/ICACCS.2019.8728458.