Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling

Cited by: 19
Authors
Lin, Wei-Cheng [1 ]
Busso, Carlos [1 ]
Affiliations
[1] University of Texas at Dallas, Erik Jonsson School of Engineering & Computer Science, Richardson, TX 75080 USA
Funding
U.S. National Science Foundation;
Keywords
Sequence-to-one modeling; speech emotion recognition; attention model; chunk-level modeling; CORPUS;
DOI
10.1109/TAFFC.2021.3083821
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
A critical issue of current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling for speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences of varied lengths. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques that deal with variable-length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks of the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extraction and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results on three databases demonstrate that the proposed framework provides: 1) improvements in recognition accuracy, 2) robustness toward predictions over different temporal lengths, and 3) high computational efficiency.
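The dynamic chunking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function name dynamic_chunking, the default values num_chunks=11 and chunk_len=100, and the repeat-padding of utterances shorter than one chunk are illustrative assumptions; only the core idea of a fixed number of fixed-duration chunks whose overlap adapts to the utterance length comes from the abstract.

```python
import numpy as np

def dynamic_chunking(features, num_chunks=11, chunk_len=100):
    """Cut a (T, D) frame-level feature matrix into num_chunks chunks of
    chunk_len frames each; the overlap between consecutive chunks is
    adjusted so the chunks evenly span a sequence of any length T."""
    T, D = features.shape

    # Utterances shorter than one chunk are repeat-padded up to chunk_len
    # (an illustrative convention, not necessarily the paper's choice).
    if T < chunk_len:
        reps = int(np.ceil(chunk_len / T))
        features = np.tile(features, (reps, 1))[:chunk_len]
        T = chunk_len

    # Evenly spaced chunk start frames: the hop between starts (and hence
    # the overlap) grows or shrinks with the utterance length T.
    starts = np.linspace(0, T - chunk_len, num_chunks).astype(int)

    chunks = np.stack([features[s:s + chunk_len] for s in starts])
    return chunks  # shape: (num_chunks, chunk_len, D)

# Example: a 7.3 s utterance of 32-dimensional features at 100 frames/s.
x = np.random.randn(730, 32)
print(dynamic_chunking(x).shape)  # (11, 100, 32)
```

Because the number of chunks and their duration are fixed, the resulting (num_chunks, chunk_len, D) tensor can be fed to any frame-level feature extractor and sentence-level aggregation model without cropping or padding the original utterance.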
Pages: 1215-1227
Page count: 13