Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features

Cited by: 28
Authors
Ben Alex, Starlet [1 ]
Mary, Leena [2 ]
Babu, Ben P. [1 ]
Affiliations
[1] APJ Abdul Kalam Technol Univ, Rajiv Gandhi Inst Technol, Ctr Adv Signal Proc CASP, Kottayam, Kerala, India
[2] Govt Engn Coll, Dept Elect & Commun Engn, Idukki, Kerala, India
Keywords
Automatic emotion recognition (AER); Prosodic features; Syllabification; Attention mechanism; Feature selection; Score-level fusion; SELF-ASSESSED AFFECT; DEEP NEURAL-NETWORK; VOCAL EXPRESSION; CLASSIFICATION; REPRESENTATIONS; SPEAKER; EXTRACTION; LANGUAGE; SHIMMER; JITTER;
DOI
10.1007/s00034-020-01429-3
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
This work attempts to recognize emotions from human speech using prosodic information represented by variations in duration, energy, and fundamental frequency (F0) values. For this, the speech signal is first automatically segmented into syllables. Prosodic features at the utterance level (15 features) and the syllable level (10 features) are extracted using the syllable boundaries and trained separately using deep neural network classifiers. The effectiveness of the proposed approach is demonstrated on the German speech corpus EMOTional Sensitivity ASsistance System (EmotAsS) for people with disabilities, the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. On evaluation, the initial set of prosodic features yields an unweighted average recall (UAR) of 30.15%. A fusion of the decision scores of these features with spectral features gives a UAR of 36.71%. This paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE) to enhance system performance. Implementing attention and feature selection followed by a score-level fusion improves the UAR to 36.83% and 40.96% for prosodic features and overall fusion, respectively. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system yields a UAR of 43.71%, above the best reported test result. The effectiveness of the proposed system has also been demonstrated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database with a UAR of 63.83%.
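As a rough illustration of two components mentioned in the abstract, resampling-based RFE feature selection and score-level fusion of separately trained classifiers, the following Python sketch uses scikit-learn. The feature dimensions, random data, classifiers, and fusion weight are placeholders for illustration only, not the paper's actual configuration or results.

```python
# Minimal sketch (assumed setup): cross-validated RFE on prosodic features,
# then score-level fusion of a prosodic and a spectral classifier.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_prosodic = rng.normal(size=(200, 15))   # placeholder: 15 utterance-level prosodic features
X_spectral = rng.normal(size=(200, 40))   # placeholder: spectral feature vector per utterance
y = rng.integers(0, 4, size=200)          # placeholder labels for four emotion classes

# Recursive feature elimination with cross-validated resampling folds
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=StratifiedKFold(5))
X_prosodic_sel = selector.fit_transform(X_prosodic, y)

# Separate classifiers for the prosodic and spectral streams
clf_pros = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0).fit(X_prosodic_sel, y)
clf_spec = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0).fit(X_spectral, y)

# Score-level fusion: weighted sum of class posteriors from both streams
w = 0.5  # fusion weight; in practice tuned on a development set
scores = (w * clf_pros.predict_proba(X_prosodic_sel)
          + (1 - w) * clf_spec.predict_proba(X_spectral))
y_pred = scores.argmax(axis=1)  # predicted emotion class per utterance
```

The sketch trains and scores on the same data purely to keep the example short; a faithful evaluation would use held-out test utterances and report unweighted average recall.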
Pages: 5681-5709 (29 pages)