REPRESENTATION LEARNING WITH SPECTRO-TEMPORAL-CHANNEL ATTENTION FOR SPEECH EMOTION RECOGNITION

Cited by: 33
Authors
Guo, Lili [1 ]
Wang, Longbiao [1 ]
Xu, Chenglin [3 ]
Dang, Jianwu [1 ,4 ,5 ]
Chng, Eng Siong [2 ]
Li, Haizhou [3 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[3] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[4] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan
[5] Pengcheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
speech emotion recognition; spectro-temporal attention; channel attention; representation learning;
DOI
10.1109/ICASSP39728.2021.9414006
CLC classification number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Convolutional neural networks (CNNs) have proven effective at learning representations for speech emotion recognition. However, a CNN does not explicitly model the associations or the relative importance of features along the spectral, temporal, and channel axes. In this paper, we propose an attention module, named the spectro-temporal-channel (STC) attention module, that is integrated with a CNN to improve its representation learning ability. The module infers attention maps along three dimensions: time, frequency, and CNN channel. Experiments are conducted on the IEMOCAP database to evaluate the effectiveness of the proposed representation learning method. The results demonstrate that the proposed method outperforms the traditional CNN method by an absolute 3.13% in F1 score.
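To make the idea concrete, the sketch below applies attention along the three axes of a CNN feature map (channel, time, frequency). This is a minimal illustrative simplification, not the paper's exact architecture: the specific pooling and gating operations (global mean pooling plus a sigmoid per axis) are assumptions for the sake of a runnable example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stc_attention(x):
    """Apply spectro-temporal-channel-style attention to a feature map.

    x: CNN feature map of shape (C, T, F) -- channels, time frames,
    frequency bins. Returns a refined map of the same shape. The
    pooling/gating scheme here is an illustrative stand-in for the
    paper's STC module, not its exact design.
    """
    # Channel attention: squeeze time and frequency, gate each channel.
    a_c = sigmoid(x.mean(axis=(1, 2)))            # shape (C,)
    # Temporal attention: squeeze channel and frequency, gate each frame.
    a_t = sigmoid(x.mean(axis=(0, 2)))            # shape (T,)
    # Spectral attention: squeeze channel and time, gate each frequency bin.
    a_f = sigmoid(x.mean(axis=(0, 1)))            # shape (F,)
    # Broadcast the three attention maps over the feature map.
    return x * a_c[:, None, None] * a_t[None, :, None] * a_f[None, None, :]

# Example: 8 CNN channels, 100 frames, 40 mel-frequency bins.
feat = np.random.randn(8, 100, 40)
out = stc_attention(feat)
assert out.shape == feat.shape
```

Because each sigmoid gate lies in (0, 1), the module rescales rather than replaces features, so it can be inserted between CNN layers without changing tensor shapes.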
Pages: 6304-6308
Number of pages: 5