Speech emotion recognition based on hierarchical attributes using feature nets

Cited by: 15
Authors
Zhao, Huijuan [1 ,2 ]
Ye, Ning [3 ]
Wang, Ruchuan [3 ,4 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Internet Things, Nanjing, Peoples R China
[2] Nanjing Inst Ind Technol, Coll Comp & Software, Nanjing, Peoples R China
[3] Nanjing Univ Posts & Telecommun, Coll Comp, Nanjing, Peoples R China
[4] Jiangsu High Technol Res Key Lab Wireless Sensor, Nanjing, Peoples R China
Keywords
Speech emotion recognition; multi-task learning; transfer learning; deep learning; feature nets
DOI
10.1080/17445760.2019.1626854
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Speech emotion recognition is a challenging topic with many important real-life applications, especially in human-computer interaction. Traditional methods follow a pipeline of pre-processing, feature extraction, dimensionality reduction and emotion classification. Previous studies have focussed on emotion recognition based on two different models: the discrete model and the continuous model. In both models, the speaker's age and gender affect speech emotion recognition. Moreover, previous investigations have shown that the dimensional attributes of emotion, such as arousal, valence and dominance, are related to each other. Based on these observations, we propose a new attribute recognition model using Feature Nets, which aims to improve emotion recognition performance and generalisation capability. The method uses the corpus to train age and gender classification models, which are then transferred to the main model: a hierarchical deep learning model that uses age and gender as high-level attributes of emotion. Experiments on the public databases EMO-DB and IEMOCAP were conducted to evaluate the performance on both the classification task and the regression task. Experimental results show that the proposed approach based on attribute transfer improves recognition accuracy, regardless of whether age or gender is transferred.
Pages: 354-364
Number of pages: 11
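
To make the attribute-transfer idea in the abstract concrete, the following is a minimal PyTorch sketch of a hierarchical attribute-to-emotion model. It is an illustration only, not the authors' Feature Nets architecture: the choice of gender as the single transferred attribute, the pre-extracted utterance-level feature vector, the layer sizes, and the fusion by concatenation are all assumptions.

```python
import torch
import torch.nn as nn

FEAT_DIM = 88       # assumed: utterance-level acoustic functionals (e.g. eGeMAPS)
NUM_EMOTIONS = 4    # assumed: e.g. angry / happy / neutral / sad

class AttributeNet(nn.Module):
    """Auxiliary net trained first to predict a speaker attribute (here, gender)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(FEAT_DIM, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, 2)  # male / female

    def forward(self, x):
        h = self.encoder(x)           # high-level attribute embedding
        return self.head(h), h

class EmotionNet(nn.Module):
    """Main net: fuses acoustic features with the transferred attribute embedding."""
    def __init__(self, attribute_net):
        super().__init__()
        self.attribute_net = attribute_net
        for p in self.attribute_net.parameters():
            p.requires_grad = False   # freeze the transferred attribute model
        self.classifier = nn.Sequential(
            nn.Linear(FEAT_DIM + 64, 128), nn.ReLU(),
            nn.Linear(128, NUM_EMOTIONS))

    def forward(self, x):
        _, attr_h = self.attribute_net(x)
        return self.classifier(torch.cat([x, attr_h], dim=-1))

# Usage: (1) pre-train AttributeNet on gender labels, (2) transfer it into
# EmotionNet, (3) train EmotionNet on emotion labels (cross-entropy for the
# discrete model; MSE on arousal/valence/dominance for the continuous model).
model = EmotionNet(AttributeNet())
logits = model(torch.randn(8, FEAT_DIM))  # a batch of 8 utterances
print(logits.shape)                       # torch.Size([8, 4])
```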