Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Cited by: 21
Authors
Zhang, Hua [1 ,2 ]
Gou, Ruoyun [1 ]
Shang, Jili [1 ]
Shen, Fangyao [1 ]
Wu, Yifan [1 ,3 ]
Dai, Guojun [1 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Zhejiang Univ, Key Lab Network Multimedia Technol Zhejiang Prov, Hangzhou, Peoples R China
[3] Hangzhou Dianzi Univ, Key Lab Brain Machine Collaborat Intelligence Zhe, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; deep convolutional neural network; attention mechanism; long short-term memory; deep neural network; features
DOI
10.3389/fphys.2021.643202
Chinese Library Classification
Q4 [Physiology];
Discipline Code
071003;
Abstract
Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance between different speakers. SER performance depends heavily on the features extracted from the speech signal, and establishing an effective feature-extraction and classification model remains challenging. In this paper, we propose a new SER method based on a Deep Convolutional Neural Network (DCNN) and a Bidirectional Long Short-Term Memory network with Attention (BLSTMwA), denoted DCNN-BLSTMwA. We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. A DCNN pre-trained on the ImageNet dataset then generates segment-level features, which are stacked over each sentence into utterance-level features. Next, a BLSTM learns high-level emotional features for temporal summarization, followed by an attention layer that focuses on the emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which outperforms most popular SER methods and demonstrates the effectiveness of the proposed method.
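As a brief illustration of the front end described in the abstract, the following Python sketch builds the three-channel log Mel-spectrogram input (static, delta, and delta-delta). The sampling rate, the 64 Mel bands, and the 25 ms/10 ms framing are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the three-channel log Mel-spectrogram input
# (static, delta, delta-delta). Parameter values are assumptions.
import numpy as np
import librosa

def three_channel_logmel(wav_path, sr=16000, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # static channel
    delta = librosa.feature.delta(log_mel)             # first derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second derivative
    return np.stack([log_mel, delta, delta2], axis=0)  # (3, n_mels, frames)
```

Fixed-length segments cut from this (3, n_mels, frames) array would then be fed to the ImageNet-pretrained DCNN to produce segment-level features.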
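The back end (BLSTM, attention layer, DNN classifier) might look like the PyTorch sketch below. The 4096-dimensional segment features (e.g., from an AlexNet-style ImageNet backbone with its final classification layer removed), the hidden size, and the single-layer attention scoring are all assumptions for illustration, not the authors' exact architecture.

```python
# Hedged sketch of the BLSTM-with-attention stage: segment-level DCNN
# features for one utterance are summarized over time by a bidirectional
# LSTM, weighted by a learned attention layer, and classified by a small
# DNN. Layer sizes and the attention form are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMwA(nn.Module):
    def __init__(self, feat_dim=4096, hidden=128, n_classes=7):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # scores each time step
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, n_classes))

    def forward(self, x):                      # x: (batch, T, feat_dim)
        h, _ = self.blstm(x)                   # (batch, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1) # attention weights over T
        utt = (w * h).sum(dim=1)               # weighted utterance vector
        return self.dnn(utt)                   # emotion logits

# Usage: logits = BLSTMwA()(torch.randn(2, 10, 4096))
```

The attention-weighted sum collapses the variable-length sequence of segment features into a single utterance vector, so emotionally salient segments contribute more to the final DNN prediction.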
Pages: 13