An Ensemble Model for Multi-Level Speech Emotion Recognition

Cited by: 27
Authors
Zheng, Chunjun [1 ,2 ]
Wang, Chunli [1 ]
Jia, Ning [2 ]
Affiliations
[1] Dalian Maritime Univ, Coll Informat Sci & Technol, Dalian 116026, Peoples R China
[2] Dalian Neusoft Univ Informat, Sch Comp & Software, Dalian 116023, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
ensemble learning; multi-level technology; acoustic features; speech emotion recognition; deep learning model;
DOI
10.3390/app10010205
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
Speech emotion recognition is a challenging and widely studied topic in speech processing. Existing models achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Since the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Because emotional expression correlates with the global features, local features, and model design of speech, a universal solution for effective speech emotion recognition is difficult to find. The main purpose of this paper is therefore to generate general emotion features from speech signals from different angles and to use an ensemble learning model to perform the emotion recognition task. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction from local signals; expert 2 focuses on extracting comprehensive information from local data; and expert 3 emphasizes global features: acoustic low-level descriptors (LLDs), high-level statistics functionals (HSFs), and local features together with their temporal relationships. A single- or multi-level deep learning model matching each expert's characteristics is designed, including a convolutional neural network (CNN), bi-directional long short-term memory (BLSTM), and a gated recurrent unit (GRU). A convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) An ensemble learning model is designed so that each expert can play to its own strengths and evaluate speech emotion from a different focus.
(3) Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus compare the performance of the individual experts and the ensemble learning model in emotion recognition and verify the validity of the proposed model.
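The ensemble step described in aspect (2) can be illustrated with a minimal sketch. The paper does not specify its fusion rule, so the function below assumes a simple weighted soft-voting scheme over per-expert class posteriors; the `ensemble_predict` function, the expert probability values, and the four-class IEMOCAP label set used here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative four-class IEMOCAP label set.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def ensemble_predict(expert_probs, weights=None):
    """Fuse per-expert class posteriors by weighted soft voting.

    expert_probs: (n_experts, n_classes) array-like of posteriors.
    weights: optional per-expert weights; defaults to uniform.
    """
    probs = np.asarray(expert_probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    fused = weights @ probs / weights.sum()  # weighted mean posterior
    return EMOTIONS[int(np.argmax(fused))], fused

# Hypothetical outputs from the three experts described in the abstract.
expert_probs = [
    [0.10, 0.60, 0.20, 0.10],  # expert 1: local 3-D features
    [0.05, 0.55, 0.30, 0.10],  # expert 2: comprehensive local information
    [0.20, 0.30, 0.40, 0.10],  # expert 3: global LLD/HSF features
]
label, fused = ensemble_predict(expert_probs)
print(label)  # two of three experts favor "happy", so the fused vote does too
```

Non-uniform weights could reflect each expert's validation accuracy, letting a stronger expert contribute more to the fused posterior.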
Pages: 20