An Ensemble Model for Multi-Level Speech Emotion Recognition

Cited by: 27
Authors
Zheng, Chunjun [1 ,2 ]
Wang, Chunli [1 ]
Jia, Ning [2 ]
Affiliations
[1] Dalian Maritime Univ, Coll Informat Sci & Technol, Dalian 116026, Peoples R China
[2] Dalian Neusoft Univ Informat, Sch Comp & Software, Dalian 116023, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
ensemble learning; multi-level technology; acoustic features; speech emotion recognition; deep learning model;
DOI
10.3390/app10010205
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
Speech emotion recognition is a challenging and widely studied topic in speech processing. Existing models achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Since the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Because emotional expression correlates with the global features, local features, and model design of speech, a universal solution for effective speech emotion recognition is difficult to find. The main purpose of this paper is therefore to generate general emotion features from speech signals from different angles and to use an ensemble learning model to perform the emotion recognition task. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction from local signals; expert 2 focuses on extracting comprehensive information from local data; and expert 3 emphasizes global features: acoustic low-level descriptors (LLDs), high-level statistics functionals (HSFs), and local features together with their temporal relationships. A single- or multi-level deep learning model matching each expert's characteristics is designed, including a convolutional neural network (CNN), bi-directional long short-term memory (BLSTM), and a gated recurrent unit (GRU). A convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) An ensemble learning model is designed so that each expert can play to its own strengths and evaluate speech emotion from a different focus.
(3) Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus compare the performance of the individual experts and the ensemble learning model in emotion recognition and verify the validity of the proposed model.
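The ensemble step described in aspect (2) can be illustrated with a minimal sketch. The paper does not specify its fusion rule, so the function below assumes a simple weighted soft-voting scheme over per-expert class posteriors; the `ensemble_predict` function, the expert probability values, and the four-class IEMOCAP label set used here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative four-class IEMOCAP label set.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def ensemble_predict(expert_probs, weights=None):
    """Fuse per-expert class posteriors by weighted soft voting.

    expert_probs: (n_experts, n_classes) array-like of posteriors.
    weights: optional per-expert weights; defaults to uniform.
    """
    probs = np.asarray(expert_probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    fused = weights @ probs / weights.sum()  # weighted mean posterior
    return EMOTIONS[int(np.argmax(fused))], fused

# Hypothetical outputs from the three experts described in the abstract.
expert_probs = [
    [0.10, 0.60, 0.20, 0.10],  # expert 1: local 3-D features
    [0.05, 0.55, 0.30, 0.10],  # expert 2: comprehensive local information
    [0.20, 0.30, 0.40, 0.10],  # expert 3: global LLD/HSF features
]
label, fused = ensemble_predict(expert_probs)
print(label)  # two of three experts favor "happy", so the fused vote does too
```

Non-uniform weights could reflect each expert's validation accuracy, letting a stronger expert contribute more to the fused posterior.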
Pages: 20