Two-level discriminative speech emotion recognition model with wave field dynamics: A personalized speech emotion recognition method

Cited: 3
Authors
Jia, Ning [1 ]
Zheng, Chunjun [1 ]
Affiliations
[1] Dalian Neusoft University of Information, School of Software, Dalian, People's Republic of China
Keywords
Speech emotion recognition; Speaker classification; Wave field dynamics; Cross medium; Convolutional recurrent neural network; Two-level discriminative model;
DOI
10.1016/j.comcom.2021.09.013
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Presently available speech emotion recognition (SER) methods generally rely on a single SER model. Achieving higher SER accuracy depends on both the speech feature extraction method and the model design scheme. However, the generalization performance of such models is typically poor because the emotional features of different speakers can vary substantially. The present work addresses this issue by applying a two-level discriminative model to the SER task. The first level assigns an individual speaker to a specific speaker group according to the speaker's characteristics. The second level constructs a personalized SER model for each group of speakers using the wave field dynamics model and a dual-channel general SER model. The outputs of the two levels are fused in an ensemble learning scheme to achieve effective SER classification. The proposed method is demonstrated to provide higher SER accuracy in experiments based on the interactive emotional dyadic motion capture (IEMOCAP) corpus and a custom-built SER corpus. On the IEMOCAP corpus, the proposed model improves recognition accuracy by 7%. On the custom-built SER corpus, both masked and unmasked speakers are employed to demonstrate that the proposed method maintains higher SER accuracy.
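The abstract describes the architecture only at a high level. As a rough illustration, the following Python sketch shows one plausible reading of the two-level pipeline: a first-level speaker-group classifier, second-level per-group personalized SER models combined with a dual-channel general SER model, and an ensemble fusion of their predictions. The class name TwoLevelSER, the equal-weight averaging, and the emotion label set are illustrative assumptions rather than the authors' implementation.

# Hypothetical sketch of the two-level discriminative pipeline described above.
# Level 1 assigns a speaker to a speaker group; level 2 applies that group's
# personalized SER model together with a general SER model, and the two
# predictions are fused as an ensemble. All names, the 0.5/0.5 fusion weights,
# and the label set are illustrative assumptions, not the authors' code.
import numpy as np


class TwoLevelSER:
    def __init__(self, group_classifier, group_models, general_model,
                 labels=("angry", "happy", "neutral", "sad")):
        self.group_classifier = group_classifier  # level 1: features -> group id
        self.group_models = group_models          # level 2: group id -> personalized SER model
        self.general_model = general_model        # dual-channel general SER model
        self.labels = labels

    def predict_proba(self, features):
        """Return fused class probabilities for one utterance's feature vector."""
        x = np.asarray(features).reshape(1, -1)
        group_id = self.group_classifier.predict(x)[0]
        personalized = self.group_models[group_id].predict_proba(x)[0]
        general = self.general_model.predict_proba(x)[0]
        return 0.5 * personalized + 0.5 * general  # simple averaged ensemble fusion

    def predict(self, features):
        """Return the emotion label with the highest fused probability."""
        return self.labels[int(np.argmax(self.predict_proba(features)))]

Any classifiers exposing scikit-learn-style predict/predict_proba methods could be plugged into this sketch; in the paper, the second-level components correspond to the wave field dynamics model and the dual-channel convolutional recurrent network named in the keywords.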
Pages: 161-170
Page count: 10
References (36 in total)
  • [1] Anonymous, 1997, Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH 1997). DOI: 10.21437/EUROSPEECH.1997-494
  • [2] Burkhardt, F., 2005, Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH 2005), p. 1517. DOI: 10.21437/INTERSPEECH.2005-446
  • [3] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., Narayanan, S. S., IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
  • [4] Chernykh, V., 2017, Emotion Recognition.
  • [5] Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J. G., Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 2001, 18(1): 32-80.
  • [6] Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., Andre, E., Busso, C., Devillers, L. Y., Epps, J., Laukka, P., Narayanan, S. S., Truong, K. P., The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190-202.
  • [7] Graves, A., 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 6645. DOI: 10.1109/ICASSP.2013.6638947
  • [8] Grimm, M., Kroschel, K., Narayanan, S., The Vera am Mittag German audio-visual emotional speech database. 2008 IEEE International Conference on Multimedia and Expo, Vols. 1-4, 2008: 865+.
  • [9] Harrag, A., 2005, INDICON 2005 Proceedings, p. 237.
  • [10] Huang, J., Li, Y., Tao, J., Lian, Z., Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function. 19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2018), 2018: 3673-3677.