DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features

Cited by: 44
Authors
Fahad, Md. Shah [1]
Deepak, Akshay [1]
Pradhan, Gayadhar [2]
Yadav, Jainath [3]
Affiliations
[1] Natl Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Natl Inst Technol Patna, Dept Elect & Commun, Patna, Bihar, India
[3] Cent Univ South Bihar, Dept Comp Sci, Gaya, India
Keywords
Emotion recognition; Epoch-based features; Deep neural network (DNN); Gaussian mixture model (GMM); Hidden Markov model (HMM); Speaker-adaptive training (SAT); Zero-time windowing (ZTW); DEEP NEURAL-NETWORK; SPEECH; TEXT
DOI
10.1007/s00034-020-01486-8
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract
Speech emotion recognition (SER) systems are often evaluated in a speaker-independent manner. However, differences between the acoustic features of the speakers used for training and those used for evaluation cause a significant drop in accuracy during evaluation. While speaker-adaptive techniques have been used for speech recognition, to the best of our knowledge they have not been employed for emotion recognition. Motivated by this, a speaker-adaptive DNN-HMM-based SER system is proposed in this paper. The feature-space maximum likelihood linear regression (fMLLR) technique is used for speaker adaptation during both the training and testing phases. The proposed system uses MFCC and epoch-based features. We exploit our earlier work on robust detection of epochs from emotional speech to obtain emotion-specific epoch-based features, namely instantaneous pitch, phase, and strength of excitation. The combined feature set improves on the MFCC features, which have been the baseline for SER systems in the literature, by 5.07% and on the state-of-the-art techniques by 7.13%. Using only the MFCC features, the proposed model improves on the state-of-the-art techniques by 2.06%. These results bring out the importance of speaker adaptation for SER systems and highlight the complementary nature of the MFCC and epoch-based features for speech-based emotion recognition. All experiments were carried out on the IEMOCAP emotional dataset.
Pages: 466-489
Page count: 24
Source: Circuits, Systems, and Signal Processing, 2021, 40: 466-489