MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach

Cited by: 133
Authors
Mustaqeem [1]
Kwon, Soonil [1]
Affiliations
[1] Sejong Univ, Dept Software, Interact Technol Lab, Seoul 05006, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Affective computing; Dilated convolutional neural network; Real-time speech emotion recognition; Parallel learning; Multi-learning trick (MLT); Raw audio clips; Convolutional neural network; Recurrent; Features
DOI
10.1016/j.eswa.2020.114177
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Speech is the most dominant mode of communication among humans, and it is an efficient channel for exchanging information in human-computer interaction (HCI). Speech emotion recognition (SER) is an active research area that plays a crucial role in real-time applications, yet existing SER systems lack real-time speech processing capability. To address this problem, we propose an end-to-end real-time SER model based on a one-dimensional dilated convolutional neural network (DCNN). Our model uses a multi-learning strategy to extract salient spatial emotional features and learn long-term contextual dependencies from the speech signals in parallel. We use a residual blocks with skip connection (RBSC) module to find correlations among emotional cues, and a sequence learning (Seq_L) module to learn long-term contextual dependencies in the input features. Furthermore, a fusion layer concatenates these learned features for the final emotion recognition task. Our model structure is quite simple, and it automatically learns salient discriminative features from the speech signals. We evaluated the model on the benchmark IEMOCAP and EMO-DB datasets and obtained high recognition accuracies of 73% and 90%, respectively. The experimental results indicate that the proposed model is both significant and efficient, and that it lends itself to the implementation of a real-time SER system. Hence, our model can process raw speech signals for emotion recognition using a lightweight dilated CNN architecture that implements the multi-learning trick (MLT) approach.
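The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of how such a two-branch design could look, not the authors' actual implementation. Everything specific here is an assumption for illustration: the channel width, the dilation rates (1, 2, 4, 8), the use of a GRU to stand in for the Seq_L branch, global average pooling, and the class count are all hypothetical choices.

import torch
import torch.nn as nn

class RBSC(nn.Module):
    # Residual block with skip connection (RBSC) built from 1D dilated convolutions.
    # Hyperparameters are illustrative, not taken from the paper.
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)  # identity skip connection

class MLTDNetSketch(nn.Module):
    # Hypothetical two-branch layout: a dilated-CNN branch (RBSC stack) and a
    # sequence-learning branch, fused by concatenation before the classifier.
    def __init__(self, n_classes=4, channels=64):
        super().__init__()
        self.stem = nn.Conv1d(1, channels, kernel_size=7, stride=2, padding=3)
        self.rbsc = nn.Sequential(*[RBSC(channels, d) for d in (1, 2, 4, 8)])
        self.seq_l = nn.GRU(channels, channels, batch_first=True,
                            bidirectional=True)  # stand-in for the Seq_L module
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(channels + 2 * channels, n_classes)

    def forward(self, wav):                              # wav: (batch, 1, samples) raw audio
        x = self.stem(wav)                               # (batch, C, T)
        spatial = self.pool(self.rbsc(x)).squeeze(-1)    # (batch, C)
        seq_out, _ = self.seq_l(x.transpose(1, 2))       # (batch, T, 2C)
        temporal = seq_out.mean(dim=1)                   # (batch, 2C)
        fused = torch.cat([spatial, temporal], dim=1)    # fusion by concatenation
        return self.classifier(fused)

logits = MLTDNetSketch()(torch.randn(2, 1, 16000))       # e.g. 1 s of 16 kHz audio

In this sketch, the stacked dilated blocks grow the receptive field to capture local emotional cues at multiple scales, while the recurrent branch summarizes long-range context over the same features; concatenating the two branch outputs before the classifier mirrors the fusion layer described in the abstract.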
Pages: 12