Speech Emotion Classification with Parallel Architecture of Deep Learning and Multi-Head Attention Transformer

Times Cited: 0
Authors
Nguyen, An Hoang [1 ,2 ]
Trang, Kien [1 ,2 ]
Thao, Nguyen Gia Minh [3 ]
Vuong, Bao Quoc [1 ,2 ]
Ton-That, Long [1 ,2 ]
Affiliations
[1] Int Univ, Sch Elect Engn, Ho Chi Minh City, Vietnam
[2] Vietnam Natl Univ Ho Chi Minh City, Ho Chi Minh City, Vietnam
[3] Toyota Technol Inst, Grad Sch Engn, Electromagnet Energy Syst Lab, Toyota, Japan
Source
2023 62ND ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS, SICE | 2023
Keywords
Speech emotion recognition; parallel deep learning; Mel-Spectrogram; Multi-head attention; Transformer; RECOGNITION; SPECTROGRAM;
DOI
10.23919/SICE59929.2023.10354088
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Speech is the most direct and efficient method of human communication, and it carries a great deal of information about the speaker's emotional state. The ability to recognize and distinguish the emotions expressed in spoken sentences is a necessary component of intelligent human-computer interaction (HCI) applications. To enable more natural and intuitive communication between humans and automation control systems, the emotional expressions conveyed in speech signals need to be recognized and processed accordingly. In this paper, the authors propose a parallel deep learning architecture that combines SENet, a CNN block, and a Transformer with multi-head attention to effectively distinguish the features of different emotional states in user voice recordings. Speech samples from the open-source RAVDESS dataset were used to assess the performance of the trained model. The proposed model achieved a best average accuracy of 82.67% on the test set.
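The multi-head attention mechanism at the core of the Transformer branch can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection weights are random stand-ins for learned parameters, and the input dimensions (10 frames, 16 features, 4 heads) are arbitrary choices for demonstration only.

```python
import numpy as np

def multi_head_attention(x, num_heads, rng=np.random.default_rng(0)):
    """Illustrative scaled dot-product multi-head self-attention.

    x: (seq_len, d_model) feature frames, e.g. Mel-spectrogram frames
    projected to d_model dimensions. Weights are random here; a real
    model would learn them during training.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Random query/key/value/output projections stand in for learned ones.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    def split_heads(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax per query

    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate heads and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ Wo
    return out, weights

frames = np.random.default_rng(1).standard_normal((10, 16))  # 10 frames, 16 dims
out, attn = multi_head_attention(frames, num_heads=4)
print(out.shape, attn.shape)  # (10, 16) (4, 10, 10)
```

Each head attends over all time frames with its own projection, so different heads can focus on different temporal patterns in the spectrogram; the per-head outputs are concatenated and linearly mixed by the output projection.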
Pages: 1549-1554
Number of Pages: 6