Speech Emotion Classification with Parallel Architecture of Deep Learning and Multi-Head Attention Transformer

Cited: 0
Authors
Nguyen, An Hoang [1 ,2 ]
Trang, Kien [1 ,2 ]
Thao, Nguyen Gia Minh [3 ]
Vuong, Bao Quoc [1 ,2 ]
Ton-That, Long [1 ,2 ]
Affiliations
[1] Int Univ, Sch Elect Engn, Ho Chi Minh City, Vietnam
[2] Vietnam Natl Univ Ho Chi Minh City, Ho Chi Minh City, Vietnam
[3] Toyota Technol Inst, Grad Sch Engn, Electromagnet Energy Syst Lab, Toyota, Japan
Source
2023 62ND ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS, SICE | 2023
Keywords
Speech emotion recognition; parallel deep learning; Mel-Spectrogram; Multi-head attention; Transformer; RECOGNITION; SPECTROGRAM;
DOI
10.23919/SICE59929.2023.10354088
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Speech is the most direct and efficient medium of human communication, and it carries a great deal of information about the speaker's feelings. The ability to recognize and distinguish the emotions expressed in spoken sentences is a necessary component of intelligent human-computer interaction (HCI) applications. To create a more natural and intuitive way for humans to communicate with automation and control systems, the emotional expressions conveyed through speech signals need to be recognized and processed accordingly. In this paper, the authors propose a parallel deep learning architecture that combines an SENet-augmented CNN block with a Transformer using multi-head attention to effectively distinguish the features of different emotional states in users' voice recordings. Speech samples from the open-source RAVDESS dataset were used to assess the performance of the trained model. The proposed model achieved a best average accuracy of 82.67% on the test set.
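The parallel architecture the abstract describes (a SENet-recalibrated CNN branch alongside a multi-head-attention Transformer branch over a Mel-spectrogram, fused for classification) can be sketched as below. All layer sizes, head counts, the number of encoder layers, and the concatenation-based fusion are illustrative assumptions, not the authors' exact configuration:

```python
# Hedged sketch of a parallel SENet/CNN + Transformer model for speech emotion
# recognition. Hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise feature recalibration (Hu et al., 2018)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each channel


class ParallelSER(nn.Module):
    """Two parallel branches over a Mel-spectrogram, fused for 8-way
    classification (RAVDESS defines 8 emotion classes)."""
    def __init__(self, n_mels=64, n_classes=8, d_model=64):
        super().__init__()
        # Branch 1: CNN block with SE recalibration, global-pooled to a vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), SEBlock(16),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), SEBlock(32),
            nn.AdaptiveAvgPool2d(1))
        # Branch 2: Transformer encoder with multi-head self-attention
        # over the time axis of the spectrogram.
        self.proj = nn.Linear(n_mels, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=4,
                                         dim_feedforward=128,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                               # mel: (batch, n_mels, time)
        c = self.cnn(mel.unsqueeze(1)).flatten(1)         # (batch, 32)
        t = self.transformer(self.proj(mel.transpose(1, 2))).mean(1)  # (batch, d_model)
        return self.head(torch.cat([c, t], dim=1))        # (batch, n_classes)


model = ParallelSER()
logits = model(torch.randn(2, 64, 100))  # 2 clips, 64 Mel bins, 100 frames
print(logits.shape)  # torch.Size([2, 8])
```

In this sketch the two branches are fused by simple concatenation of pooled features; the paper may use a different fusion scheme.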
Pages: 1549-1554
Number of pages: 6
Related papers
21 references in total
[1] Aldeneh Z, 2017, INT CONF ACOUST SPEE, P2741, DOI 10.1109/ICASSP.2017.7952655
[2] Chen, Shouyan; Zhang, Mingyan; Yang, Xiaofen; Zhao, Zhijia; Zou, Tao; Sun, Xinqi. The Impact of Attention Mechanisms on Speech Emotion Recognition [J]. SENSORS, 2021, 21 (22)
[3] Cowie, R; Douglas-Cowie, E; Tsapatsoulis, N; Votsis, G; Kollias, S; Fellenz, W; Taylor, JG. Emotion recognition in human-computer interaction [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2001, 18 (01): 32-80
[4] Cummins, Nicholas; Amiriparian, Shahin; Hagerer, Gerhard; Batliner, Anton; Steidl, Stefan; Schuller, Bjorn W. An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017: 478-484
[5] Deriche, Mohamed; Absa, Ahmed H. Abo. A Two-Stage Hierarchical Bilingual Emotion Recognition System Using a Hidden Markov Model and Neural Networks [J]. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2017, 42 (12): 5231-5249
[6] Fayek, Haytham M.; Lech, Margaret; Cavedon, Lawrence. Evaluating deep learning architectures for Speech Emotion Recognition [J]. NEURAL NETWORKS, 2017, 92: 60-68
[7] Hossain, M. Shamim; Muhammad, Ghulam; Song, Biao; Hassan, Mohammad Mehedi; Alelaiwi, Abdulhameed; Alamri, Atif. Audio-Visual Emotion-Aware Cloud Gaming Framework [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2015, 25 (12): 2105-2118
[8] Hu J, 2018, PROC CVPR IEEE, P7132, DOI [10.1109/TPAMI.2019.2913372, 10.1109/CVPR.2018.00745]
[9] Issa, Dias; Demirci, M. Fatih; Yazici, Adnan. Speech emotion recognition with deep convolutional neural networks [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2020, 59
[10] Koolagudi, Shashidhar G.; Murthy, Y. V. Srinivasa; Bhaskar, Siva P. Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2018, 21 (01): 167-183