ESERNet: Learning spectrogram structure relationship for effective speech emotion recognition with swin transformer in classroom discourse analysis

Cited by: 2
Authors
Liu, Tingting [1 ,2 ]
Wang, Minghong [1 ]
Yang, Bing [2 ]
Liu, Hai [2 ,3 ]
Yi, Shaoxin [3 ]
Affiliations
[1] Univ Hong Kong, Fac Educ, Hong Kong 999077, Peoples R China
[2] Hubei Univ, Sch Educ, Wuhan 430062, Peoples R China
[3] Cent China Normal Univ, Natl Engn Res Ctr E Learning, Wuhan 430079, Peoples R China
Keywords
Speech emotion recognition; Intelligent education; Feature extraction; Swin Transformer; Classroom discourse analysis;
DOI
10.1016/j.neucom.2024.128711
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech emotion recognition (SER) has received increasing attention due to its extensive applications in many fields, especially the analysis of teacher-student dialogue in classroom environments. It can help teachers better understand students' emotions and adjust teaching activities accordingly. However, SER faces several challenges, such as the intrinsic ambiguity of emotions and the difficulty of interpreting emotions from speech in noisy environments. These issues can reduce recognition accuracy by directing attention to less relevant or insignificant features. To address these challenges, this paper presents ESERNet, a Transformer-based model designed to extract crucial clues from speech data by capturing both pivotal cues and long-range relationships in the speech signal. The major contribution of our approach is a two-pathway SER framework. By leveraging the Transformer architecture, ESERNet captures long-range dependencies within speech mel-spectrograms, enabling a refined understanding of the emotional cues embedded in speech signals. Extensive experiments were conducted on the IEMOCAP and EmoDB datasets; the results show that ESERNet achieves state-of-the-art performance in SER and outperforms existing methods by effectively leveraging critical clues and capturing long-range dependencies in speech data. These results highlight the model's effectiveness in addressing the complex challenges of SER tasks.
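The abstract describes a pipeline whose input is the log-mel spectrogram of the raw speech waveform, which the Transformer then processes for long-range structure. As an illustration only (not the authors' code), the following is a minimal NumPy sketch of that spectrogram front end, with hypothetical parameter choices (16 kHz audio, 512-point FFT, 256-sample hop, 64 mel bands):

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, window each frame, take the power spectrum,
    # project onto the mel filterbank, then log-compress
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2      # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T      # (frames, n_mels)
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone as a stand-in for a speech clip
sr = 16000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)  # (61, 64): 61 time frames x 64 mel bands
```

The resulting time-frequency matrix is what a vision-style Transformer such as Swin treats as a 2-D image, attending within shifted windows to relate distant time frames and frequency bands.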
Pages: 12
Related Papers
15 total
  • [1] MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers
    Li, Hui
    Li, Jiawen
    Liu, Hai
    Liu, Tingting
    Chen, Qiang
    You, Xinge
    SENSORS, 2024, 24 (17)
  • [2] Experimental Analysis and Selection of Spectrogram Features for Speech Emotion Recognition
    Tang, Gui-Chen
    Liang, Rui-Yu
    Feng, Yue-Qin
    Wang, Qing-Yun
    INTERNATIONAL CONFERENCE ON MECHANICS, BUILDING MATERIAL AND CIVIL ENGINEERING (MBMCE 2015), 2015, : 757 - 762
  • [3] Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition
    Wang, Yuhua
    Shen, Guang
    Xu, Yuezhu
    Li, Jiahang
    Zhao, Zhengdao
    INTERSPEECH 2021, 2021, : 4518 - 4522
  • [4] Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition
    Ozseven, Turgut
    APPLIED ACOUSTICS, 2018, 142 : 70 - 77
  • [5] Effective MLP and CNN based ensemble learning for speech emotion recognition
    Middya, A. I.
    Nag, B.
    Roy, S.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (36): 83963 - 83990
  • [6] On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition
    Mukhamediya, Azamat
    Fazli, Siamac
    Zollanvari, Amin
    IEEE ACCESS, 2023, 11 : 61950 - 61957
  • [7] Speech Emotion Recognition using Feature Selection with Adaptive Structure Learning
    Rayaluru, Akshay
    Bandela, Surekha Reddy
    Kumar, T. Kishore
    2019 IEEE INTERNATIONAL SYMPOSIUM ON SMART ELECTRONIC SYSTEMS (ISES 2019), 2019, : 233 - 236
  • [8] Focus-attention-enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition
    Kim, Keulbit
    Cho, Namhyun
    INTERSPEECH 2023, 2023, : 2673 - 2677
  • [9] Transformer-based transfer learning and multi-task learning for improving the performance of speech emotion recognition
    Park, Sunchan
    Kim, Hyung Soon
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2021, 40 (05): 515 - 522
  • [10] A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms
    Byun, Sung-Woo
    Lee, Seok-Pil
    APPLIED SCIENCES-BASEL, 2021, 11 (04): 1 - 15