MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 3
Authors
Ong, Kah Liang [1 ]
Lee, Chin Poo [1 ]
Lim, Heng Siong [2 ]
Lim, Kian Ming [1 ]
Alqahtani, Ali [3 ,4 ]
Affiliations
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Keywords
Speech recognition; Speech emotion recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Vision transformer; Emo-DB; RAVDESS; IEMOCAP
DOI
10.1109/ACCESS.2024.3360483
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
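The abstract describes a dual-path pipeline: one spectrogram per branch, one Vision Transformer per spectrogram, feature-level fusion, then an MLP classifier. The sketch below illustrates only the fusion-and-classification step with NumPy, using random stand-in vectors in place of the real MaxViT and MViTv2 embeddings; all dimensions, weight initializations, and function names here are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 7   # e.g. the seven Emo-DB emotion categories (assumed)
FEAT_DIM = 512    # assumed embedding size for each backbone

def mlp_head(fused, w1, b1, w2, b2):
    """Two-layer MLP head: ReLU hidden layer followed by a softmax output."""
    h = np.maximum(fused @ w1 + b1, 0.0)   # ReLU activation
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Stand-in embeddings for the two Vision Transformer branches:
# MaxViT on the CQT spectrogram, MViTv2 on the Mel-STFT spectrogram.
maxvit_feat = rng.standard_normal(FEAT_DIM)
mvitv2_feat = rng.standard_normal(FEAT_DIM)

# Feature-level fusion by concatenation, as the abstract describes.
fused = np.concatenate([maxvit_feat, mvitv2_feat])

# Randomly initialized head weights -- training is out of scope here.
w1 = rng.standard_normal((2 * FEAT_DIM, 128)) * 0.01
b1 = np.zeros(128)
w2 = rng.standard_normal((128, NUM_CLASSES)) * 0.01
b2 = np.zeros(NUM_CLASSES)

probs = mlp_head(fused, w1, b1, w2, b2)
predicted_emotion = int(np.argmax(probs))
```

In the actual model the two embeddings would come from pretrained MaxViT and MViTv2 backbones, and the MLP weights would be learned end-to-end; concatenation is the simplest fusion choice consistent with the abstract's wording ("these features are integrated").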
Pages: 18237-18250 (14 pages)
Related Papers
(50 records in total)
  • [31] Cao, Qi; Hou, Mixiao; Chen, Bingzhi; Zhang, Zheng; Lu, Guangming. Hierarchical Network Based on the Fusion of Static and Dynamic Features for Speech Emotion Recognition. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 6334-6338.
  • [32] Xue, Peiyun; Gao, Xiang; Bai, Jing; Dong, Zhenan; Wang, Zhiyu; Xu, Jiangshuai. A dynamic-static feature fusion learning network for speech emotion recognition. Neurocomputing, 2025, 633.
  • [33] Liu, Gang; He, Wei; Jin, Bicheng. Feature Fusion of Speech Emotion Recognition Based on Deep Learning. Proceedings of 2018 International Conference on Network Infrastructure and Digital Content (IEEE IC-NIDC), 2018: 193-197.
  • [34] Palo, Hemanta Kumar; Sagar, Sangeet. Comparison of Neural Network Models for Speech Emotion Recognition. 2nd International Conference on Data Science and Business Analytics (ICDSBA 2018), 2018: 127-131.
  • [35] Shen, Qi; Chen, Guanggen; Chang, Lin. Speech Emotion Recognition Based on Feature Fusion. Proceedings of the 2017 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017), 2017, 123: 1071-1074.
  • [36] Palo, H. K.; Mohanty, Mihir Narayana; Chandra, Mahesh. Use of Different Features for Emotion Recognition Using MLP Network. Computational Vision and Robotics, 2015, 332: 7-15.
  • [37] Ortego-Resa, Carlos; Lopez-Moreno, Ignacio; Ramos, Daniel; Gonzalez-Rodriguez, Joaquin. Anchor Model Fusion for Emotion Recognition in Speech. Biometric ID Management and Multimodal Communication, Proceedings, 2009, 5707: 49-56.
  • [38] Zhao, Zhengdao; Wang, Yuhua; Shen, Guang; Xu, Yuezhu; Zhang, Jiayuan. TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 3771-3782.
  • [39] Shi, Peng. Speech Emotion Recognition Based on Deep Belief Network. 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), 2018.
  • [40] Xu, Mingke; Zhang, Fan; Zhang, Wei. Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset. IEEE Access, 2021, 9: 74539-74549.