MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 3

Authors:

Ong, Kah Liang [1]

Lee, Chin Poo [1]

Lim, Heng Siong [2]

Lim, Kian Ming [1]

Alqahtani, Ali [3, 4]

Affiliations:

[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia

[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia

[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia

[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Speech recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Speech emotion recognition; ensemble learning; spectrogram; vision transformer; Emo-DB; RAVDESS; IEMOCAP;

DOI:

10.1109/ACCESS.2024.3360483

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
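The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two feature vectors are random stand-ins for the MaxViT and MViTv2 outputs, fusion by concatenation is an assumption (the abstract only says the features are "integrated"), and the layer sizes, the class count, and the helper names `init_layer` and `fuse_and_classify` are hypothetical. The weights are untrained, so the output only demonstrates the data flow.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    # Subtract the maximum for numerical stability.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Hypothetical helper: random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(cqt_features, mel_features, hidden, output):
    # ASSUMPTION: the two feature vectors are fused by simple concatenation.
    fused = list(cqt_features) + list(mel_features)
    h = relu(linear(fused, *hidden))
    return softmax(linear(h, *output))


rng = random.Random(0)

# Random stand-ins for the features each backbone would extract:
# CQT spectrogram -> MaxViT, Mel-STFT spectrogram -> MViTv2.
cqt_features = [rng.gauss(0.0, 1.0) for _ in range(8)]
mel_features = [rng.gauss(0.0, 1.0) for _ in range(6)]

num_classes = 7  # illustrative number of emotion classes
hidden = init_layer(rng, len(cqt_features) + len(mel_features), 16)
output = init_layer(rng, 16, num_classes)

probs = fuse_and_classify(cqt_features, mel_features, hidden, output)
predicted = max(range(num_classes), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

In a real pipeline the stand-in vectors would be replaced by the pooled outputs of the two pretrained backbones, and the MLP would be trained on labeled speech data.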

Pages: 18237 - 18250

Page count: 14

50 entries in total

[41] Feature fusion Vision Transformers using MLP-Mixer for enhanced deepfake detection
Essa, Ehab
NEUROCOMPUTING, 2024, 598
[42] A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition
Liu, Yang
Sun, Haoqin
Guan, Wenbo
Xia, Yuqi
Li, Yongwei
Unoki, Masashi
Zhao, Zhen
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1063 - 1074
[43] TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion
Wei, Wei
Zhang, Bingkun
Wang, Yibing
MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 454 - 467
[44] Adaptive Alignment and Time Aggregation Network for Speech-Visual Emotion Recognition
Wu, Lile
Bai, Lei
Cheng, Wenhao
Cheng, Zutian
Chen, Guanghui
IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 1181 - 1185
[45] Speech Emotion Recognition Based on Convolution Neural Network combined with Random Forest
Zheng, Li
Li, Qiao
Ban, Hua
Liu, Shuhua
PROCEEDINGS OF THE 30TH CHINESE CONTROL AND DECISION CONFERENCE (2018 CCDC), 2018, : 4143 - 4147
[46] SPEECH EMOTION RECOGNITION WITH MULTISCALE AREA ATTENTION AND DATA AUGMENTATION
Xu, Mingke
Zhang, Fan
Cui, Xiaodong
Zhang, Wei
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6319 - 6323
[47] Speech emotion recognition based on multi‐feature and multi‐lingual fusion
Chunyi Wang
Ying Ren
Na Zhang
Fuwei Cui
Shiying Luo
Multimedia Tools and Applications, 2022, 81 : 4897 - 4907
[48] Enhanced Speech Emotion Recognition Using the Cognitive Emotion Fusion Network for PTSD Detection with a Novel Hybrid Approach
Suneetha, Chappidi
Anitha, Raju
JOURNAL OF ELECTRICAL SYSTEMS, 2023, 19 (04) : 376 - 398
[49] ANN based Decision Fusion for Speech Emotion Recognition
Xu, Lu
Xu, Mingxing
Yang, Dali
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2003 - +
[50] Speech Emotion Recognition based on Multiple Feature Fusion
Jiang, Changjiang
Mao, Rong
Liu, Geng
Wang, Mingyi
2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 907 - 912
