MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 3

Authors:

Ong, Kah Liang [1]

Lee, Chin Poo [1]

Lim, Heng Siong [2]

Lim, Kian Ming [1]

Alqahtani, Ali [3,4]

Affiliations:

[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia

[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia

[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia

[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Speech recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Speech emotion recognition; ensemble learning; spectrogram; vision transformer; Emo-DB; RAVDESS; IEMOCAP;

DOI:

10.1109/ACCESS.2024.3360483

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
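The fusion step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the feature vectors are random stand-ins for the MaxViT and MViTv2 outputs, "integration" is assumed to be plain concatenation, the MLP has one hidden layer with untrained random weights, and the dimensions (`DIM_CQT`, `DIM_MEL`, `HIDDEN`, `NUM_CLASSES`) are hypothetical.

```python
import math
import random


def fuse(feat_a, feat_b):
    """Join two feature vectors end to end (concatenation is an assumed fusion rule)."""
    return list(feat_a) + list(feat_b)


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_classify(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a softmax output layer."""
    return softmax(dense(relu(dense(x, w1, b1)), w2, b2))


def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM_CQT, DIM_MEL, HIDDEN, NUM_CLASSES = 8, 6, 5, 7  # hypothetical sizes

# Random stand-ins for the features each transformer branch would produce.
feat_cqt = [rng.gauss(0.0, 1.0) for _ in range(DIM_CQT)]  # CQT spectrogram branch
feat_mel = [rng.gauss(0.0, 1.0) for _ in range(DIM_MEL)]  # Mel-STFT branch

fused = fuse(feat_cqt, feat_mel)

# Untrained weights, so the output probabilities carry no meaning.
w1 = random_matrix(rng, HIDDEN, DIM_CQT + DIM_MEL)
b1 = [0.0] * HIDDEN
w2 = random_matrix(rng, NUM_CLASSES, HIDDEN)
b2 = [0.0] * NUM_CLASSES

probs = mlp_classify(fused, w1, b1, w2, b2)
print(len(fused), len(probs))
```

The sketch only shows the data flow: two per-utterance feature vectors become one fused vector, which the MLP maps to a probability per emotion class. In a real system the stand-in vectors would come from the two pretrained transformer backbones and the MLP weights would be learned.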


Pages: 18237-18250

Page count: 14
