MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition
Cited by: 3
Authors: Ong, Kah Liang [1]; Lee, Chin Poo [1]; Lim, Heng Siong [2]; Lim, Kian Ming [1]; Alqahtani, Ali [3,4]
Affiliations:
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Abstract:
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
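The late-fusion step the abstract describes — two backbone feature vectors concatenated and classified by an MLP — can be sketched as follows. This is a minimal NumPy illustration with random stand-ins for the MaxViT and MViTv2 embeddings, not the authors' implementation; all feature dimensions, layer sizes, and the 7-class output are illustrative assumptions not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the two backbone outputs (dimensions are assumptions):
# MaxViT features from the CQT spectrogram, MViTv2 features from the Mel-STFT.
maxvit_feat = rng.standard_normal((1, 512))   # hypothetical MaxViT embedding
mvitv2_feat = rng.standard_normal((1, 768))   # hypothetical MViTv2 embedding

# Late fusion: concatenate the two feature vectors along the channel axis.
fused = np.concatenate([maxvit_feat, mvitv2_feat], axis=-1)  # shape (1, 1280)

# One-hidden-layer MLP classifier over the fused features
# (hidden width and class count are illustrative assumptions).
n_classes = 7
W1 = rng.standard_normal((fused.shape[-1], 256)) * 0.02
b1 = np.zeros(256)
W2 = rng.standard_normal((256, n_classes)) * 0.02
b2 = np.zeros(n_classes)

hidden = np.maximum(fused @ W1 + b1, 0.0)  # ReLU
probs = softmax(hidden @ W2 + b2)          # per-class emotion probabilities

print(fused.shape, probs.shape)
```

In a real pipeline the two random vectors would be replaced by the pooled outputs of pretrained MaxViT and MViTv2 backbones fed with the CQT and Mel-STFT spectrograms, respectively, and the MLP would be trained jointly on the fused representation.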