MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 3
Authors
Ong, Kah Liang [1 ]
Lee, Chin Poo [1 ]
Lim, Heng Siong [2 ]
Lim, Kian Ming [1 ]
Alqahtani, Ali [3 ,4 ]
Affiliations
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Keywords
Speech recognition; Speech emotion recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Vision transformer; Emo-DB; RAVDESS; IEMOCAP
DOI
10.1109/ACCESS.2024.3360483
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
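The abstract describes a dual-path pipeline: one spectrogram per branch, one Vision Transformer per spectrogram, feature-level fusion, then an MLP classifier. The sketch below illustrates only the fusion-and-classification step with NumPy, using random stand-in vectors in place of the real MaxViT and MViTv2 embeddings; all dimensions, weight initializations, and function names here are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 7   # e.g. the seven Emo-DB emotion categories (assumed)
FEAT_DIM = 512    # assumed embedding size for each backbone

def mlp_head(fused, w1, b1, w2, b2):
    """Two-layer MLP head: ReLU hidden layer followed by a softmax output."""
    h = np.maximum(fused @ w1 + b1, 0.0)   # ReLU activation
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Stand-in embeddings for the two Vision Transformer branches:
# MaxViT on the CQT spectrogram, MViTv2 on the Mel-STFT spectrogram.
maxvit_feat = rng.standard_normal(FEAT_DIM)
mvitv2_feat = rng.standard_normal(FEAT_DIM)

# Feature-level fusion by concatenation, as the abstract describes.
fused = np.concatenate([maxvit_feat, mvitv2_feat])

# Randomly initialized head weights -- training is out of scope here.
w1 = rng.standard_normal((2 * FEAT_DIM, 128)) * 0.01
b1 = np.zeros(128)
w2 = rng.standard_normal((128, NUM_CLASSES)) * 0.01
b2 = np.zeros(NUM_CLASSES)

probs = mlp_head(fused, w1, b1, w2, b2)
predicted_emotion = int(np.argmax(probs))
```

In the actual model the two embeddings would come from pretrained MaxViT and MViTv2 backbones, and the MLP weights would be learned end-to-end; concatenation is the simplest fusion choice consistent with the abstract's wording ("these features are integrated").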
Pages: 18237-18250 (14 pages)
Related Papers
(50 records in total)
  • [31] Cao, Qi; Hou, Mixiao; Chen, Bingzhi; Zhang, Zheng; Lu, Guangming. Hierarchical Network Based on the Fusion of Static and Dynamic Features for Speech Emotion Recognition. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 6334-6338.
  • [32] Xue, Peiyun; Gao, Xiang; Bai, Jing; Dong, Zhenan; Wang, Zhiyu; Xu, Jiangshuai. A dynamic-static feature fusion learning network for speech emotion recognition. Neurocomputing, 2025, 633.
  • [33] Liu, Gang; He, Wei; Jin, Bicheng. Feature Fusion of Speech Emotion Recognition Based on Deep Learning. Proceedings of 2018 International Conference on Network Infrastructure and Digital Content (IEEE IC-NIDC), 2018: 193-197.
  • [34] Palo, Hemanta Kumar; Sagar, Sangeet. Comparison of Neural Network Models for Speech Emotion Recognition. 2nd International Conference on Data Science and Business Analytics (ICDSBA 2018), 2018: 127-131.
  • [35] Shen, Qi; Chen, Guanggen; Chang, Lin. Speech Emotion Recognition Based on Feature Fusion. Proceedings of the 2017 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017), 2017, 123: 1071-1074.
  • [36] Palo, H. K.; Mohanty, Mihir Narayana; Chandra, Mahesh. Use of Different Features for Emotion Recognition Using MLP Network. Computational Vision and Robotics, 2015, 332: 7-15.
  • [37] Ortego-Resa, Carlos; Lopez-Moreno, Ignacio; Ramos, Daniel; Gonzalez-Rodriguez, Joaquin. Anchor Model Fusion for Emotion Recognition in Speech. Biometric ID Management and Multimodal Communication, Proceedings, 2009, 5707: 49-56.
  • [38] Zhao, Zhengdao; Wang, Yuhua; Shen, Guang; Xu, Yuezhu; Zhang, Jiayuan. TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 3771-3782.
  • [39] Shi, Peng. Speech Emotion Recognition Based on Deep Belief Network. 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), 2018.
  • [40] Xu, Mingke; Zhang, Fan; Zhang, Wei. Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset. IEEE Access, 2021, 9: 74539-74549.