MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

被引:3
|
作者
Ong, Kah Liang [1 ]
Lee, Chin Poo [1 ]
Lim, Heng Siong [2 ]
Lim, Kian Ming [1 ]
Alqahtani, Ali [3 ,4 ]
机构
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
关键词
Speech recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Speech emotion recognition; ensemble learning; spectrogram; vision transformer; Emo-DB; RAVDESS; IEMOCAP;
D O I
10.1109/ACCESS.2024.3360483
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
引用
收藏
页码:18237 / 18250
页数:14
相关论文
共 50 条
  • [41] Feature fusion Vision Transformers using MLP-Mixer for enhanced deepfake detection
    Essa, Ehab
    NEUROCOMPUTING, 2024, 598
  • [42] A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Li, Yongwei
    Unoki, Masashi
    Zhao, Zhen
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1063 - 1074
  • [43] TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion
    Wei, Wei
    Zhang, Bingkun
    Wang, Yibing
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 454 - 467
  • [44] Adaptive Alignment and Time Aggregation Network for Speech-Visual Emotion Recognition
    Wu, Lile
    Bai, Lei
    Cheng, Wenhao
    Cheng, Zutian
    Chen, Guanghui
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 1181 - 1185
  • [45] Speech Emotion Recognition Based on Convolution Neural Network combined with Random Forest
    Zheng, Li
    Li, Qiao
    Ban, Hua
    Liu, Shuhua
    PROCEEDINGS OF THE 30TH CHINESE CONTROL AND DECISION CONFERENCE (2018 CCDC), 2018, : 4143 - 4147
  • [46] SPEECH EMOTION RECOGNITION WITH MULTISCALE AREA ATTENTION AND DATA AUGMENTATION
    Xu, Mingke
    Zhang, Fan
    Cui, Xiaodong
    Zhang, Wei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6319 - 6323
  • [47] Speech emotion recognition based on multi‐feature and multi‐lingual fusion
    Chunyi Wang
    Ying Ren
    Na Zhang
    Fuwei Cui
    Shiying Luo
    Multimedia Tools and Applications, 2022, 81 : 4897 - 4907
  • [48] Enhanced Speech Emotion Recognition Using the Cognitive Emotion Fusion Network for PTSD Detection with a Novel Hybrid Approach
    Suneetha, Chappidi
    Anitha, Raju
    JOURNAL OF ELECTRICAL SYSTEMS, 2023, 19 (04) : 376 - 398
  • [49] ANN based Decision Fusion for Speech Emotion Recognition
    Xu, Lu
    Xu, Mingxing
    Yang, Dali
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2003 - +
  • [50] Speech Emotion Recognition based on Multiple Feature Fusion
    Jiang, Changjiang
    Mao, Rong
    Liu, Geng
    Wang, Mingyi
    2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 907 - 912