MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 3
Authors
Ong, Kah Liang [1]
Lee, Chin Poo [1]
Lim, Heng Siong [2]
Lim, Kian Ming [1]
Alqahtani, Ali [3, 4]
Affiliations
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Keywords
Speech recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Speech emotion recognition; Vision transformer; Emo-DB; RAVDESS; IEMOCAP
DOI
10.1109/ACCESS.2024.3360483
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Vision Transformers have gained significant attention in computer vision for their architectural design and modeling capabilities. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). Speech signals are first encoded into Constant-Q Transform (CQT) spectrograms and Mel spectrograms computed with the Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is fed into the MaxViT model and the Mel-STFT spectrogram into the MViTv2 model to extract informative features. These features are integrated and passed into a Multilayer Perceptron (MLP) for final classification. The hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron" (MaxMViT-MLP). MaxMViT-MLP achieves an accuracy of 95.28% on the Emo-DB dataset, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, supporting the value of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
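The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two feature vectors are integrated by simple concatenation (the abstract only says "integrated"), uses random vectors as stand-ins for the MaxViT and MViTv2 outputs, and picks hypothetical sizes (8 features per path, 12 hidden units, 7 emotion classes) with untrained random weights. All function names are illustrative.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(cqt_features, mel_features, hidden, output):
    """Concatenate both feature vectors (assumed fusion), then apply an MLP."""
    fused = cqt_features + mel_features  # list concatenation
    h = relu(linear(fused, *hidden))
    return softmax(linear(h, *output))


rng = random.Random(0)
# Stand-ins for features extracted by MaxViT (CQT path) and MViTv2 (Mel-STFT path).
cqt_features = [rng.random() for _ in range(8)]
mel_features = [rng.random() for _ in range(8)]

hidden = init_layer(16, 12, rng)  # 16 = 8 + 8 fused features (hypothetical sizes)
output = init_layer(12, 7, rng)   # 7 emotion classes (hypothetical)

probs = fuse_and_classify(cqt_features, mel_features, hidden, output)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

In a real pipeline the two stand-in vectors would come from the pretrained transformer backbones, and the MLP weights would be learned from labeled speech data.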
Pages: 18237-18250
Page count: 14