Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers

Cited by: 7
Authors
Ong, Kah Liang [1]
Lee, Chin Poo [1]
Lim, Heng Siong [2]
Lim, Kian Ming [1]
Alqahtani, Ali [3, 4]
Affiliations
[1] Multimedia Univ, Fac Informat Sci & Technol, Malacca 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Malacca 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Keywords
Speech; speech emotion; speech emotion recognition; spectrogram; mel spectrogram; mel spectrogram with short-time Fourier transform; vision transformer; improved multiscale vision transformers; Emo-DB; RAVDESS; IEMOCAP
DOI
10.1109/ACCESS.2023.3321122
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in applications such as human-computer interaction, affective computing, and social robotics. Researchers have proposed many approaches to speech emotion recognition using a variety of classifiers and features, but existing methods still have limitations. Some rely on handcrafted features that may not capture the full complexity of the emotional information in speech signals, while others lack robustness and generalize poorly across datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines the Mel spectrogram with Short-Time Fourier Transform (Mel-STFT) and the Improved Multiscale Vision Transformers (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier introduces multi-scale visual modeling with different stages and pooling attention mechanisms. It incorporates relative positional embeddings and a residual pooling connection to effectively model the interactions between tokens in the space-time structure, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well across datasets, achieving an accuracy of 91.51% on Emo-DB, 81.75% on RAVDESS, and 64.03% on IEMOCAP.
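The Mel-STFT representation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame length, hop size, number of mel bands, and the HTK-style mel formula are all assumptions chosen for illustration, and the function names are hypothetical. The sketch frames the signal, takes a windowed FFT per frame (the STFT), and pools the power spectrum with triangular mel filters to obtain a log-mel image that an image classifier such as MViTv2 could consume.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed choice; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_stft(signal, sr, n_fft=512, hop=128, n_mels=64):
    # Illustrative defaults only; the source does not state these values.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (frames, bins)
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T   # (frames, mels)
    return 10.0 * np.log10(mel_power + 1e-10).T               # (mels, frames), dB

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2.0 * np.pi * 1000.0 * t)  # synthetic 1 kHz test tone
    print(mel_stft(tone, sr).shape)
```

Each row of the returned array is a mel band and each column a time frame, so frequency and temporal structure both appear in one 2-D image. In practice a library routine such as a mel spectrogram function from an audio package would likely replace this hand-written version.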
Pages: 108571-108579
Page count: 9