MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition
Cited by: 3
Authors: Ong, Kah Liang [1]; Lee, Chin Poo [1]; Lim, Heng Siong [2]; Lim, Kian Ming [1]; Alqahtani, Ali [3,4]
Affiliations:
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Abstract:
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
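The late-fusion step the abstract describes — two backbone feature vectors concatenated and classified by an MLP — can be sketched as follows. This is a minimal NumPy illustration with random stand-ins for the MaxViT and MViTv2 embeddings, not the authors' implementation; all feature dimensions, layer sizes, and the 7-class output are illustrative assumptions not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the two backbone outputs (dimensions are assumptions):
# MaxViT features from the CQT spectrogram, MViTv2 features from the Mel-STFT.
maxvit_feat = rng.standard_normal((1, 512))   # hypothetical MaxViT embedding
mvitv2_feat = rng.standard_normal((1, 768))   # hypothetical MViTv2 embedding

# Late fusion: concatenate the two feature vectors along the channel axis.
fused = np.concatenate([maxvit_feat, mvitv2_feat], axis=-1)  # shape (1, 1280)

# One-hidden-layer MLP classifier over the fused features
# (hidden width and class count are illustrative assumptions).
n_classes = 7
W1 = rng.standard_normal((fused.shape[-1], 256)) * 0.02
b1 = np.zeros(256)
W2 = rng.standard_normal((256, n_classes)) * 0.02
b2 = np.zeros(n_classes)

hidden = np.maximum(fused @ W1 + b1, 0.0)  # ReLU
probs = softmax(hidden @ W2 + b2)          # per-class emotion probabilities

print(fused.shape, probs.shape)
```

In a real pipeline the two random vectors would be replaced by the pooled outputs of pretrained MaxViT and MViTv2 backbones fed with the CQT and Mel-STFT spectrograms, respectively, and the MLP would be trained jointly on the fused representation.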