Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity while yielding limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature, the mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature-view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
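The abstract only describes the TsPCA module at a high level: two pooling streams produce channel descriptors that discriminatively re-weight the channel dimension of MFCC-derived features. The sketch below is a minimal, hypothetical interpretation of that idea in PyTorch; the class name, layer sizes, and the additive fusion of the two pooled streams are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of a two-stream pooling channel attention block,
# based only on the abstract: average and max pooling over time yield two
# channel descriptors that are fused into per-channel weights.
import torch
import torch.nn as nn

class TwoStreamPoolingChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared bottleneck MLP applied to both pooled descriptors (assumed design).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features derived from MFCC frames.
        avg_desc = x.mean(dim=-1)    # stream 1: average pooling over time
        max_desc = x.amax(dim=-1)    # stream 2: max pooling over time
        # Fuse the two streams and map them to per-channel weights in (0, 1).
        weights = torch.sigmoid(self.mlp(avg_desc) + self.mlp(max_desc))
        return x * weights.unsqueeze(-1)   # re-weight the channel dimension

if __name__ == "__main__":
    feats = torch.randn(8, 64, 300)        # e.g. 64 channels over 300 MFCC frames
    attn = TwoStreamPoolingChannelAttention(channels=64)
    print(attn(feats).shape)               # torch.Size([8, 64, 300])

Under these assumptions, the module leaves the feature shape unchanged and only rescales channels, so it can be dropped between a frozen pre-trained feature extractor and the downstream SER classifier.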
Pages: 10623-10636
Number of pages: 14
Related Papers
50 records in total
  • [21] Speech Emotion Recognition Based on Feature Fusion
    Shen, Qi
    Chen, Guanggen
    Chang, Lin
    PROCEEDINGS OF THE 2017 2ND INTERNATIONAL CONFERENCE ON MATERIALS SCIENCE, MACHINERY AND ENERGY ENGINEERING (MSMEE 2017), 2017, 123 : 1071 - 1074
  • [22] Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files
    Andayani, Felicia
    Theng, Lau Bee
    Tsun, Mark Teekit
    Chua, Caslon
    IEEE ACCESS, 2022, 10 : 36018 - 36027
  • [23] Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech
    Atmaja, Bagus Tris
    Sasou, Akira
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1026 - 1029
  • [24] Strategies for improving low resource speech to text translation relying on pre-trained ASR models
    Kesiraju, Santosh
    Sarvas, Marek
    Pavlicek, Tomas
    Macaire, Cecile
    Ciuba, Alejandro
    INTERSPEECH 2023, 2023, : 2148 - 2152
  • [25] Speech Topic Classification Based on Pre-trained and Graph Networks
    Niu, Fangjing
    Cao, Tengfei
    Hu, Ying
    Huang, Hao
    He, Liang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1721 - 1726
  • [26] LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information
    Cai, Yunrui
    Wu, Zhiyong
    Jia, Jia
    Meng, Helen
    INTERSPEECH 2024, 2024, : 4658 - 4662
  • [27] Pre-Trained Model-Based NFR Classification: Overcoming Limited Data Challenges
    Rahman, Kiramat
    Ghani, Anwar
    Alzahrani, Abdulrahman
    Tariq, Muhammad Usman
    Rahman, Arif Ur
    IEEE ACCESS, 2023, 11 : 81787 - 81802
  • [28] Speech Emotion Recognition via Sparse Learning-Based Fusion Model
    Min, Dong-Jin
    Kim, Deok-Hwan
    IEEE ACCESS, 2024, 12 : 177219 - 177235
  • [29] Speech emotion recognition based on multimodal and multiscale feature fusion
    Hu, Huangshui
    Wei, Jie
    Sun, Hongyu
    Wang, Chuhang
    Tao, Shuo
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [30] Pre-trained Model Based Feature Envy Detection
    Ma, Wenhao
    Yu, Yaoxiang
    Ruan, Xiaoming
    Cai, Bo
    2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 430 - 440