Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, increasing complexity for limited performance benefit. In this paper, we improve pre-trained-model-based multi-view SER from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning the effective features hidden in a low-level speech feature, the mel-frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCCs. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA (weighted and unweighted accuracy) of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
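To make the channel-attention idea in the abstract concrete, the sketch below shows a generic two-stream (average- and max-pooling) channel attention block in PyTorch. This is only a minimal illustration under assumed design choices, not the paper's actual TsPCA module: the tensor layout, the shared bottleneck MLP, the reduction ratio, and the sigmoid gating are standard channel-attention conventions (in the spirit of SE/CBAM), and the random input tensor stands in for CNN feature maps computed over MFCCs.

```python
import torch
import torch.nn as nn


class TwoStreamPoolingChannelAttention(nn.Module):
    """Hypothetical sketch of a two-stream pooling channel attention block.

    Assumes input of shape (batch, channels, time, freq), e.g. feature maps
    produced by a CNN over MFCCs. One stream uses global average pooling and
    the other global max pooling; a shared bottleneck MLP maps each pooled
    vector to per-channel scores, which are summed, squashed with a sigmoid,
    and used to re-weight the channels of the input.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_stream = self.mlp(x.mean(dim=(2, 3)))   # average-pooling stream
        max_stream = self.mlp(x.amax(dim=(2, 3)))   # max-pooling stream
        weights = torch.sigmoid(avg_stream + max_stream).view(b, c, 1, 1)
        return x * weights                          # channel-wise re-weighting


if __name__ == "__main__":
    # Toy usage: 4 utterances, 64 channels, 100 frames x 40 MFCC bins.
    feats = torch.randn(4, 64, 100, 40)
    attn = TwoStreamPoolingChannelAttention(channels=64)
    print(attn(feats).shape)  # torch.Size([4, 64, 100, 40])
```

The two pooling streams capture complementary channel statistics (average energy vs. peak activation); how TsPCA actually combines its streams and models emotion sequence information across channels is described in the paper itself.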
Pages: 10623-10636
Page count: 14
Related Papers
50 records in total
  • [1] Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
    Zhang, Hua
    Gou, Ruoyun
    Shang, Jili
    Shen, Fangyao
    Wu, Yifan
    Dai, Guojun
    FRONTIERS IN PHYSIOLOGY, 2021, 12
  • [2] Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
    Minh Tran
    Yin, Yufeng
    Soleymani, Mohammad
    INTERSPEECH 2023, 2023, : 636 - 640
  • [3] On the Usage of Pre-Trained Speech Recognition Deep Layers to Detect Emotions
    Oliveira, Jorge
    Praca, Isabel
    IEEE ACCESS, 2021, 9 : 9699 - 9705
  • [4] Classification of Speech Emotion State Based on Feature Map Fusion of TCN and Pretrained CNN Model From Korean Speech Emotion Data
    Jo, A-Hyeon
    Kwak, Keun-Chang
    IEEE ACCESS, 2025, 13 : 19947 - 19963
  • [5] MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model
    Chen, Zengzhao
    Liu, Chuan
    Wang, Zhifeng
    Zhao, Chuanxu
    Lin, Mengting
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 273
  • [6] Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
    Girish, K. V. Vijay
    Konjeti, Srikanth
    Vepa, Jithendra
    INTERSPEECH 2022, 2022, : 4496 - 4500
  • [7] IMPROVING CTC-BASED SPEECH RECOGNITION VIA KNOWLEDGE TRANSFERRING FROM PRE-TRAINED LANGUAGE MODELS
    Deng, Keqi
    Cao, Songjun
    Zhang, Yike
    Ma, Long
    Cheng, Gaofeng
    Xu, Ji
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8517 - 8521
  • [8] FEATURE EXTRACTION USING PRE-TRAINED CONVOLUTIVE BOTTLENECK NETS FOR DYSARTHRIC SPEECH RECOGNITION
    Takashima, Yuki
    Nakashika, Toru
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 1411 - 1415
  • [9] PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
    Feng, Tiantian
    Narayanan, Shrikanth
    2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, ACII, 2023,
  • [10] Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition
    Ezz-Eldin, Mai
    Khalaf, Ashraf A. M.
    Hamed, Hesham F. A.
    Hussein, Aziza I.
    IEEE ACCESS, 2021, 9 : 19999 - 20011