Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, increasing complexity for limited performance benefit. In this paper, we improve pre-trained-model-based multi-view SER from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning the effective features hidden in a low-level speech feature, the mel-frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCCs. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA (weighted and unweighted accuracy) of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
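To make the channel-attention idea in the abstract concrete, the sketch below shows a generic two-stream (average- and max-pooling) channel attention block in PyTorch. This is only a minimal illustration under assumed design choices, not the paper's actual TsPCA module: the tensor layout, the shared bottleneck MLP, the reduction ratio, and the sigmoid gating are standard channel-attention conventions (in the spirit of SE/CBAM), and the random input tensor stands in for CNN feature maps computed over MFCCs.

```python
import torch
import torch.nn as nn


class TwoStreamPoolingChannelAttention(nn.Module):
    """Hypothetical sketch of a two-stream pooling channel attention block.

    Assumes input of shape (batch, channels, time, freq), e.g. feature maps
    produced by a CNN over MFCCs. One stream uses global average pooling and
    the other global max pooling; a shared bottleneck MLP maps each pooled
    vector to per-channel scores, which are summed, squashed with a sigmoid,
    and used to re-weight the channels of the input.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_stream = self.mlp(x.mean(dim=(2, 3)))   # average-pooling stream
        max_stream = self.mlp(x.amax(dim=(2, 3)))   # max-pooling stream
        weights = torch.sigmoid(avg_stream + max_stream).view(b, c, 1, 1)
        return x * weights                          # channel-wise re-weighting


if __name__ == "__main__":
    # Toy usage: 4 utterances, 64 channels, 100 frames x 40 MFCC bins.
    feats = torch.randn(4, 64, 100, 40)
    attn = TwoStreamPoolingChannelAttention(channels=64)
    print(attn(feats).shape)  # torch.Size([4, 64, 100, 40])
```

The two pooling streams capture complementary channel statistics (average energy vs. peak activation); how TsPCA actually combines its streams and models emotion sequence information across channels is described in the paper itself.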
Pages: 10623-10636
Page count: 14
Related Papers
50 records in total
  • [1] Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
    Zhang, Hua
    Gou, Ruoyun
    Shang, Jili
    Shen, Fangyao
    Wu, Yifan
    Dai, Guojun
    FRONTIERS IN PHYSIOLOGY, 2021, 12
  • [2] Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
    Minh Tran
    Yin, Yufeng
    Soleymani, Mohammad
    INTERSPEECH 2023, 2023, : 636 - 640
  • [3] On the Usage of Pre-Trained Speech Recognition Deep Layers to Detect Emotions
    Oliveira, Jorge
    Praca, Isabel
    IEEE ACCESS, 2021, 9 : 9699 - 9705
  • [4] Classification of Speech Emotion State Based on Feature Map Fusion of TCN and Pretrained CNN Model From Korean Speech Emotion Data
    Jo, A-Hyeon
    Kwak, Keun-Chang
    IEEE ACCESS, 2025, 13 : 19947 - 19963
  • [5] MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model
    Chen, Zengzhao
    Liu, Chuan
    Wang, Zhifeng
    Zhao, Chuanxu
    Lin, Mengting
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 273
  • [6] Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
    Girish, K. V. Vijay
    Konjeti, Srikanth
    Vepa, Jithendra
    INTERSPEECH 2022, 2022, : 4496 - 4500
  • [7] IMPROVING CTC-BASED SPEECH RECOGNITION VIA KNOWLEDGE TRANSFERRING FROM PRE-TRAINED LANGUAGE MODELS
    Deng, Keqi
    Cao, Songjun
    Zhang, Yike
    Ma, Long
    Cheng, Gaofeng
    Xu, Ji
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8517 - 8521
  • [8] FEATURE EXTRACTION USING PRE-TRAINED CONVOLUTIVE BOTTLENECK NETS FOR DYSARTHRIC SPEECH RECOGNITION
    Takashima, Yuki
    Nakashika, Toru
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 1411 - 1415
  • [9] PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
    Feng, Tiantian
    Narayanan, Shrikanth
    2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, ACII, 2023,
  • [10] Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition
    Ezz-Eldin, Mai
    Khalaf, Ashraf A. M.
    Hamed, Hesham F. A.
    Hussein, Aziza I.
    IEEE ACCESS, 2021, 9 : 19999 - 20011