Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity while yielding limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature, the mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature-view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
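The abstract only describes the TsPCA module at a high level: two pooling streams produce channel descriptors that discriminatively re-weight the channel dimension of MFCC-derived features. The sketch below is a minimal, hypothetical interpretation of that idea in PyTorch; the class name, layer sizes, and the additive fusion of the two pooled streams are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of a two-stream pooling channel attention block,
# based only on the abstract: average and max pooling over time yield two
# channel descriptors that are fused into per-channel weights.
import torch
import torch.nn as nn

class TwoStreamPoolingChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared bottleneck MLP applied to both pooled descriptors (assumed design).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features derived from MFCC frames.
        avg_desc = x.mean(dim=-1)    # stream 1: average pooling over time
        max_desc = x.amax(dim=-1)    # stream 2: max pooling over time
        # Fuse the two streams and map them to per-channel weights in (0, 1).
        weights = torch.sigmoid(self.mlp(avg_desc) + self.mlp(max_desc))
        return x * weights.unsqueeze(-1)   # re-weight the channel dimension

if __name__ == "__main__":
    feats = torch.randn(8, 64, 300)        # e.g. 64 channels over 300 MFCC frames
    attn = TwoStreamPoolingChannelAttention(channels=64)
    print(attn(feats).shape)               # torch.Size([8, 64, 300])

Under these assumptions, the module leaves the feature shape unchanged and only rescales channels, so it can be dropped between a frozen pre-trained feature extractor and the downstream SER classifier.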
Pages: 10623-10636
Number of pages: 14
Related Papers
50 records in total
  • [21] Speech Emotion Recognition Based on Feature Fusion
    Shen, Qi
    Chen, Guanggen
    Chang, Lin
    PROCEEDINGS OF THE 2017 2ND INTERNATIONAL CONFERENCE ON MATERIALS SCIENCE, MACHINERY AND ENERGY ENGINEERING (MSMEE 2017), 2017, 123 : 1071 - 1074
  • [22] Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files
    Andayani, Felicia
    Theng, Lau Bee
    Tsun, Mark Teekit
    Chua, Caslon
    IEEE ACCESS, 2022, 10 : 36018 - 36027
  • [23] Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech
    Atmaja, Bagus Tris
    Sasou, Akira
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1026 - 1029
  • [24] Strategies for improving low resource speech to text translation relying on pre-trained ASR models
    Kesiraju, Santosh
    Sarvas, Marek
    Pavlicek, Tomas
    Macaire, Cecile
    Ciuba, Alejandro
    INTERSPEECH 2023, 2023, : 2148 - 2152
  • [25] Speech Topic Classification Based on Pre-trained and Graph Networks
    Niu, Fangjing
    Cao, Tengfei
    Hu, Ying
    Huang, Hao
    He, Liang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1721 - 1726
  • [26] LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information
    Cai, Yunrui
    Wu, Zhiyong
    Jia, Jia
    Meng, Helen
    INTERSPEECH 2024, 2024, : 4658 - 4662
  • [27] Pre-Trained Model-Based NFR Classification: Overcoming Limited Data Challenges
    Rahman, Kiramat
    Ghani, Anwar
    Alzahrani, Abdulrahman
    Tariq, Muhammad Usman
    Rahman, Arif Ur
    IEEE ACCESS, 2023, 11 : 81787 - 81802
  • [28] Speech Emotion Recognition via Sparse Learning-Based Fusion Model
    Min, Dong-Jin
    Kim, Deok-Hwan
    IEEE ACCESS, 2024, 12 : 177219 - 177235
  • [29] Speech emotion recognition based on multimodal and multiscale feature fusion
    Hu, Huangshui
    Wei, Jie
    Sun, Hongyu
    Wang, Chuhang
    Tao, Shuo
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [30] Pre-trained Model Based Feature Envy Detection
    Ma, Wenhao
    Yu, Yaoxiang
    Ruan, Xiaoming
    Cai, Bo
    2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 430 - 440