Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Citations: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
CLC Number
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention over the last two years and shows great potential for improving performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, increasing complexity for limited performance benefit. In this paper, we improve multi-view SER based on a pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning the effective features hidden in the low-level mel-scale frequency cepstral coefficient (MFCC) feature. We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on Emo-DB, 77.08% and 77.34% on RAVDESS, and 74.38% and 71.43% on SAVEE. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
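The record does not include the paper's code, but the abstract describes the core mechanism concretely enough to sketch. Below is a minimal, illustrative PyTorch sketch of a two-stream pooling channel attention module applied to MFCC-derived features, assuming a (batch, channels, time) layout. The specific internals here, two pooling streams fused through a shared bottleneck MLP, the reduction ratio, and the class name as written, are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class TsPCA(nn.Module):
    """Two-stream pooling channel attention (illustrative sketch only).

    Average and max pooling over the time axis produce two channel
    descriptors; a shared bottleneck MLP enables inter-channel
    interaction, and the fused, sigmoid-gated result re-weights the
    input channel-wise.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared MLP across both pooling streams (assumed design).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. features derived from MFCC.
        avg_desc = x.mean(dim=2)   # average-pooling stream -> (B, C)
        max_desc = x.amax(dim=2)   # max-pooling stream -> (B, C)
        weights = self.gate(self.mlp(avg_desc) + self.mlp(max_desc))
        return x * weights.unsqueeze(-1)  # channel-wise re-weighting


if __name__ == "__main__":
    feats = torch.randn(8, 64, 300)  # 8 utterances, 64 channels, 300 frames
    out = TsPCA(channels=64)(feats)
    print(out.shape)  # torch.Size([8, 64, 300])
```

A CBAM-style design like this lets the two descriptors capture complementary channel statistics before gating; the paper's module additionally learns emotion sequence information across channels, which this sketch does not attempt to reproduce.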
Pages: 10623 - 10636
Page count: 14
Related Papers
50 records in total
  • [41] Feature extraction analysis method of pre-trained CNN model for SAR target recognition
    Zheng, Tong
    Feng, Wenbin
    Yu, Chongchong
    Wu, Qing
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 2023, 44 (07) : 2294 - 2316
  • [42] XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
    Nguyen, Linh The
    Pham, Thinh
    Nguyen, Dat Quoc
    INTERSPEECH 2023, 2023, : 5506 - 5510
  • [43] A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
    Tu, Zhongwen
    Liu, Bin
    Zhao, Wei
    Yan, Raoxin
    Zou, Yang
    APPLIED SCIENCES-BASEL, 2023, 13 (07)
  • [44] Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition
    Jalal, Md Asif
    Milner, Rosanna
    Hain, Thomas
    INTERSPEECH 2020, 2020, : 4113 - 4117
  • [45] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
    Dahl, George E.
    Yu, Dong
    Deng, Li
    Acero, Alex
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01) : 30 - 42
  • [46] Improving Under-Resourced Code-Switched Speech Recognition: Large Pre-trained Models or Architectural Interventions
    van Vuren, Joshua Jansen
    Niesler, Thomas
    INTERSPEECH 2023, 2023, : 1439 - 1443
  • [47] Feature selection enhancement and feature space visualization for speech-based emotion recognition
    Kanwal, Sofia
    Asghar, Sohail
    Ali, Hazrat
    PEERJ COMPUTER SCIENCE, 2022, 8
  • [48] Improving Automatic Emotion Recognition from Speech Signals
    Bozkurt, Elif
    Erzin, Engin
    Erdem, Cigdem Eroglu
    Erdem, A. Tanju
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 312 - +
  • [49] Prosodic Feature-Based Discriminatively Trained Low Resource Speech Recognition System
    Hasija, Taniya
    Kadyan, Virender
    Guleria, Kalpna
    Alharbi, Abdullah
    Alyami, Hashem
    Goyal, Nitin
    SUSTAINABILITY, 2022, 14 (02)
  • [50] Improving speech emotion recognition based on acoustic words emotion dictionary
    Wei, Wang
    Cao, Xinyi
    Li, He
    Shen, Lingjie
    Feng, Yaqin
    Watters, Paul A.
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (06) : 747 - 761