Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

Cited by: 23
Authors
Chen, Weidong [1]
Xing, Xiaofen [1]
Chen, Peihao [2]
Xu, Xiangmin [3,4]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510640, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou 510640, Peoples R China
[3] South China Univ Technol, Sch Future Technol, Guangzhou 511442, Peoples R China
[4] Pazhou Lab, Guangzhou 510330, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Training; Emotion recognition; Adaptation models; Cross layer design; Computational modeling; Semantics; Speech recognition; Pretrained model; speech emotion recognition; self-supervised learning; representation learning; FRAMEWORK; NETWORK; ENHANCEMENT;
DOI
10.1109/TAFFC.2024.3369726
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article presents a paradigm that adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, so their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. These limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this article, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes emotional characteristics into account. To enhance its sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Vesper then applies hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and that Vesper with 12 layers surpasses WavLM Large with 24 layers.
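For intuition, the following is a minimal PyTorch sketch of the two mechanisms the abstract names: a short-time-energy stand-in for the emotion-guided masking strategy, and an L1 cross-layer loss tying a compact student encoder to a larger teacher. The function names, the energy heuristic, and the choice of layer pairs are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def emotion_guided_mask(wave, frame_len=400, mask_ratio=0.3):
    # Split the waveform into non-overlapping frames and score each frame by
    # short-time energy, used here as a crude proxy for emotional salience.
    frames = wave.unfold(-1, frame_len, frame_len)   # (..., n_frames, frame_len)
    energy = frames.pow(2).mean(dim=-1)              # (..., n_frames)
    n_mask = max(1, int(energy.size(-1) * mask_ratio))
    top = energy.topk(n_mask, dim=-1).indices        # highest-energy frames
    mask = torch.zeros_like(energy, dtype=torch.bool)
    mask.scatter_(-1, top, True)                     # True = frame to be masked
    return mask

def cross_layer_loss(student_states, teacher_states):
    # Supervise a shallow student layer with a shallow teacher layer (acoustic
    # information) and the deepest layers with each other (semantic
    # information), so that both representation levels are trained.
    acoustic = F.l1_loss(student_states[0], teacher_states[0])
    semantic = F.l1_loss(student_states[-1], teacher_states[-1])
    return acoustic + semantic

Here student_states and teacher_states would be lists of hidden states from the compact student encoder and the frozen WavLM teacher, respectively; the actual pretraining objective and layer correspondence are defined in the article itself.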
Pages: 1711-1724
Number of pages: 14
References
74 in total
[11] Chen, Jun; Guo, Han; Yi, Kai; Li, Boyang; Elhoseiny, Mohamed. VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 18009-18019.
[12] Chen, L.-W. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023: 1.
[13] Chen, P.-H. AAAI Conference on Artificial Intelligence, 2021, 35: 1045.
[14] Chen, Sanyuan; Wang, Chengyi; Chen, Zhengyang; Wu, Yu; Liu, Shujie; Chen, Zhuo; Li, Jinyu; Kanda, Naoyuki; Yoshioka, Takuya; Xiao, Xiong; Wu, Jian; Zhou, Long; Ren, Shuo; Qian, Yanmin; Qian, Yao; Zeng, Michael; Yu, Xiangzhan; Wei, Furu. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518.
[15] Chen, T. Proceedings of Machine Learning Research, 2020, 119.
[16] Chen, W. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023: 1.
[17] Chen, Weidong; Xing, Xiaofen; Xu, Xiangmin; Pang, Jianxin; Du, Lan. SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech. INTERSPEECH 2022, 2022: 346-350.
[18] Chen, Weidong; Xing, Xiaofen; Xu, Xiangmin; Pang, Jianxin; Du, Lan. SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 775-788.
[19] Chen, Weidong; Xing, Xiaofen; Xu, Xiangmin; Yang, Jichen; Pang, Jianxin. Key-Sparse Transformer for Multimodal Speech Emotion Recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6897-6901.
[20] Chetia Phukan, O. INTERSPEECH, 2023: 1903.