CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Cited by: 2
Authors
Meng, Chutong [1]
Ao, Junyi [1,2]
Ko, Tom [1]
Wang, Mingxuan [1]
Li, Haizhou [2]
Affiliations
[1] ByteDance, Beijing, Peoples R China
[2] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
self-supervised learning; BERT; data2vec;
DOI
10.21437/Interspeech.2023-1390
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose Code BERT (CoBERT), an approach for self-supervised speech representation learning. The idea is to convert an utterance into a sequence of discrete codes and then perform code representation learning, in which we predict code representations from a masked view of the original speech input. Unlike prior self-distillation approaches, where the teacher and the student share the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art methods on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
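The abstract describes masked prediction against a code-based teacher. Below is a minimal, illustrative PyTorch sketch of that kind of objective: a frozen teacher encodes the full discrete-code sequence, a student encodes a masked view of the speech features, and the loss regresses the student's outputs to the teacher's representations at the masked positions. The module sizes, plain Transformer encoders, 1:1 frame-to-code alignment, mask ratio, and MSE loss are all assumptions for illustration, not the released CoBERT configuration.

# Illustrative sketch only; hyperparameters and architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodeTeacher(nn.Module):
    """Encodes discrete codes (e.g. clustered speech units) into target representations."""
    def __init__(self, num_codes=500, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, codes):                     # codes: (B, T) long
        return self.encoder(self.embed(codes))    # (B, T, dim)

class SpeechStudent(nn.Module):
    """Encodes a masked view of frame-level speech features."""
    def __init__(self, feat_dim=80, dim=256, layers=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats, mask):               # feats: (B, T, feat_dim), mask: (B, T) bool
        x = self.proj(feats)
        # Replace masked frames with a learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.encoder(x)                    # (B, T, dim)

def cobert_style_loss(student, teacher, feats, codes, mask_ratio=0.5):
    """Regress student outputs to frozen code-teacher targets at masked frames."""
    mask = torch.rand(codes.shape) < mask_ratio   # random frame mask (assumed strategy)
    with torch.no_grad():                         # teacher only provides targets
        targets = teacher(codes)
    preds = student(feats, mask)
    return F.mse_loss(preds[mask], targets[mask])

# Toy usage: 2 utterances, 100 frames, 80-dim features, codes aligned 1:1 with frames.
teacher, student = CodeTeacher(), SpeechStudent()
feats = torch.randn(2, 100, 80)
codes = torch.randint(0, 500, (2, 100))
loss = cobert_style_loss(student, teacher, feats, codes)
loss.backward()

In this sketch only the student receives gradients; the teacher stands in for a code model pretrained separately on the discrete units and then frozen to supply cross-modal targets.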
Pages: 2978-2982
Number of pages: 5