CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Cited by: 2
Authors
Meng, Chutong [1]
Ao, Junyi [1, 2]
Ko, Tom [1]
Wang, Mingxuan [1]
Li, Haizhou [2]
Affiliations
[1] ByteDance, Beijing, People's Republic of China
[2] The Chinese University of Hong Kong, Shenzhen Research Institute of Big Data, School of Data Science, Shenzhen, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
self-supervised learning; BERT; data2vec
DOI
10.21437/Interspeech.2023-1390
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes and perform code representation learning, in which we predict the code representations from a masked view of the original speech input. Unlike prior self-distillation approaches, in which the teacher and the student are of the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art systems on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
Pages: 2978-2982
Page count: 5