CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Cited by: 2
Authors
Meng, Chutong [1]
Ao, Junyi [1,2]
Ko, Tom [1]
Wang, Mingxuan [1]
Li, Haizhou [2]
Affiliations
[1] ByteDance, Beijing, Peoples R China
[2] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
self-supervised learning; BERT; data2vec;
DOI
10.21437/Interspeech.2023-1390
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose Code BERT (CoBERT), an approach for self-supervised speech representation learning. The idea is to convert an utterance into a sequence of discrete codes and then perform code representation learning, in which we predict code representations from a masked view of the original speech input. Unlike prior self-distillation approaches, where the teacher and the student share the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art methods on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
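The abstract describes masked prediction against a code-based teacher. Below is a minimal, illustrative PyTorch sketch of that kind of objective: a frozen teacher encodes the full discrete-code sequence, a student encodes a masked view of the speech features, and the loss regresses the student's outputs to the teacher's representations at the masked positions. The module sizes, plain Transformer encoders, 1:1 frame-to-code alignment, mask ratio, and MSE loss are all assumptions for illustration, not the released CoBERT configuration.

# Illustrative sketch only; hyperparameters and architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodeTeacher(nn.Module):
    """Encodes discrete codes (e.g. clustered speech units) into target representations."""
    def __init__(self, num_codes=500, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, codes):                     # codes: (B, T) long
        return self.encoder(self.embed(codes))    # (B, T, dim)

class SpeechStudent(nn.Module):
    """Encodes a masked view of frame-level speech features."""
    def __init__(self, feat_dim=80, dim=256, layers=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats, mask):               # feats: (B, T, feat_dim), mask: (B, T) bool
        x = self.proj(feats)
        # Replace masked frames with a learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.encoder(x)                    # (B, T, dim)

def cobert_style_loss(student, teacher, feats, codes, mask_ratio=0.5):
    """Regress student outputs to frozen code-teacher targets at masked frames."""
    mask = torch.rand(codes.shape) < mask_ratio   # random frame mask (assumed strategy)
    with torch.no_grad():                         # teacher only provides targets
        targets = teacher(codes)
    preds = student(feats, mask)
    return F.mse_loss(preds[mask], targets[mask])

# Toy usage: 2 utterances, 100 frames, 80-dim features, codes aligned 1:1 with frames.
teacher, student = CodeTeacher(), SpeechStudent()
feats = torch.randn(2, 100, 80)
codes = torch.randint(0, 500, (2, 100))
loss = cobert_style_loss(student, teacher, feats, codes)
loss.backward()

In this sketch only the student receives gradients; the teacher stands in for a code model pretrained separately on the discrete units and then frozen to supply cross-modal targets.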
Pages: 2978-2982
Number of pages: 5