CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Cited by: 2

Authors:

Meng, Chutong [1]

Ao, Junyi [1,2]

Ko, Tom [1]

Wang, Mingxuan [1]

Li, Haizhou [2]

Affiliations:

[1] ByteDance, Beijing, People's Republic of China

[2] The Chinese University of Hong Kong, Shenzhen Research Institute of Big Data, School of Data Science, Shenzhen, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

self-supervised learning; BERT; data2vec;

DOI:

10.21437/Interspeech.2023-1390

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech input. Unlike the prior self-distillation approaches of which the teacher and the student are of the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art performance on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.

Pages: 2978-2982

Page count: 5
