ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING

Cited by: 3

Authors:

Kim, Hyung Yong [1]

Kim, Byeong-Yeol [1]

Yoo, Seung Woo [1]

Lim, Youshin [1]

Lim, Yunkyu [1]

Lee, Hanbin [1]

Affiliations:

[1] 42Dot Inc, Seoul, South Korea

Source:

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022

Keywords:

self-supervised learning; self-training; automatic speech recognition; masked language modeling; hidden-unit BERT

DOI:

10.1109/SLT54892.2023.10023214

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Self-supervised learning (SSL) pre-training generally performs well on a variety of speech processing tasks. However, this pre-training scheme may lead to a sub-optimal solution when fine-tuning for a specific task, such as automatic speech recognition (ASR). To provide a pre-trained model better suited to ASR, we introduce an ASR-specific hidden-unit BERT with self-training, namely ASBERT. Motivated by self-training, we extract linguistically related pseudo labels from the fine-tuned model and use these labels in the next pre-training procedure. Experimental results on the LibriSpeech test-clean and test-other datasets show that ASBERT without a language model (LM) outperforms the conventional SSL and self-training models, achieving relative word error rate reductions (RERR) of 6.3/2.0% and 15.4/13.2%. Moreover, without using pseudo-transcriptions, ASBERT yields performance comparable to the conventional self-training method.


Pages: 9-14

Page count: 6
