ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING

Times Cited: 3
Authors
Kim, Hyung Yong [1 ]
Kim, Byeong-Yeol [1 ]
Yoo, Seung Woo [1 ]
Lim, Youshin [1 ]
Lim, Yunkyu [1 ]
Lee, Hanbin [1 ]
Affiliation
[1] 42Dot Inc, Seoul, South Korea
Source
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022
Keywords
self-supervised learning; self-training; automatic speech recognition; masked language modeling; hidden-unit BERT;
DOI
10.1109/SLT54892.2023.10023214
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-training with self-supervised learning (SSL) generally yields good performance across various speech processing tasks. However, this pre-training scheme may lead to a sub-optimal solution when the model is fine-tuned for a specific task such as automatic speech recognition (ASR). To provide a pre-trained model better suited to ASR, we introduce an ASR-Specific hidden-unit BERT with self-training, namely ASBERT. Motivated by self-training, we extract linguistically related pseudo labels from the fine-tuned model, and these labels are used in the next pre-training procedure. Experimental results on the LibriSpeech test-clean and test-other datasets show that ASBERT without a language model (LM) outperforms the conventional SSL and self-training models, achieving 6.3/2.0% and 15.4/13.2% relative word error rate reduction (RERR). Moreover, without using pseudo-transcriptions, ASBERT yields performance comparable to the conventional self-training method.
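As a rough illustration of the self-training idea summarized in the abstract, the sketch below shows HuBERT-style masked prediction in which the targets are frame-level pseudo labels taken from a fine-tuned ASR model rather than clustered acoustic units. All module names, dimensions, masking details, and the label-extraction stub (pseudo_labels_from_finetuned) are illustrative assumptions, not the authors' implementation.

# Minimal PyTorch sketch (assumptions labeled in comments): pseudo labels from a
# fine-tuned ASR model serve as targets for masked prediction during pre-training.
import torch
import torch.nn as nn

VOCAB = 32          # assumed size of the pseudo-label inventory (phoneme-like units)
DIM = 256           # assumed encoder width
MASK_PROB = 0.08    # assumed fraction of frames that get masked

class ToyEncoder(nn.Module):
    """Stand-in for the Transformer encoder that is pre-trained."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, DIM)                    # 80-dim log-mel features assumed
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)                 # predicts the pseudo-label units

    def forward(self, feats, mask):
        x = self.proj(feats)
        x = torch.where(mask.unsqueeze(-1), torch.zeros_like(x), x)  # zero out masked frames
        return self.head(self.encoder(x))

def pseudo_labels_from_finetuned(feats):
    """Placeholder for frame-level labels decoded by the fine-tuned ASR model.
    Here they are random; in the paper they come from the fine-tuned model."""
    return torch.randint(0, VOCAB, feats.shape[:2])

feats = torch.randn(4, 200, 80)                     # (batch, frames, mel bins), toy data
labels = pseudo_labels_from_finetuned(feats)
mask = torch.rand(4, 200) < MASK_PROB               # simple frame-level mask (spans omitted)

model = ToyEncoder()
logits = model(feats, mask)
# Masked-prediction objective: recover the pseudo labels only at masked positions.
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
print(f"masked-prediction loss: {loss.item():.3f}")

The pseudo-label extractor above is a toy stub; the key point is only that the pre-training targets are supplied by a previously fine-tuned model, closing the self-training loop described in the abstract.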
Pages: 9-14
Number of Pages: 6