COMBINING UNSUPERVISED AND TEXT AUGMENTED SEMI-SUPERVISED LEARNING FOR LOW RESOURCED AUTOREGRESSIVE SPEECH RECOGNITION

Cited by: 2
Authors
Li, Chak-Fai [1 ]
Keith, Francis [1 ]
Hartmann, William [1 ]
Snover, Matthew [1 ]
Affiliations
[1] Raytheon BBN Technologies, Cambridge, MA 02138, USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
seq2seq; unsupervised learning; semi-supervised training; domain adaptation; representation
DOI
10.1109/ICASSP43922.2022.9747005
CLC Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation to conversational and broadcast domains that are low-resource in terms of both data and compute. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the two techniques are complementary: combining them yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models, and CTC-based decoding lets us take better advantage of this text data. When the CTC-based decoder is used as a transcription model, semi-supervised training allows the Conformer model to incorporate knowledge from the language model more effectively than shallow fusion does, yielding a further 2% absolute improvement over shallow fusion.
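To make the language-model fusion idea concrete, below is a minimal Python sketch (not the authors' code) of producing a pseudo-label for semi-supervised training by rescoring an n-best list with an external LM. The fused score follows the standard shallow-fusion form score(y) = log p_am(y|x) + lambda * log p_lm(y); the stub models, the toy vocabulary, and the lm_weight value are illustrative assumptions, and the paper's actual comparison is between this fusion applied inside the autoregressive beam search and CTC-based decoding combined with the LM.

import math
from typing import Callable, List, Tuple

def shallow_fusion_rescore(
    nbest: List[Tuple[str, float]],      # (hypothesis, acoustic log-prob) pairs
    lm_logprob: Callable[[str], float],  # external LM: hypothesis -> log-prob
    lm_weight: float = 0.3,              # fusion weight lambda (assumed value)
) -> str:
    """Return the hypothesis maximizing log p_am + lambda * log p_lm."""
    best_hyp, best_score = "", -math.inf
    for hyp, am_score in nbest:
        fused = am_score + lm_weight * lm_logprob(hyp)
        if fused > best_score:
            best_hyp, best_score = hyp, fused
    return best_hyp

# Toy usage: a unigram "LM" over a tiny vocabulary stands in for a neural LM
# trained on the additional text data (all scores here are made up).
unigram = {"hello": -1.0, "world": -1.5, "word": -4.0}

def toy_lm(hyp: str) -> float:
    return sum(unigram.get(w, -10.0) for w in hyp.split())

nbest = [("hello word", -2.0), ("hello world", -2.3)]
print(shallow_fusion_rescore(nbest, toy_lm))  # -> "hello world"

In the paper's pipeline, the selected hypothesis would serve as the transcript for unlabeled audio in the next round of semi-supervised training; the reported finding is that generating those transcripts with CTC-based decoding plus the LM transfers the LM's knowledge into the Conformer more effectively than shallow fusion in the decoder.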
Pages: 6892-6896
Number of pages: 5