Improving Transformer-based Speech Recognition with Unsupervised Pre-training and Multi-task Semantic Knowledge Learning

Cited by: 7
Authors
Li, Song [1 ]
Li, Lin [1 ]
Hong, Qingyang [2 ]
Liu, Lingling [1 ]
Affiliations
[1] Xiamen Univ, Sch Elect Sci & Engn, Xiamen, Fujian, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Fujian, Peoples R China
Source
INTERSPEECH 2020 | 2020
Funding
National Natural Science Foundation of China;
Keywords
unsupervised pre-training; speech recognition; Transformer; multi-task learning; semi-supervised learning;
DOI
10.21437/Interspeech.2020-2007
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification codes
100104; 100213;
Abstract
Recently, Transformer-based end-to-end speech recognition systems have become the state of the art. However, one prominent problem with current end-to-end speech recognition systems is that a large amount of paired data is required to achieve good recognition performance. To address this issue, we propose two unsupervised pre-training strategies, one for the encoder and one for the decoder of the Transformer, which make full use of unpaired data for training. In addition, we propose a new semi-supervised fine-tuning method, named multi-task semantic knowledge learning, which strengthens the Transformer's ability to learn semantic knowledge and thereby improves system performance. With the proposed methods, we achieve a best CER of 5.9% on the AISHELL-1 test set, exceeding the best end-to-end model by a 10.6% relative CER reduction. Moreover, relative CER reductions of 20.3% and 17.8% are obtained on low-resource Mandarin and English data sets, respectively.
Pages: 5006 - 5010
Page count: 5
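
To make the fine-tuning idea in the abstract concrete, the minimal sketch below shows one way a multi-task semantic knowledge objective could be mixed with the usual ASR loss during semi-supervised fine-tuning: the decoder's cross-entropy over output tokens is interpolated with an auxiliary semantic loss, sketched here as BERT-style masked-token prediction on the transcript. The mixing weight alpha, the pad_id, and the exact form of the auxiliary objective are illustrative assumptions, not details taken from the paper.

import torch.nn.functional as F

def multi_task_loss(decoder_logits, asr_targets,
                    semantic_logits, semantic_targets,
                    alpha=0.3, pad_id=0):
    # decoder_logits / semantic_logits: (batch, time, vocab) tensors;
    # asr_targets / semantic_targets: (batch, time) token-id tensors.
    # Standard cross-entropy over the ASR output tokens, ignoring padding.
    asr_loss = F.cross_entropy(decoder_logits.transpose(1, 2),
                               asr_targets, ignore_index=pad_id)
    # Auxiliary semantic-knowledge objective, sketched as masked-token
    # prediction on the transcript (the exact objective is an assumption).
    semantic_loss = F.cross_entropy(semantic_logits.transpose(1, 2),
                                    semantic_targets, ignore_index=pad_id)
    # Interpolate the two losses; alpha is a hypothetical mixing weight.
    return (1.0 - alpha) * asr_loss + alpha * semantic_loss

In practice such a loss would be applied on top of the unsupervised pre-trained encoder and decoder; the concrete objective and weighting schedule used by the authors are not given in this record.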