AN EXPLORATION OF SELF-SUPERVISED PRETRAINED REPRESENTATIONS FOR END-TO-END SPEECH RECOGNITION

Cited by: 34
Authors
Chang, Xuankai [1 ]
Maekaku, Takashi [2 ]
Guo, Pengcheng [3 ]
Shi, Jing [4 ]
Lu, Yen-Ju [5 ]
Subramanian, Aswin Shanmugam [6 ]
Wang, Tianzi [6 ]
Yang, Shu-wen [7 ]
Tsao, Yu [5 ]
Lee, Hung-yi [7 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Yahoo Japan Corp, Tokyo, Japan
[3] Northwestern Polytech Univ, Xian, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[5] Acad Sinica, Taipei, Taiwan
[6] Johns Hopkins Univ, Baltimore, MD USA
[7] Natl Taiwan Univ, Taipei, Taiwan
Source
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU) | 2021
Funding
U.S. National Science Foundation;
Keywords
Representation Learning; End-to-End Speech Recognition; ESPnet;
DOI
10.1109/ASRU51503.2021.9688137
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal can be learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations across a variety of tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, match or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we explore further scenarios to test whether the pretrained representations remain effective, such as cross-language and overlapped-speech settings. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce and improve upon our experiments.
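The released ESPnet recipes are the reference implementation of the setup the abstract describes; purely as a hedged illustration of the front-end idea, the sketch below extracts HuBERT hidden states with torchaudio's bundled checkpoint. The checkpoint choice, the dummy waveform, and the layer selection are illustrative assumptions, not the paper's exact configuration.

    # Illustrative sketch only: use a pretrained HuBERT front-end's hidden
    # states as input features for a downstream E2E-ASR encoder.
    import torch
    import torchaudio

    # Assumption: torchaudio's bundled HuBERT-Base checkpoint stands in
    # for the pretrained representations studied in the paper.
    bundle = torchaudio.pipelines.HUBERT_BASE
    model = bundle.get_model().eval()

    # Dummy one-second waveform at the model's expected rate (16 kHz).
    waveform = torch.randn(1, bundle.sample_rate)

    with torch.no_grad():
        # extract_features returns the hidden states of every transformer
        # layer; a back-end ASR model can consume a single layer or a
        # learned weighted sum over layers (as in SUPERB-style evaluation).
        features, _ = model.extract_features(waveform)

    print(len(features), features[-1].shape)  # 12 layers, each (1, frames, 768)

In this spirit, the back-end architecture and training strategy stay untouched; only the acoustic feature extractor is swapped for the pretrained representation.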
Pages: 228-235
Page count: 8