AN EXPLORATION OF SELF-SUPERVISED PRETRAINED REPRESENTATIONS FOR END-TO-END SPEECH RECOGNITION

Cited by: 34
Authors
Chang, Xuankai [1 ]
Maekaku, Takashi [2 ]
Guo, Pengcheng [3 ]
Shi, Jing [4 ]
Lu, Yen-Ju [5 ]
Subramanian, Aswin Shanmugam [6 ]
Wang, Tianzi [6 ]
Yang, Shu-wen [7 ]
Tsao, Yu [5 ]
Lee, Hung-yi [7 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Yahoo Japan Corp, Tokyo, Japan
[3] Northwestern Polytech Univ, Xian, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[5] Acad Sinica, Taipei, Taiwan
[6] Johns Hopkins Univ, Baltimore, MD USA
[7] Natl Taiwan Univ, Taipei, Taiwan
Source
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU) | 2021
Funding
U.S. National Science Foundation;
Keywords
Representation Learning; End-to-End Speech Recognition; ESPnet;
DOI
10.1109/ASRU51503.2021.9688137
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal can be learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations across a variety of tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, match or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we explore further scenarios to test whether the pretrained representations remain effective, such as cross-language and overlapped-speech settings. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce and improve upon our experiments.
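The released ESPnet recipes are the reference implementation of the setup the abstract describes; purely as a hedged illustration of the front-end idea, the sketch below extracts HuBERT hidden states with torchaudio's bundled checkpoint. The checkpoint choice, the dummy waveform, and the layer selection are illustrative assumptions, not the paper's exact configuration.

    # Illustrative sketch only: use a pretrained HuBERT front-end's hidden
    # states as input features for a downstream E2E-ASR encoder.
    import torch
    import torchaudio

    # Assumption: torchaudio's bundled HuBERT-Base checkpoint stands in
    # for the pretrained representations studied in the paper.
    bundle = torchaudio.pipelines.HUBERT_BASE
    model = bundle.get_model().eval()

    # Dummy one-second waveform at the model's expected rate (16 kHz).
    waveform = torch.randn(1, bundle.sample_rate)

    with torch.no_grad():
        # extract_features returns the hidden states of every transformer
        # layer; a back-end ASR model can consume a single layer or a
        # learned weighted sum over layers (as in SUPERB-style evaluation).
        features, _ = model.extract_features(waveform)

    print(len(features), features[-1].shape)  # 12 layers, each (1, frames, 768)

In this spirit, the back-end architecture and training strategy stay untouched; only the acoustic feature extractor is swapped for the pretrained representation.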
Pages: 228-235
Page count: 8