COMBINING UNSUPERVISED AND TEXT AUGMENTED SEMI-SUPERVISED LEARNING FOR LOW RESOURCED AUTOREGRESSIVE SPEECH RECOGNITION

Cited by: 2
Authors
Li, Chak-Fai [1 ]
Keith, Francis [1 ]
Hartmann, William [1 ]
Snover, Matthew [1 ]
Affiliations
[1] Raytheon BBN Technologies, Cambridge, MA 02138, USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
seq2seq; unsupervised learning; semi-supervised training; domain adaptation; representation
DOI
10.1109/ICASSP43922.2022.9747005
CLC Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation to conversational and broadcast domains that are low-resource in terms of both data and compute. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the two techniques are complementary: combining them yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models, and CTC-based decoding lets us take better advantage of this text data. When the CTC-based decoder is used as a transcription model, semi-supervised training allows the Conformer model to incorporate knowledge from the language model more effectively than shallow fusion does, yielding a further 2% absolute improvement over shallow fusion.
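To make the language-model fusion idea concrete, below is a minimal Python sketch (not the authors' code) of producing a pseudo-label for semi-supervised training by rescoring an n-best list with an external LM. The fused score follows the standard shallow-fusion form score(y) = log p_am(y|x) + lambda * log p_lm(y); the stub models, the toy vocabulary, and the lm_weight value are illustrative assumptions, and the paper's actual comparison is between this fusion applied inside the autoregressive beam search and CTC-based decoding combined with the LM.

import math
from typing import Callable, List, Tuple

def shallow_fusion_rescore(
    nbest: List[Tuple[str, float]],      # (hypothesis, acoustic log-prob) pairs
    lm_logprob: Callable[[str], float],  # external LM: hypothesis -> log-prob
    lm_weight: float = 0.3,              # fusion weight lambda (assumed value)
) -> str:
    """Return the hypothesis maximizing log p_am + lambda * log p_lm."""
    best_hyp, best_score = "", -math.inf
    for hyp, am_score in nbest:
        fused = am_score + lm_weight * lm_logprob(hyp)
        if fused > best_score:
            best_hyp, best_score = hyp, fused
    return best_hyp

# Toy usage: a unigram "LM" over a tiny vocabulary stands in for a neural LM
# trained on the additional text data (all scores here are made up).
unigram = {"hello": -1.0, "world": -1.5, "word": -4.0}

def toy_lm(hyp: str) -> float:
    return sum(unigram.get(w, -10.0) for w in hyp.split())

nbest = [("hello word", -2.0), ("hello world", -2.3)]
print(shallow_fusion_rescore(nbest, toy_lm))  # -> "hello world"

In the paper's pipeline, the selected hypothesis would serve as the transcript for unlabeled audio in the next round of semi-supervised training; the reported finding is that generating those transcripts with CTC-based decoding plus the LM transfers the LM's knowledge into the Conformer more effectively than shallow fusion in the decoder.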
Pages: 6892-6896
Number of pages: 5