WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Cited by: 419
Authors
Chen, Sanyuan [1 ]
Wang, Chengyi [2 ]
Chen, Zhengyang [3 ]
Wu, Yu [4 ]
Liu, Shujie [4 ]
Chen, Zhuo [5 ]
Li, Jinyu [5 ]
Kanda, Naoyuki [5 ]
Yoshioka, Takuya [5 ]
Xiao, Xiong [5 ]
Wu, Jian [5 ]
Zhou, Long [4 ]
Ren, Shuo [4 ]
Qian, Yanmin [3 ]
Qian, Yao [5 ]
Zeng, Michael [5 ]
Yu, Xiangzhan [1 ]
Wei, Furu [4 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Nankai Univ, Tianjin 300071, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
[5] Microsoft Corp, Redmond, WA 98052 USA
Keywords
Task analysis; Predictive models; Transformers; Data models; Speech recognition; Convolution; Benchmark testing; Self-supervised learning; speech pre-training; Representation
DOI
10.1109/JSTSP.2022.3188113
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Codes
0808; 0809
Abstract
Self-supervised learning (SSL) has achieved great success in speech recognition, while other speech processing tasks have seen only limited exploration. Because the speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. To tackle this problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. In this way, WavLM not only retains its speech content modeling capability through masked speech prediction, but also improves its potential on non-ASR tasks through speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks.
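The gated relative position bias mentioned in the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration under simplifying assumptions, not the paper's exact formulation: the linear distance bucketing, the single learnable gate vector, and all dimensions are hypothetical choices for demonstration; the published WavLM architecture differs in its bucketing scheme and gating details.

```python
# Minimal sketch: self-attention with a gated relative position bias.
# Illustrative simplification in the spirit of WavLM's backbone, NOT
# the paper's exact design. Bucketing, gate parameterization, and all
# shapes here are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedRelPosAttention(nn.Module):
    def __init__(self, dim: int, num_buckets: int = 32):
        super().__init__()
        self.scale = dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learnable bias per relative-distance bucket.
        self.rel_bias = nn.Embedding(num_buckets, 1)
        # Gate vector: the bias contribution is scaled by the query content.
        self.gate = nn.Parameter(torch.zeros(dim))
        self.num_buckets = num_buckets

    def _bucket(self, rel_pos: torch.Tensor) -> torch.Tensor:
        # Clamp signed distances into [0, num_buckets): a crude linear
        # bucketing (the real model's scheme is more elaborate).
        half = self.num_buckets // 2
        return rel_pos.clamp(-half, half - 1) + half

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Standard scaled dot-product scores: (batch, t, t).
        scores = torch.einsum("bid,bjd->bij", q, k) * self.scale

        # Signed relative distances i - j, mapped to bucketed biases.
        pos = torch.arange(t, device=x.device)
        rel = pos[None, :] - pos[:, None]                     # (t, t)
        bias = self.rel_bias(self._bucket(rel)).squeeze(-1)   # (t, t)

        # Content-dependent gate in [0, 1], one scalar per query position.
        g = torch.sigmoid(q @ self.gate)                      # (b, t)
        scores = scores + g[..., None] * bias                 # gated bias

        return torch.einsum("bij,bjd->bid", F.softmax(scores, dim=-1), v)


# Quick smoke test on random features.
attn = GatedRelPosAttention(dim=64)
out = attn(torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 64])
```

The intent is only to show the mechanism: a learned bias per relative-distance bucket is added to the attention scores, scaled by a gate computed from the query, so the positional contribution adapts to the content at each position.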
Pages: 1505-1518 (14 pages)