An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2
Authors
Zhu, Qiu-shi [1 ]
Zhang, Jie [1 ]
Wu, Ming-hui [2 ]
Fang, Xin [1 ,2 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China
Source
INTERSPEECH 2021 | 2021
Funding
National Key Research and Development Program of China
Keywords
Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence
DOI
10.21437/Interspeech.2021-67
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. It has also been shown that combining convolutional neural networks and transformers to model both local and global dependencies is beneficial for tasks such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models remains an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places a convolution module in parallel with the self-attention module, then serializes their combined output with another convolution module, and sandwiches this structure between a pair of feed-forward modules. Experimental results show that the model pre-trained with the proposed method reduces the word error rate (WER) compared to our reproduced wav2vec 2.0, at the cost of a slight increase in the number of model parameters.
Pages: 4334-4338
Page count: 5
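The block structure described in the abstract maps naturally onto code. Below is a minimal PyTorch sketch of one such encoder block, assuming only what the abstract states: a feed-forward module, a convolution module in parallel with self-attention, a serial convolution module, and a closing feed-forward module. Everything else is an assumption, not a detail from the paper: summing the parallel branches, the depthwise-separable convolution internals, the half-step residual weighting (borrowed from the Conformer architecture), and all sizes. The class name LocalGlobalEncoderBlock is hypothetical.

import torch
import torch.nn as nn


class Transpose(nn.Module):
    # Swap the time and channel axes so Conv1d sees (batch, dim, time).
    def forward(self, x):
        return x.transpose(1, 2)


class LocalGlobalEncoderBlock(nn.Module):
    """Hypothetical encoder block: FFN -> (conv || self-attention) ->
    conv -> FFN, following the structure described in the abstract."""

    def __init__(self, dim=512, heads=8, kernel_size=31, ffn_mult=4,
                 dropout=0.1):
        super().__init__()
        self.ffn1 = self._ffn(dim, ffn_mult, dropout)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.parallel_conv = self._conv(dim, kernel_size, dropout)
        self.serial_conv = self._conv(dim, kernel_size, dropout)
        self.ffn2 = self._ffn(dim, ffn_mult, dropout)
        self.out_norm = nn.LayerNorm(dim)

    @staticmethod
    def _ffn(dim, mult, dropout):
        # Pre-norm position-wise feed-forward module.
        return nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * mult),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mult, dim),
            nn.Dropout(dropout),
        )

    @staticmethod
    def _conv(dim, kernel_size, dropout):
        # Depthwise-separable convolution module (a Conformer-style
        # choice; the paper's module may differ).
        return nn.Sequential(
            nn.LayerNorm(dim),
            Transpose(),
            nn.Conv1d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, 1),
            Transpose(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # First half of the feed-forward sandwich (the 0.5 residual
        # weighting is an assumption borrowed from Conformer).
        x = x + 0.5 * self.ffn1(x)
        # Convolution and self-attention run in parallel on the same
        # input; summing the branches is an assumed combination rule.
        a = self.attn_norm(x)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        x = x + attn_out + self.parallel_conv(x)
        # Serial convolution module applied after the parallel stage.
        x = x + self.serial_conv(x)
        # Second half of the feed-forward sandwich.
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


# Usage: a batch of 2 sequences, 100 frames, 512-dim features.
block = LocalGlobalEncoderBlock()
out = block(torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 100, 512])

The residual connections and pre-norm layout keep each module optional in effect: if a branch learns to output near-zero, the block degrades gracefully toward a plain transformer layer, which is one common motivation for this kind of hybrid design.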