RECENT DEVELOPMENTS ON ESPNET TOOLKIT BOOSTED BY CONFORMER

Cited by: 137
Authors
Guo, Pengcheng [1, 2]
Boyer, Florian [3, 4]
Chang, Xuankai [2]
Hayashi, Tomoki [5]
Higuchi, Yosuke [6]
Inaguma, Hirofumi [7]
Kamo, Naoyuki [8]
Li, Chenda [9]
Garcia-Romero, Daniel [2]
Shi, Jiatong [2]
Shi, Jing [2, 10]
Watanabe, Shinji [2]
Wei, Kun [1]
Zhang, Wangyou [9]
Zhang, Yuekai [2]
Affiliations
[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] Univ Bordeaux, LaBRI, Bordeaux, France
[4] Airudit, Pessac, France
[5] Human Dataware Lab Co Ltd, Nagoya, Aichi, Japan
[6] Waseda Univ, Tokyo, Japan
[7] Kyoto Univ, Kyoto, Japan
[8] NTT Corp, Tokyo, Japan
[9] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[10] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Conformer; Transformer; End-to-End Speech Processing; Speech
DOI
10.1109/ICASSP39728.2021.9414858
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this study, we present recent developments in ESPnet, the End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer (Convolution-augmented Transformer). This paper reports results for a wide range of end-to-end speech processing applications: automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, or even outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes, with pre-trained models, for all the above tasks using open-source and publicly available corpora. Our aim is to contribute to the research community by reducing the burden of preparing state-of-the-art research environments, which usually requires substantial resources.
Pages: 5874-5878
Page count: 5