RECENT DEVELOPMENTS ON ESPNET TOOLKIT BOOSTED BY CONFORMER

Cited by: 137

Authors:

Guo, Pengcheng ^{[1,2]}

Boyer, Florian ^{[3,4]}

Chang, Xuankai ^{[2]}

Hayashi, Tomoki ^{[5]}

Higuchi, Yosuke ^{[6]}

Inaguma, Hirofumi ^{[7]}

Kamo, Naoyuki ^{[8]}

Li, Chenda ^{[9]}

Garcia-Romero, Daniel ^{[2]}

Shi, Jiatong ^{[2]}

Shi, Jing ^{[2,10]}

Watanabe, Shinji ^{[2]}

Wei, Kun ^{[1]}

Zhang, Wangyou ^{[9]}

Zhang, Yuekai ^{[2]}

Affiliations:

[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China

[2] Johns Hopkins Univ, Baltimore, MD 21218 USA

[3] Univ Bordeaux, LaBRI, Bordeaux, France

[4] Airudit, Pessac, France

[5] Human Dataware Lab Co Ltd, Nagoya, Aichi, Japan

[6] Waseda Univ, Tokyo, Japan

[7] Kyoto Univ, Kyoto, Japan

[8] NTT Corp, Tokyo, Japan

[9] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

[10] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

Conformer; Transformer; End-to-End Speech Processing; Speech

DOI:

10.1109/ICASSP39728.2021.9414858

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this study, we present recent developments on ESPnet, the End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, the Convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, or even outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks, together with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments, which usually require substantial resources.


Pages: 5874-5878

Page count: 5
