GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Cited by: 48
Authors
Chen, Guoguo [1 ,2 ]
Chai, Shuzhou [1 ,3 ]
Wang, Guanbo [1 ,3 ,4 ,5 ]
Du, Jiayu [1 ]
Zhang, Wei-Qiang [1 ,3 ]
Weng, Chao [6 ]
Su, Dan [6 ]
Povey, Daniel [7 ]
Trmal, Jan [4 ,5 ]
Zhang, Junbo [7 ]
Jin, Mingjie [6 ]
Khudanpur, Sanjeev [4 ,5 ]
Watanabe, Shinji [4 ,5 ,8 ]
Zhao, Shuaijiang [9 ]
Zou, Wei [9 ]
Li, Xiangang [9 ]
Yao, Xuchen [2 ]
Wang, Yongqing [7 ]
You, Zhao [6 ]
Yan, Zhiyong [7 ]
Affiliations
[1] SpeechColab, Beijing, Peoples R China
[2] Seasalt AI Inc, Bellevue, WA 98006 USA
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[4] Johns Hopkins Univ, CLSP, Baltimore, MD 21218 USA
[5] Johns Hopkins Univ, HLTCOE, Baltimore, MD 21218 USA
[6] Tencent AI Lab, Shenzhen, Peoples R China
[7] Xiaomi Corp, Beijing, Peoples R China
[8] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[9] KE Holdings Inc, Beijing, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
corpus; forced alignment; segmentation; speech recognition
DOI
10.21437/Interspeech.2021-1965
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
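The filtering/validation stage described in the abstract amounts to a WER threshold check between the original transcript and the forced-alignment output for each segment. A minimal sketch of that idea is shown below; the function and field names are illustrative assumptions, not code from the actual GigaSpeech pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def filter_segments(segments, max_wer):
    """Keep only segments whose hypothesis stays within the WER cap."""
    return [s for s in segments
            if word_error_rate(s["ref"], s["hyp"]) <= max_wer]
```

Under this sketch, the 10,000-hour XL subset would use `max_wer=0.04`, while the smaller training subsets would use `max_wer=0.0`, matching the caps stated in the abstract.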
Pages: 3670-3674
Page count: 5