GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Cited by: 48
Authors
Chen, Guoguo [1 ,2 ]
Chai, Shuzhou [1 ,3 ]
Wang, Guanbo [1 ,3 ,4 ,5 ]
Du, Jiayu [1 ]
Zhang, Wei-Qiang [1 ,3 ]
Weng, Chao [6 ]
Su, Dan [6 ]
Povey, Daniel [7 ]
Trmal, Jan [4 ,5 ]
Zhang, Junbo [7 ]
Jin, Mingjie [6 ]
Khudanpur, Sanjeev [4 ,5 ]
Watanabe, Shinji [4 ,5 ,8 ]
Zhao, Shuaijiang [9 ]
Zou, Wei [9 ]
Li, Xiangang [9 ]
Yao, Xuchen [2 ]
Wang, Yongqing [7 ]
You, Zhao [6 ]
Yan, Zhiyong [7 ]
Affiliations
[1] SpeechColab, Beijing, Peoples R China
[2] Seasalt AI Inc, Bellevue, WA 98006 USA
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[4] Johns Hopkins Univ, CLSP, Baltimore, MD 21218 USA
[5] Johns Hopkins Univ, HLTCOE, Baltimore, MD 21218 USA
[6] Tencent AI Lab, Shenzhen, Peoples R China
[7] Xiaomi Corp, Beijing, Peoples R China
[8] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[9] KE Holdings Inc, Beijing, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
corpus; forced alignment; segmentation; speech recognition
DOI
10.21437/Interspeech.2021-1965
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
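The filtering/validation stage described in the abstract amounts to a WER threshold check between the original transcript and the forced-alignment output for each segment. A minimal sketch of that idea is shown below; the function and field names are illustrative assumptions, not code from the actual GigaSpeech pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def filter_segments(segments, max_wer):
    """Keep only segments whose hypothesis stays within the WER cap."""
    return [s for s in segments
            if word_error_rate(s["ref"], s["hyp"]) <= max_wer]
```

Under this sketch, the 10,000-hour XL subset would use `max_wer=0.04`, while the smaller training subsets would use `max_wer=0.0`, matching the caps stated in the abstract.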
Pages: 3670-3674
Page count: 5