GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Cited by: 48
Authors
Chen, Guoguo [1 ,2 ]
Chai, Shuzhou [1 ,3 ]
Wang, Guanbo [1 ,3 ,4 ,5 ]
Du, Jiayu [1 ]
Zhang, Wei-Qiang [1 ,3 ]
Weng, Chao [6 ]
Su, Dan [6 ]
Povey, Daniel [7 ]
Trmal, Jan [4 ,5 ]
Zhang, Junbo [7 ]
Jin, Mingjie [6 ]
Khudanpur, Sanjeev [4 ,5 ]
Watanabe, Shinji [4 ,5 ,8 ]
Zhao, Shuaijiang [9 ]
Zou, Wei [9 ]
Li, Xiangang [9 ]
Yao, Xuchen [2 ]
Wang, Yongqing [7 ]
You, Zhao [6 ]
Yan, Zhiyong [7 ]
Affiliations
[1] SpeechColab, Beijing, Peoples R China
[2] Seasalt AI Inc, Bellevue, WA 98006 USA
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[4] Johns Hopkins Univ, CLSP, Baltimore, MD 21218 USA
[5] Johns Hopkins Univ, HLTCOE, Baltimore, MD 21218 USA
[6] Tencent AI Lab, Shenzhen, Peoples R China
[7] Xiaomi Corp, Beijing, Peoples R China
[8] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[9] KE Holdings Inc, Beijing, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
corpus; forced alignment; segmentation; speech recognition
DOI
10.21437/Interspeech.2021-1965
Abstract
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio are first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes: 10 h, 250 h, 1,000 h, 2,500 h, and 10,000 h. For the 10,000-hour XL training subset, the word error rate is capped at 4% during the filtering/validation stage; for all other, smaller training subsets, it is capped at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
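The filtering/validation step described in the abstract — keeping only segments whose word error rate against the reference transcript falls under a cap (4% for XL, 0% for the smaller subsets) — can be sketched as below. This is an illustrative sketch, not the authors' actual pipeline; the `wer` and `filter_segments` helpers and the segment dictionary format are hypothetical.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming table for the Levenshtein distance over words.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev = d[0]          # value of d[i-1][j-1] as we sweep j
        d[0] = i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,             # deletion
                      d[j - 1] + 1,         # insertion
                      prev + (rw != hw))    # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)


def filter_segments(segments, max_wer):
    """Keep segments whose decoded hypothesis stays within the WER cap."""
    return [s for s in segments if wer(s["ref"], s["hyp"]) <= max_wer]
```

For the smaller subsets a cap of `max_wer=0.0` retains only segments whose decoding output matches the aligned transcript exactly, while `max_wer=0.04` admits mildly noisy transcriptions into the XL subset.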
Pages: 3670-3674
Page count: 5