AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Cited by: 26
Authors
Fu, Yihui [1]
Cheng, Luyao [1]
Lv, Shubo [1]
Jv, Yukai [1]
Kong, Yuxiang [1]
Chen, Zhuo [2]
Hu, Yanxin [1]
Xie, Lei [1]
Wu, Jian [3]
Bu, Hui [4]
Xu, Xin [4]
Du, Jun [5]
Chen, Jingdong [1]
Affiliations
[1] Northwestern Polytech Univ, Xian, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
[3] Microsoft Corp, Beijing, Peoples R China
[4] Beijing Shell Shell Technol Co Ltd, Beijing, Peoples R China
[5] Univ Sci & Technol China, Hefei, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
AISHELL-4; speech front-end processing; speech recognition; speaker diarization; conference scenario; Mandarin; CORPUS; NOISY;
DOI
10.21437/Interspeech.2021-1397
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. It aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics of conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Meanwhile, accurate transcriptions and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization to multi-modality modeling and joint optimization of relevant tasks. Given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.
Pages: 3665-3669
Page count: 5