ASR Bundestag: A Large-Scale Political Debate Dataset in German

Cited by: 0
Authors
Wirth, Johannes [1]
Peinl, Rene [1]
Affiliation
[1] Univ Appl Sci Hof, Alfons Goppel Pl 1, D-95028 Hof, Germany
Source
INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 2, INTELLISYS 2023 | 2024 / Vol. 823
Keywords
Automatic Speech Recognition; Dataset; German
DOI
10.1007/978-3-031-47724-9_13
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present ASR Bundestag, a dataset for automatic speech recognition in German. It consists of 610 h of aligned audio-transcript pairs for supervised training and 1,038 h of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss the approaches used for the automated creation of speech datasets and assess the quality of the resulting dataset through evaluations and fine-tuning of a pre-trained state-of-the-art model. We make the dataset publicly available, including all subsets.
Pages: 190-202
Number of pages: 13