ASR Bundestag: A Large-Scale Political Debate Dataset in German

Cited by: 0

Authors:

Wirth, Johannes [1]

Peinl, Rene [1]

Affiliations:

[1] Univ Appl Sci Hof, Alfons Goppel Pl 1, D-95028 Hof, Germany

Source:

INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 2, INTELLISYS 2023 | 2024 / Vol. 823

Keywords:

Automatic Speech Recognition; Dataset; German

DOI:

10.1007/978-3-031-47724-9_13

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present ASR Bundestag, a dataset for automatic speech recognition in German. It consists of 610 h of aligned audio-transcript pairs for supervised training and 1,038 h of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss the approaches used for the automated creation of speech datasets and assess the quality of the resulting dataset through evaluations and fine-tuning of a pre-trained state-of-the-art model. We make the dataset publicly available, including all subsets.


Pages: 190-202

Page count: 13
