Echo: A Crowd-sourced Romanian Speech Dataset

Cited by: 0
Authors
Ungureanu, Remus-Dan [1]
Dascalu, Mihai [1, 2]
Affiliations
[1] Natl Univ Sci & Technol Politehn Bucharest, 313 Splaiul Independentei,Sect 6, Bucharest, Romania
[2] Acad Romanian Scientists, Str Ilfov 3, Bucharest 050044, Romania
Keywords
speech dataset; Romanian language; crowd-sourcing; AUTOMATIC SPEECH
DOI
10.55612/s-5002-062-009
Chinese Library Classification (CLC) number
G40 [Education]
Discipline classification codes
040101; 120403
Abstract
Romanian is the seventh most popular European language, with around 30 million speakers worldwide. Despite its popularity, the available speech resources are limited. As a result, there are few models that transcribe Romanian well, most of them being multilingual models that also cover less popular languages. Echo is a crowd-sourcing platform that has collected more than 300 hours of speech from various contributors. In this study, we document how a large speech dataset enables researchers to train automatic speech recognition, speaker verification, and diarization models to automatically process students' notes. We publicly release both the dataset and the Whisper-based baseline model as open source.
Pages: 141-152
Page count: 12