MuST-C: a Multilingual Speech Translation Corpus

Cited by: 0

Authors:

Di Gangi, Mattia Antonino ^{[1,2]}

Cattoni, Roldano ^{[1]}

Bentivogli, Luisa ^{[1]}

Negri, Matteo ^{[1]}

Turchi, Marco ^{[1]}

Affiliations:

[1] Fdn Bruno Kessler, Povo, Italy

[2] Univ Trento, Trento, Italy

Source:

2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1 | 2019

Keywords:

SEQUENCE;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current research on spoken language translation (SLT) has to confront the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with strong baseline systems on each language direction.


Pages: 2012-2017

Page count: 6

50 records in total

[1] MuST-C: A multilingual corpus for end-to-end speech translation
Cattoni, Roldano
Di Gangi, Mattia Antonino
Bentivogli, Luisa
Negri, Matteo
Turchi, Marco
COMPUTER SPEECH AND LANGUAGE, 2021, 66
[2] The Multilingual TEDx Corpus for Speech Recognition and Translation
Salesky, Elizabeth
Wiesner, Matthew
Bremerman, Jacob
Cattoni, Roldano
Negri, Matteo
Turchi, Marco
Oard, Douglas W.
Post, Matt
INTERSPEECH 2021, 2021, : 3655 - 3659
[3] CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
Jia, Ye
Ramanovich, Michelle Tadmor
Wang, Quan
Zen, Heiga
LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6691 - 6703
[4] CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
Wang, Changhan
Pino, Juan
Wu, Anne
Gu, Jiatao
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 4197 - 4203
[5] EUROPARL-ST: A MULTILINGUAL CORPUS FOR SPEECH TRANSLATION OF PARLIAMENTARY DEBATES
Iranzo-Sanchez, Javier
Albert Silvestre-Cerda, Joan
Jorge, Javier
Rosello, Nahuel
Gimenez, Adria
Sanchis, Albert
Civera, Jorge
Juan, Alfons
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8229 - 8233
[6] Euronews: a multilingual speech corpus for ASR
Gretter, Roberto
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 2635 - 2638
[7] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Anwar, Mohamed
Shi, Bowen
Goswami, Vedanuj
Hsu, Wei-Ning
Pino, Juan
Wang, Changhan
INTERSPEECH 2023, 2023, : 4064 - 4068
[8] The Multilingual Student Translation corpus: a resource for translation teaching and research
Sylviane Granger
Marie-Aude Lefer
Language Resources and Evaluation, 2020, 54 : 1183 - 1199
[9] The Multilingual Student Translation corpus: a resource for translation teaching and research
Granger, Sylviane
Lefer, Marie-Aude
LANGUAGE RESOURCES AND EVALUATION, 2020, 54 (04) : 1183 - 1199
[10] Development and Application of Multilingual Speech Translation
Nakamura, Satoshi
ORIENTAL COCOSDA 2009 - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2009, : 9 - 12
