MuST-C: a Multilingual Speech Translation Corpus

Authors
Di Gangi, Mattia Antonino [1, 2]
Cattoni, Roldano [1]
Bentivogli, Luisa [1]
Negri, Matteo [1]
Turchi, Marco [1]
Affiliations
[1] Fondazione Bruno Kessler, Povo, Italy
[2] University of Trento, Trento, Italy
Source
2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1 | 2019
Keywords
SEQUENCE
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Current research on spoken language translation (SLT) has to contend with the scarcity of sizeable, publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with a strong baseline system on each language direction.
Pages: 2012-2017
Number of pages: 6