Automatic Construction of Discourse Corpora for Dialogue Translation

Cited by: 0
Authors
Wang, Longyue [1]
Zhang, Xiaojun [1]
Tu, Zhaopeng [2]
Way, Andy [1]
Liu, Qun [1]
Affiliations
[1] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
[2] Huawei Technol, Noah Ark Lab, Chengdu, Peoples R China
Keywords
Discourse Corpus; Dialogue; Machine Translation; Information Retrieval; Movie Script; Movie Subtitle;
DOI
Not available
Chinese Library Classification
H [Languages, Linguistics];
Subject classification code
05;
Abstract
This paper proposes a novel approach to automatically constructing a parallel discourse corpus for dialogue machine translation. First, parallel subtitle data and the corresponding monolingual movie script data are crawled from the Internet. Then tags such as speaker and discourse boundary are projected from the script data onto the subtitle data via an information retrieval approach, mapping monolingual discourse annotations to bilingual texts. We evaluate the mapping results and also integrate speaker information into translation. Experiments show that the proposed method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, respectively, and that speaker-based language model adaptation improves translation quality by around 0.5 BLEU points. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.
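The projection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the retrieval model, so the TF-IDF cosine matching, the similarity threshold, the function names, and the toy script and subtitle lines below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a line and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(token_lists):
    """Smoothed inverse document frequency over the script utterances."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; tokens unseen in the script get weight 0."""
    tf = Counter(tokens)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def project_speakers(script, subtitles, threshold=0.5):
    """Attach to each subtitle line the speaker of its best-matching script line.

    script: list of (speaker, utterance) pairs from the monolingual script.
    subtitles: list of source-side subtitle lines.
    threshold: hypothetical cut-off below which no speaker is assigned.
    """
    script_tokens = [tokenize(utterance) for _, utterance in script]
    idf = build_idf(script_tokens)
    script_vecs = [tfidf(tokens, idf) for tokens in script_tokens]

    projected = []
    for line in subtitles:
        query = tfidf(tokenize(line), idf)
        scores = [cosine(query, vec) for vec in script_vecs]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        if best is not None and scores[best] >= threshold:
            projected.append((script[best][0], line))
        else:
            projected.append((None, line))  # leave unmatched lines untagged
    return projected


# Toy data, invented for this example.
script = [
    ("ANNA", "Where did you put the car keys?"),
    ("BEN", "I left them on the kitchen table."),
]
subtitles = [
    "Where did you put the keys?",
    "I left them on the table.",
    "Completely unrelated words here.",
]
for speaker, line in project_speakers(script, subtitles):
    print(speaker, "|", line)
```

Because the subtitle data is parallel, a speaker tag attached to a source-side subtitle line carries over to its aligned translation. A practical system would probably also exploit the ordering of lines in the script and the subtitle file rather than matching each line independently as this sketch does.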
Pages: 2748-2754
Page count: 7
Related papers
50 in total
  • [31] Investigating Explicitation of Discourse Connectives in Translation using Automatic Annotations
    Yung, Frances
    Scholman, Merel C. J.
    Lapshinova-Koltunski, Ekaterina
    Pollklaesener, Christina
    Demberg, Vera
    24TH MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE, SIGDIAL 2023, 2023, : 21 - 30
  • [32] Towards and Return: Global Dialogue and Discourse Construction of Chinese Literary Theory
    Qing, Yang
    INTERDISCIPLINARY STUDIES OF LITERATURE, 2023, 7 (02): : 242 - 256
  • [33] Corpora and LSP translation
    Kubler, Natalie
    CORPORA IN TRANSLATOR EDUCATION, 2003, : 25 - 42
  • [34] Using corpora in discourse analysis
    Duguid, Alison
    TLS-THE TIMES LITERARY SUPPLEMENT, 2006, (5407): : 32 - 32
  • [35] Using corpora in discourse analysis
    Jeyapal, Daphne
    DISCOURSE STUDIES, 2008, 10 (02) : 271 - 273
  • [36] Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
    Antonio Jimeno Yepes
    Élise Prieur-Gaston
    Aurélie Névéol
    BMC Bioinformatics, 14
  • [37] Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
    Yepes, Antonio Jimeno
    Prieur-Gaston, Elise
    Neveol, Aurelie
    BMC BIOINFORMATICS, 2013, 14
  • [38] Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
    Jimeno Yepes, Antonio
    Prieur-Gaston, Élise
    Névéol, Aurélie
    BMC Bioinformatics, 2013, 14
  • [39] Synset expansion on translation graph for automatic wordnet construction
    Ercan, Gonenc
    Haziyev, Farid
    INFORMATION PROCESSING & MANAGEMENT, 2019, 56 (01) : 130 - 150
  • [40] A Brief Survey of Textual Dialogue Corpora
    Oliveira, Hugo Goncalo
    Ferreira, Patricia
    Martins, Daniel
    Silva, Catarina
    Alves, Ana
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1264 - 1274