Automatic Construction of Discourse Corpora for Dialogue Translation

Authors
Wang, Longyue [1]
Zhang, Xiaojun [1]
Tu, Zhaopeng [2]
Way, Andy [1]
Liu, Qun [1]
Affiliations
[1] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
[2] Huawei Technol, Noah Ark Lab, Chengdu, Peoples R China
Keywords
Discourse Corpus; Dialogue; Machine Translation; Information Retrieval; Movie Script; Movie Subtitle
DOI
Not available
Chinese Library Classification number
H [Language, Linguistics]
Discipline classification code
05
Abstract
This paper proposes a novel approach to automatically constructing a parallel discourse corpus for dialogue machine translation. First, parallel subtitle data and the corresponding monolingual movie script data are crawled from the Internet. Then, tags such as speaker and discourse boundary are projected from the script data onto the subtitle data via an information retrieval approach, mapping monolingual discourse annotations to bilingual texts. We evaluate the mapping results and also integrate speaker information into translation. Experiments show that the proposed method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, respectively, and that speaker-based language model adaptation improves translation quality by around 0.5 BLEU points. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.
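The projection step described in the abstract can be illustrated with a small retrieval sketch. This is not the authors' implementation: the abstract does not say which retrieval model is used, so the code below assumes TF-IDF weighting with cosine similarity. The function names (`project_speakers` and its helpers), the 0.3 threshold, the speaker names, and the toy utterances are all hypothetical. Each subtitle source line is treated as a query against the script utterances, and the speaker tag of the best match is copied over only if the match is strong enough.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized script utterances."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the script get weight 0."""
    tf = Counter(tokens)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def project_speakers(script, subtitles, threshold=0.3):
    """Copy speaker tags from script utterances onto parallel subtitle pairs.

    script:    list of (speaker, utterance) in the script language
    subtitles: list of (source_line, target_line) parallel pairs
    Returns a list of (speaker_or_None, source_line, target_line).
    The threshold is an assumed cut-off, not a value from the paper.
    """
    script_tokens = [tokenize(utt) for _, utt in script]
    idf = build_idf(script_tokens)
    script_vecs = [tfidf(toks, idf) for toks in script_tokens]

    tagged = []
    for src, tgt in subtitles:
        query = tfidf(tokenize(src), idf)
        scores = [cosine(query, vec) for vec in script_vecs]
        speaker = None
        if scores:
            best = max(range(len(scores)), key=scores.__getitem__)
            if scores[best] >= threshold:
                speaker = script[best][0]
        tagged.append((speaker, src, tgt))
    return tagged


# Toy data invented for illustration; target strings are placeholders.
script = [
    ("ALICE", "Where did you put the keys?"),
    ("BOB", "I left them on the kitchen table."),
    ("ALICE", "Thanks, see you tonight."),
]
subtitles = [
    ("Where did you put the keys?", "<target 1>"),
    ("I left them on the table.", "<target 2>"),
    ("Completely unrelated line about weather.", "<target 3>"),
]
tagged = project_speakers(script, subtitles)
for row in tagged:
    print(row)
```

Leaving a line untagged when no script utterance clears the threshold is one possible design choice: subtitles often paraphrase or omit script lines, so forcing a match on every line would likely add noisy speaker labels.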
Pages: 2748-2754
Number of pages: 7