Sentence Level Alignment of Digitized Books Parallel Corpora

Times cited: 3
Authors
Laukaitis, Algirdas [1]
Plikynas, Darius [2]
Ostasius, Egidijus [1]
Affiliations
[1] Vilnius Gediminas Tech Univ, Fundamental Sci Fac, Vilnius, Lithuania
[2] Vilnius Univ, Inst Data Sci & Digital Technol, Vilnius, Lithuania
Keywords
alignment of corpora; alignment of digitized books; machine translation; natural language processing
DOI
10.15388/Informatica.2018.188
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we propose a framework for extracting translation memory from a corpus of fiction and non-fiction books. In recent years, there have been several proposals to align bilingual corpora and extract translation memory from legal and technical documents. Yet when it comes to aligning a corpus of translated fiction and non-fiction books, the existing alignment algorithms give low-precision results. To solve this low-precision problem, we propose a new method that combines existing alignment algorithms with a proactive learning approach. We define several feature functions that are used to build two classifiers, one for text filtering and one for alignment. We report results for the English-Lithuanian language pair on a bilingual corpus of 200 books. We demonstrate a significant improvement in alignment accuracy over currently available alignment systems.
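The abstract does not say which feature functions the authors define, so the following is only a minimal sketch of what feature functions over a candidate sentence pair might look like. The function names, the two features (character-length ratio and overlap of digit sequences), and the fixed thresholds are all assumptions made for illustration; in the paper the features feed trained classifiers, for which the threshold rule here is a stand-in.

```python
import re


def length_ratio(src: str, tgt: str) -> float:
    """Hypothetical feature: shorter/longer character length, in [0, 1]."""
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0
    return min(a, b) / max(a, b)


def shared_number_ratio(src: str, tgt: str) -> float:
    """Hypothetical feature: Jaccard overlap of digit sequences in both sentences."""
    s = set(re.findall(r"\d+", src))
    t = set(re.findall(r"\d+", tgt))
    if not s and not t:
        return 1.0  # no numbers on either side: nothing contradicts the pairing
    return len(s & t) / len(s | t)


def features(src: str, tgt: str) -> list[float]:
    """Feature vector for one candidate sentence pair."""
    return [length_ratio(src, tgt), shared_number_ratio(src, tgt)]


def keep_pair(src: str, tgt: str,
              min_length_ratio: float = 0.5,
              min_number_ratio: float = 0.5) -> bool:
    """Threshold rule standing in for a trained filtering classifier (assumption)."""
    lr, nr = features(src, tgt)
    return lr >= min_length_ratio and nr >= min_number_ratio
```

A call such as `keep_pair("He was born in 1990.", "Jis gimė 1990 metais.")` would retain the pair, whereas a short chapter heading paired with a long unrelated sentence would be filtered out by the length feature.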
Pages: 693-710
Page count: 18