Ancient Korean Neural Machine Translation

Cited by: 15
Authors
Park, Chanjun [1]
Lee, Chanhee [1, 2]
Yang, Yeongwook [3]
Lim, Heuiseok [1]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea
[2] Amazon Alexa AI, Seattle, WA 98109 USA
[3] Univ Tartu, Inst Educ, Ctr Educ Technol, EE-50090 Tartu, Estonia
Funding
National Research Foundation of Singapore;
Keywords
Ancient Korean translation; neural machine translation; transformer; subword tokenization; share vocabulary and entity restriction byte pair encoding; MUMMIES;
DOI
10.1109/ACCESS.2020.3004879
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Translations of ancient languages can serve as source material for various digital media and can be helpful in fields such as natural phenomena, medicine, and science. Owing to these needs, there has been a global movement to translate ancient languages, but the work requires experts. Training language experts is difficult and, more importantly, manual translation is slow. Consequently, the recovery of ancient characters using machine translation has recently been investigated, but there is currently no literature on the machine translation of ancient Korean. This paper proposes the first ancient Korean neural machine translation model using a Transformer. The model can improve a translator's efficiency by quickly providing draft translations of untranslated ancient documents. Furthermore, a new subword tokenization method called Share Vocabulary and Entity Restriction Byte Pair Encoding is proposed based on the characteristics of ancient Korean sentences. The proposed method improves on conventional subword tokenization methods such as byte pair encoding by 5.25 BLEU points. In addition, decoding strategies such as n-gram blocking and ensemble models further improve performance by 2.89 BLEU points. The model has been made publicly available as a software application.
Pages: 116617-116625
Page count: 9