Context-based Translation for the Out of Vocabulary Words Applied to Hindi-English Cross-Lingual Information Retrieval

Cited by: 11
Authors
Sharma, Vijay Kumar [1 ]
Mittal, Namita [1 ]
Vidyarthi, Ankit [2 ]
Affiliations
[1] Malaviya Natl Inst Technol, Dept Comp Sci & Engn, Jaipur, Rajasthan, India
[2] Jaypee Inst Informat Technol, Noida, India
Keywords
Neural machine translation; Out of vocabulary words; Parallel corpus; Recurrent neural network; Statistical machine translation; Word embedding;
DOI
10.1080/02564602.2020.1843553
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Cross-Lingual Information Retrieval (CLIR) gives users the flexibility to query in their regional (source) language regardless of the language of the target documents. CLIR relies on the prevailing translation techniques, Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT and NMT achieve good results for resource-rich foreign languages but not for Indian languages, owing to the incompleteness of the available parallel corpora. Source-language user queries may contain Out Of Vocabulary (OOV) words that are absent from the parallel corpus; SMT may skip such words without translating them. In this paper, a context-based translation algorithm is proposed to translate OOV words by exploiting two large, unlabeled, and mutually unrelated raw corpora (one in the source language and one in the target language) together with a small bilingual parallel corpus. Since, according to the literature, SMT outperforms NMT for Hindi-to-English translation, experimental results are evaluated on the FIRE datasets against a baseline SMT system. The proposed algorithm improves Recall by up to 6.04% (0.8785) on FIRE 2010 and up to 3.96% (0.7365) on FIRE 2011, and Mean Average Precision (MAP) by up to 14.37% (0.3239) on FIRE 2010 and up to 5.46% (0.1988) on FIRE 2011, compared with the baseline SMT, which achieves Recall of 0.8284 (FIRE 2010) and 0.7084 (FIRE 2011), and MAP of 0.2832 (FIRE 2010) and 0.1885 (FIRE 2011). An analysis of the number of OOV words shows that the proposed algorithm reduces the OOV count more effectively, by up to 0.81% for FIRE 2010 and 1.73% for FIRE 2011.
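The abstract describes translating OOV query words by combining monolingual word embeddings (trained on the raw corpora) with a small bilingual lexicon (extracted from the parallel corpus). The sketch below illustrates one common realization of this idea, not the authors' exact algorithm: an OOV source word is mapped to the translation of its nearest in-vocabulary neighbour in embedding space. The toy embeddings, the words, and the helper `translate_oov` are all hypothetical, for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Toy source-language (Hindi, romanized) embeddings; in practice these
# would be trained on the large unlabeled source-language raw corpus.
src_emb = {
    "vidyalay": [0.90, 0.10, 0.00],   # OOV word ("school")
    "pathshala": [0.85, 0.15, 0.05],  # in-vocabulary near-synonym
    "nadi": [0.00, 0.90, 0.20],       # unrelated word ("river")
}

# Small bilingual lexicon, as would be extracted from the parallel corpus.
bilingual = {"pathshala": "school", "nadi": "river"}

def translate_oov(word):
    """Translate a source word; for OOV words, back off to the translation
    of the most similar in-vocabulary word in the source embedding space."""
    if word in bilingual:
        return bilingual[word]
    neighbours = sorted(
        (w for w in src_emb if w != word and w in bilingual),
        key=lambda w: cosine(src_emb[word], src_emb[w]),
        reverse=True,
    )
    return bilingual[neighbours[0]] if neighbours else None

print(translate_oov("vidyalay"))  # → school
```

In a full system, the candidate target word would additionally be re-ranked by its context similarity to the rest of the query, using the target-language raw corpus; the sketch keeps only the nearest-neighbour back-off step.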
Pages: 276-285
Page count: 10