Learning a Dual-Language Vector Space for Domain-Specific Cross-Lingual Question Retrieval

Cited by: 40
Authors
Chen, Guibin [1]
Chen, Chunyang [1]
Xing, Zhenchang [1]
Xu, Bowen [2]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
Source
2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE) | 2016
Keywords
Cross-lingual question retrieval; Word embeddings; Convolutional Neural Network; Dual-Language Vector Space;
DOI
10.1145/2970276.2970317
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Subject classification codes
081202; 0835;
Abstract
The language barrier limits the ability of millions of non-English-speaking developers to make effective use of the tremendous knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate the non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from two problems: semantic deviation due to inappropriate translation, especially of domain-specific terms, and the lexical gap between queries and questions that share few words in common. To overcome these issues, we propose a novel cross-lingual question retrieval approach based on word embeddings and a convolutional neural network (CNN), which are state-of-the-art deep learning techniques for capturing word- and sentence-level semantics. The CNN model is trained on a large number of examples drawn from Stack Overflow duplicate questions and their machine translations, which guides the CNN to learn informative word and sentence features for recognizing and quantifying semantic similarity in the presence of semantic deviations and lexical gaps. A distinctive feature of our approach is that the trained CNN can map documents in two languages (e.g., Chinese queries and English questions) into a dual-language vector space, and thus reduce the cross-lingual question retrieval problem to a simple k-nearest neighbors search problem in that space, where no query or question translation is required. Our evaluation shows that our approach significantly outperforms the translation-based method, and that it can be extended to dual-language document retrieval from different sources.
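The final retrieval step described in the abstract, a k-nearest neighbors search over vectors in a shared space, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which distance measure is used, so cosine similarity is an assumption here, and the function names (`cosine`, `k_nearest`), the question IDs, and the three-dimensional toy vectors are all hypothetical stand-ins for the output of a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_nearest(query_vec, question_vecs, k):
    """Return the IDs of the k questions most similar to the query vector."""
    scored = [(cosine(query_vec, vec), qid) for qid, vec in question_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [qid for _, qid in scored[:k]]


# Hypothetical, hand-written vectors standing in for model output:
# English questions and a non-English query embedded in the same space.
question_vecs = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.0, 1.0, 0.0],
    "q3": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

top2 = k_nearest(query_vec, question_vecs, k=2)
print(top2)
```

Because the query and the questions are assumed to live in one vector space, the search itself involves no translation step; a brute-force scan like this one would normally be replaced by an indexed nearest-neighbor search for a large question archive.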
Pages: 744-755
Page count: 12