Why is a document relevant? Understanding the relevance scores in cross-lingual document retrieval

Cited by: 10
Authors
Novak, Erik [1, 2]
Bizjak, Luka [1]
Mladenic, Dunja [1, 2]
Grobelnik, Marko [1]
Affiliations
[1] Jozef Stefan Inst, Jamova Cesta 39, Ljubljana 1000, Slovenia
[2] Jozef Stefan Int Postgrad Sch, Jamova Cesta 39, Ljubljana 1000, Slovenia
Funding
European Union Horizon 2020
Keywords
Cross-lingual information retrieval; Language model; Optimal transport; Result interpretability; Natural language processing;
DOI
10.1016/j.knosys.2022.108545
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Modern cross-lingual document retrieval models can find documents relevant to a query, but they cannot explain why a document is relevant. This paper proposes a novel learning-to-rank model named LM-EMD that uses the multilingual BERT language model and Earth Mover's Distance (EMD) to measure a document's relevance to the input query and to provide interpretable insights into why the document is relevant. The model uses the contextual embeddings of the query and document tokens, generated with multilingual BERT, to measure their distances in the embedding space; EMD then uses these distances to calculate the document's relevance score and to identify which document tokens contribute the most to its relevance. We evaluate the model on five language pairs of varying degrees of similarity and analyze its performance. We find that the model (1) performs similarly to the best-performing comparison model on high-resource languages, (2) is less effective on low-resource languages, and (3) provides insight into why a document is relevant to the query. © 2022 The Author(s). Published by Elsevier B.V.
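The abstract's pipeline (token embeddings, pairwise distances, then EMD) can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written 2-D vectors stand in for multilingual BERT embeddings, and the cosine distance, the uniform token masses, the use of SciPy's `linprog` as the transport solver, and the per-token cost share `token_cost` are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def cosine_distances(Q, D):
    """Pairwise cosine distance between query rows and document rows."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return 1.0 - Qn @ Dn.T


def emd(cost, q_mass, d_mass):
    """Solve the transport linear program; return (distance, flow matrix)."""
    n, m = cost.shape
    # Each query token i must ship out exactly q_mass[i] (row sums).
    A_rows = np.kron(np.eye(n), np.ones((1, m)))
    # Each document token j must receive exactly d_mass[j] (column sums).
    A_cols = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([A_rows, A_cols]),
        b_eq=np.concatenate([q_mass, d_mass]),
        bounds=(0, None),
        method="highs",
    )
    return float(res.fun), res.x.reshape(n, m)


# Toy stand-ins for contextual embeddings (assumption: real inputs would be
# multilingual BERT token vectors, which are much higher-dimensional).
Q = np.array([[1.0, 0.0], [0.0, 1.0]])               # 2 query tokens
D = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # 3 document tokens

# Assumption: every token carries equal mass.
q_mass = np.full(len(Q), 1.0 / len(Q))
d_mass = np.full(len(D), 1.0 / len(D))

cost = cosine_distances(Q, D)
distance, flow = emd(cost, q_mass, d_mass)

# Assumed attribution: share of the transport cost landing on each document
# token. A lower share means the token sits closer to the query.
token_cost = (flow * cost).sum(axis=0)

print("EMD:", distance)
print("cost share per document token:", token_cost)
```

In this sketch a smaller distance means a more relevant document, and the document token with the smallest cost share is the one best matched by the query. The optimal flow need not be unique, so the individual shares can differ between solvers even when the distance is the same.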
Pages: 12