Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

Cited by: 20
Authors
Chen, Tao [1 ]
Zhang, Mingyang [1 ]
Lu, Jing [1 ]
Bendersky, Michael [1 ]
Najork, Marc [1 ]
Affiliations
[1] Google Research, Mountain View, CA 94043, USA
Source
ADVANCES IN INFORMATION RETRIEVAL, PT I | 2022, Vol. 13185
Keywords
deep retrieval; lexical retrieval; zero-shot learning; hybrid model
DOI
10.1007/978-3-030-99736-6_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Deep retrieval models based on pre-trained language models (e.g., BERT) have achieved superior performance over lexical retrieval models (e.g., BM25) on many passage retrieval tasks. However, limited work has been done on generalizing a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting. Our findings show that the performance of a deep retrieval model deteriorates significantly when the target domain is very different from the source domain the model was trained on. In contrast, lexical models are more robust across domains. We thus propose a simple yet effective framework to integrate lexical and deep retrieval models. Our experiments demonstrate that the two models are complementary, even when the deep model is weaker in the out-of-domain setting. The hybrid model obtains an average relative gain of 20.4% over the deep retrieval model and 9.54% over the lexical model on the three out-of-domain datasets.
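The abstract describes a framework that fuses lexical (e.g., BM25) and deep (BERT-based) retrieval scores but does not spell out the fusion mechanism here. The sketch below shows one common way to combine the two rankers, linear interpolation of min-max-normalized scores; the function names, the normalization choice, and the weight `lam` are illustrative assumptions, not the paper's confirmed method.

```python
# Minimal sketch of a lexical + deep hybrid retriever via score interpolation.
# The fusion weight `lam`, the min-max normalization, and all function names
# are illustrative assumptions; the paper's exact fusion scheme is not given
# in this abstract.

from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so lexical and deep scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    lexical: Dict[str, float],  # e.g., BM25 scores per passage id
    deep: Dict[str, float],     # e.g., dense (BERT-based) retrieval scores per passage id
    lam: float = 0.5,           # interpolation weight between the two retrievers
) -> Dict[str, float]:
    """Fuse normalized lexical and deep scores; a passage missing from one list scores 0 there."""
    lex_n, deep_n = min_max_normalize(lexical), min_max_normalize(deep)
    all_ids = set(lex_n) | set(deep_n)
    return {
        doc_id: lam * lex_n.get(doc_id, 0.0) + (1.0 - lam) * deep_n.get(doc_id, 0.0)
        for doc_id in all_ids
    }


# Usage: rank the union of candidates from both retrievers by the fused score.
fused = hybrid_scores({"p1": 12.3, "p2": 8.1}, {"p1": 0.71, "p3": 0.65}, lam=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # passages ordered by combined lexical + deep evidence
```

Rank-based fusion (e.g., Reciprocal Rank Fusion) is an alternative that avoids score normalization entirely; either variant illustrates why the two retrievers can be complementary even when one is weaker out of domain.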
Pages: 95-110
Page count: 16