DynamicRetriever: A Pre-trained Model-based IR System Without an Explicit Index

Cited by: 14
Authors
Zhou, Yu-Jia [1]
Yao, Jing [1]
Dou, Zhi-Cheng [1]
Wu, Ledell [2]
Wen, Ji-Rong [1]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing 100872, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Information retrieval (IR); document retrieval; model-based IR; pre-trained language model; differentiable search index
DOI
10.1007/s11633-022-1373-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have proven effective at improving search quality, especially ranking quality. These existing search methods follow a common paradigm, index-retrieve-rerank: they first build an index of all documents based on document terms (a sparse inverted index) or representation vectors (a dense vector index), then retrieve candidate documents and rerank them with ranking models according to the similarity between the query and the documents. In this paper, we explore a new paradigm of information retrieval that uses only a pre-trained model and no explicit index. All of the knowledge of the documents is instead encoded into model parameters, which can be regarded as a differentiable indexer and optimized in an end-to-end manner. Specifically, we propose a pre-trained model-based information retrieval (IR) system called DynamicRetriever, which directly returns document identifiers for a given query. Under this framework, we implement two variants to explore how to train the model from scratch and how to combine the advantages of dense retrieval models. Compared with existing search methods, the model-based IR system parameterizes the traditional static index with a pre-trained model, which converts the document semantic mapping into a dynamic and updatable process. Extensive experiments conducted on the public search benchmark Microsoft machine reading comprehension (MS MARCO) verify the effectiveness and potential of the proposed new paradigm for information retrieval.
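The contrast drawn in the abstract can be illustrated with a toy sketch. This is not the DynamicRetriever architecture: the corpus, the function names, and the bag-of-words softmax classifier are all assumptions chosen to keep the example small. It only shows the difference between looking documents up in an explicit inverted index and training parameters that map a query directly to document identifiers.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus (docid -> text), for illustration only.
DOCS = {
    "d0": "neural ranking models for web search",
    "d1": "inverted index compression techniques",
    "d2": "pre-trained language models for retrieval",
}

def tokenize(text):
    return text.lower().split()

# Paradigm 1: build an explicit sparse inverted index, then score by term overlap.
def build_inverted_index(docs):
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            index[term].add(docid)
    return index

def index_retrieve(index, query):
    scores = defaultdict(int)
    for term in tokenize(query):
        for docid in index.get(term, ()):
            scores[docid] += 1
    # Highest overlap first; ties broken by docid for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Paradigm 2: no explicit index; document knowledge is stored in parameters W.
class ToyModelRetriever:
    """Softmax classifier over docids with bag-of-words inputs (an assumption,
    standing in for the pre-trained language model used in the paper)."""

    def __init__(self, docids):
        self.docids = sorted(docids)
        self.W = {d: defaultdict(float) for d in self.docids}

    def _probs(self, terms):
        logits = [sum(self.W[d][t] for t in terms) for d in self.docids]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def train(self, pairs, epochs=50, lr=0.5):
        # Gradient descent on cross-entropy loss over (text, docid) pairs.
        for _ in range(epochs):
            for text, target in pairs:
                terms = tokenize(text)
                probs = self._probs(terms)
                for d, p in zip(self.docids, probs):
                    grad = p - (1.0 if d == target else 0.0)
                    for t in terms:
                        self.W[d][t] -= lr * grad

    def retrieve(self, query):
        probs = self._probs(tokenize(query))
        ranked = sorted(zip(self.docids, probs), key=lambda x: -x[1])
        return [d for d, _ in ranked]

index = build_inverted_index(DOCS)
model = ToyModelRetriever(DOCS.keys())
model.train([(text, docid) for docid, text in DOCS.items()])

print(index_retrieve(index, "inverted index"))
print(model.retrieve("language retrieval"))
```

In the second variant, updating the corpus means further training of the parameters rather than rebuilding an index, which is a rough analogue of the "dynamic and updatable" mapping the abstract mentions.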
Pages: 276-288
Number of pages: 13