End-to-End Query Term Weighting

Cited by: 0
Authors
Samel, Karan [1 ]
Li, Cheng [2 ]
Kong, Weize [2 ]
Chen, Tao [2 ]
Zhang, Mingyang [2 ]
Gupta, Shaleen [2 ]
Khadanga, Swaraj [2 ]
Xu, Wensong [2 ]
Wang, Xingyu [2 ]
Kolipaka, Kashyap [2 ]
Bendersky, Michael [2 ]
Najork, Marc [2 ]
Affiliations
[1] Georgia Tech, Atlanta, GA 30332 USA
[2] Google, Seattle, WA USA
Source
PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023 | 2023
Keywords
Information Retrieval; Query Weighting; Language Models
DOI
10.1145/3580305.3599815
CLC number
TP [Automation and computer technology]
Subject classification code
0812
Abstract
Bag-of-words-based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently, deep learning methods have shown promising results in improving retrieval performance, but they are expensive to run online, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict weights for individual n-gram query input terms (e.g., uni-grams and bi-grams). These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs, we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep-learning-based search methods would require further infrastructure optimization and additional hardware. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets.
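As a rough illustration of the idea described in the abstract, the sketch below scores a query against two documents with a BM25-style function in which each query term's contribution is scaled by a learned weight, and trains those weights with a pairwise ranking loss. This is a minimal sketch, not the authors' code: all names, the toy statistics, and the use of PyTorch are assumptions, and the free per-term weight vector stands in for the weights that TW-BERT actually predicts with a BERT encoder.

```python
# Minimal sketch (assumed, not the authors' implementation): learned per-term
# weights feeding a BM25-style scorer, trained with a pairwise ranking loss.
import torch

K1, B = 1.2, 0.75           # standard BM25 hyperparameters
AVG_DOC_LEN = 50.0          # assumed corpus statistic

def bm25_score(term_weights, tf, idf, doc_len):
    """BM25 over query terms, with each term's contribution scaled by a learned weight."""
    norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / AVG_DOC_LEN))
    return torch.sum(term_weights * idf * norm)

# Toy example: a 3-term query, one relevant and one non-relevant document.
idf     = torch.tensor([2.0, 0.5, 1.5])        # per-term inverse document frequency
tf_pos  = torch.tensor([3.0, 1.0, 2.0])        # term frequencies in the relevant doc
tf_neg  = torch.tensor([0.0, 4.0, 1.0])        # term frequencies in the non-relevant doc
weights = torch.ones(3, requires_grad=True)    # stand-in for TW-BERT's predicted weights

optimizer = torch.optim.SGD([weights], lr=0.1)
for _ in range(100):
    s_pos = bm25_score(weights, tf_pos, idf, doc_len=40.0)
    s_neg = bm25_score(weights, tf_neg, idf, doc_len=60.0)
    loss = torch.relu(1.0 - (s_pos - s_neg))   # pairwise hinge ranking loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(weights.detach())  # terms that discriminate the relevant doc end up with higher weight
```

Because the BM25-style scorer is differentiable with respect to the term weights, the ranking loss can propagate gradients back into whatever model produces those weights, which is the end-to-end property the abstract describes; at serving time only the weighted terms need to be handed to the existing lexical retriever.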
Pages: 4778-4786
Page count: 9
  • [10] Bendersky M, 2012, SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P941, DOI 10.1145/2348283.2348408