Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Cited by: 176
Authors
Lin, Jimmy [1]
Ma, Xueguang [1]
Lin, Sheng-Chieh [1]
Yang, Jheng-Hong [1]
Pradeep, Ronak [1]
Nogueira, Rodrigo [1]
Affiliations
[1] Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON, Canada
Source
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL | 2021
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Open-Source Search Engine; First-Stage Retrieval;
DOI
10.1145/3404835.3463238
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
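The three retrieval modes named in the abstract can be illustrated with a minimal, self-contained sketch. This is not Pyserini's API: the function names (`bm25_scores`, `dense_scores`, `hybrid_scores`), the toy corpus, the hand-written two-dimensional vectors standing in for transformer encodings, the BM25 parameter values, and the interpolation weight `alpha` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus; the texts are illustrative and not drawn from any test collection.
DOCS = {
    "d1": "sparse retrieval with bm25 scoring",
    "d2": "dense retrieval with transformer encoders",
    "d3": "hybrid retrieval combines sparse and dense scores",
}

# Hand-written vectors standing in for transformer-encoded representations.
DOC_VECS = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.6, 0.6]}


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: score each document with a basic BM25 formula
    over whitespace-tokenized bag-of-words representations."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: brute-force inner-product search over all vectors."""
    return {
        doc_id: sum(q * x for q, x in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval: linear combination of dense and sparse scores.
    The weight alpha is an assumed value, not a tuned one."""
    doc_ids = set(sparse) | set(dense)
    return {d: dense.get(d, 0.0) + alpha * sparse.get(d, 0.0) for d in doc_ids}


sparse = bm25_scores("sparse retrieval", DOCS)
dense = dense_scores([0.2, 0.8], DOC_VECS)  # hypothetical query encoding
hybrid = hybrid_scores(sparse, dense)

for name, scores in [("sparse", sparse), ("dense", dense), ("hybrid", hybrid)]:
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(name, ranking)
```

In this toy setup each mode ranks a different document first, which is the kind of complementary behavior that motivates combining the two signals. A real system would replace the brute-force loops with an inverted index and an approximate nearest-neighbor index.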
Pages: 2356-2362
Page count: 7