InPars: Unsupervised Dataset Generation for Information Retrieval

Cited by: 46
Authors
Bonifacio, Luiz [1, 2, 3]
Abonizio, Hugo [1, 2]
Fadaee, Marzieh [1]
Nogueira, Rodrigo [1, 2, 3, 4]
Affiliations
[1] Zeta Alpha, Amsterdam, Netherlands
[2] NeuralMind, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Campinas, SP, Brazil
[4] Univ Waterloo, Waterloo, ON, Canada
Source
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22) | 2022
Funding
São Paulo Research Foundation, Brazil;
Keywords
Few-shot Models; Large Language Models; Generative Models; Question Generation; Synthetic Datasets; Multi-stage Ranking;
DOI
10.1145/3477495.3531863
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The Information Retrieval (IR) community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models [45, 56]. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models fine-tuned solely on our synthetic datasets outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
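The idea of using a few-shot language model as a synthetic data generator can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); the prompt layout, the function names `build_few_shot_prompt` and `generate_synthetic_pairs`, and the `generate` callable standing in for a language model are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def build_few_shot_prompt(examples: Iterable[Tuple[str, str]], document: str) -> str:
    """Format labeled (document, query) pairs, then an unlabeled document.

    The "Document:/Query:" layout is a hypothetical template, not the
    prompt used in the paper.
    """
    blocks = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    blocks.append(f"Document: {document}\nQuery:")
    return "\n\n".join(blocks)


def generate_synthetic_pairs(
    documents: Iterable[str],
    examples: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Ask a text generator for one query per document.

    `generate` is a placeholder for any language model call that maps a
    prompt string to a completion string.
    """
    pairs = []
    for doc in documents:
        prompt = build_few_shot_prompt(examples, doc)
        completion = generate(prompt).strip()
        # Keep only the first line of the completion as the query.
        query = completion.splitlines()[0].strip() if completion else ""
        if query:  # skip documents for which nothing was generated
            pairs.append((query, doc))
    return pairs
```

The resulting (query, document) pairs could then serve as positive training examples for a retrieval or reranking model; how they are filtered and used for fine-tuning is described in the paper itself, not in this sketch.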
Pages: 2387-2392
Number of pages: 6
Related papers
57 entries in total
[21]   Natural Questions: A Benchmark for Question Answering Research [J].
Kwiatkowski, Tom ;
Palomaki, Jennimaria ;
Redfield, Olivia ;
Collins, Michael ;
Parikh, Ankur ;
Alberti, Chris ;
Epstein, Danielle ;
Polosukhin, Illia ;
Devlin, Jacob ;
Lee, Kenton ;
Toutanova, Kristina ;
Jones, Llion ;
Kelcey, Matthew ;
Chang, Ming-Wei ;
Dai, Andrew M. ;
Uszkoreit, Jakob ;
Quoc Le ;
Petrov, Slav .
TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2019, 7 :453-466
[22]   Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations [J].
Lin, Jimmy ;
Ma, Xueguang ;
Lin, Sheng-Chieh ;
Yang, Jheng-Hong ;
Pradeep, Ronak ;
Nogueira, Rodrigo .
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, :2356-2362
[23]  
Liu Alisa, 2022, arXiv:2201.05955 [cs.CL]
[24]   Cascade Ranking for Operational E-commerce Search [J].
Liu, Shichen ;
Xiao, Fei ;
Ou, Wenwu ;
Si, Luo .
KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2017, :1557-1565
[25]  
Ma J, 2021, 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), P1075
[26]   WWW'18 Open Challenge: Financial Opinion Mining and Question Answering [J].
Maia, Macedo ;
Handschuh, Siegfried ;
Freitas, Andre ;
Davis, Brian ;
McDermott, Ross ;
Zarrouk, Manel ;
Balahur, Alexandra .
COMPANION PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2018 (WWW 2018), 2018, :1941-1942
[27]  
Meng Yu, 2022, arXiv:2202.04538 [cs.CL]
[28]  
Mohapatra B, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, P1190
[29]   A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models [J].
Mokrii, Iurii ;
Boytsov, Leonid ;
Braslavski, Pavel .
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, :2081-2085
[30]  
Neelakantan Arvind, 2022, arXiv:2201.10005