InPars: Unsupervised Dataset Generation for Information Retrieval

Cited by: 42
Authors
Bonifacio, Luiz [1, 2, 3]
Abonizio, Hugo [1, 2]
Fadaee, Marzieh [1]
Nogueira, Rodrigo [1, 2, 3, 4]
Affiliations
[1] Zeta Alpha, Amsterdam, Netherlands
[2] NeuralMind, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Campinas, SP, Brazil
[4] Univ Waterloo, Waterloo, ON, Canada
Source
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22) | 2022
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Few-shot Models; Large Language Models; Generative Models; Question Generation; Synthetic Datasets; Multi-stage Ranking;
DOI
10.1145/3477495.3531863
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The Information Retrieval (IR) community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit equally from a single dataset. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models [45, 56]. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models fine-tuned solely on our synthetic datasets outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
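The idea of using a few-shot language model as a synthetic data generator can be illustrated with a minimal sketch. The prompt layout, the function names `build_few_shot_prompt` and `generate_pairs`, and the `generate_fn` callable (a stand-in for any text-completion model) are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Callable, List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Assemble (document, query) demonstrations followed by the target
    document with an open query slot. The layout is a hypothetical one."""
    parts = []
    for i, (doc, query) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nDocument: {doc}\nRelevant Query: {query}\n")
    # The final block leaves the query empty for the model to complete.
    parts.append(
        f"Example {len(examples) + 1}:\nDocument: {document}\nRelevant Query:"
    )
    return "\n".join(parts)


def generate_pairs(
    documents: List[str],
    examples: List[Tuple[str, str]],
    generate_fn: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Produce synthetic (query, document) training pairs.

    generate_fn is a hypothetical callable that maps a prompt to a text
    completion; any language model API could be wrapped to fit it.
    """
    pairs = []
    for doc in documents:
        completion = generate_fn(build_few_shot_prompt(examples, doc))
        # Keep only the first line of the completion as the query.
        query = completion.strip().split("\n")[0].strip()
        if query:
            pairs.append((query, doc))
    return pairs
```

The resulting pairs could then serve as positive training examples for fine-tuning a retrieval or reranking model on the target domain.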
Pages: 2387-2392
Number of pages: 6
Related papers
57 in total
[1] Anaby-Tavor A, 2020, AAAI CONF ARTIF INTE, V34, P7383
[2] [Anonymous], KDD17 P 23 ACM
[3] Brown TB, 2020, ADV NEUR IN, V33
[4] Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval [J]. Chen, Ruey-Cheng; Gallagher, Luke; Blanco, Roi; Culpepper, J. Shane. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017: 445-454
[5] Craswell N., 2021, CoRR abs/2102.07662
[6] Deeper Text Understanding for IR with Contextual Neural Language Modeling [J]. Dai, Zhuyun; Callan, Jamie. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019: 985-988
[7] Demidenko GV, 2006, PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON BIOINFORMATICS OF GENOME REGULATION AND STRUCTURE, VOL 3, P43
[8] Data Augmentation for Low-Resource Neural Machine Translation [J]. Fadaee, Marzieh; Bisazza, Arianna; Monz, Christof. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017: 567-573
[9] Gao Luyu, 2021, ARXIV210805540
[10] Han Jesse Michael, 2021, ARXIV211005448