Improving Bi-encoder Document Ranking Models with Two Rankers and Multi-teacher Distillation

Cited by: 14
Authors
Choi, Jaekeol [1 ,2 ]
Jung, Euna [3 ]
Suh, Jangwon [3 ]
Rhee, Wonjong [4 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] Naver Corp, Seongnam Si, South Korea
[3] Seoul Natl Univ, GSCST, Seoul, South Korea
[4] Seoul Natl Univ, GSCST, GSAI, AIIS, Seoul, South Korea
Source
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval | 2021
Keywords
Information retrieval; neural ranking model; bi-encoder; knowledge distillation; multi-teacher distillation
DOI
10.1145/3404835.3463076
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
BERT-based Neural Ranking Models (NRMs) can be classified according to how the query and document are encoded through BERT's self-attention layers: bi-encoder versus cross-encoder. Bi-encoder models are highly efficient because all the documents can be pre-processed before query time, but their performance is inferior to that of cross-encoder models. Both models use a ranker that receives BERT representations as input and produces a relevance score as output. In this work, we propose a method in which multi-teacher distillation is applied to a cross-encoder NRM and a bi-encoder NRM to produce a bi-encoder NRM with two rankers. The resulting student bi-encoder achieves improved performance by simultaneously learning from a cross-encoder teacher and a bi-encoder teacher, and by combining the relevance scores from its two rankers. We call this method TRMD (Two Rankers and Multi-teacher Distillation). In the experiments, TwinBERT and ColBERT are considered as baseline bi-encoders. When monoBERT is used as the cross-encoder teacher, together with either TwinBERT or ColBERT as the bi-encoder teacher, TRMD produces a student bi-encoder that performs better than the corresponding baseline bi-encoder. For P@20, the maximum improvement was 11.4% and the average improvement was 6.8%. As an additional experiment, we considered producing cross-encoder students with TRMD and found that it could also improve the cross-encoders.
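The training recipe described in the abstract (a student bi-encoder whose two rankers are distilled from a cross-encoder teacher and a bi-encoder teacher, with the final relevance score combining both rankers' outputs) can be illustrated with a minimal PyTorch sketch. The interaction layer, loss terms, and weighting below are illustrative assumptions for exposition, not the authors' exact architecture or hyperparameters.

```python
# Minimal sketch of the two-ranker, multi-teacher distillation idea (TRMD)
# described in the abstract. Layer shapes, the concatenation interaction,
# and the loss weighting are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoRankerStudent(nn.Module):
    """Student bi-encoder head with two rankers over pre-computed query/doc vectors."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # One ranker is guided by the cross-encoder teacher, the other by the bi-encoder teacher.
        self.ranker_from_cross = nn.Linear(2 * hidden_size, 1)
        self.ranker_from_bi = nn.Linear(2 * hidden_size, 1)

    def forward(self, q_vec: torch.Tensor, d_vec: torch.Tensor):
        # q_vec, d_vec: representations produced by the student's BERT bi-encoder
        # (documents can be encoded offline, queries at query time).
        joint = torch.cat([q_vec, d_vec], dim=-1)
        score_cross = self.ranker_from_cross(joint).squeeze(-1)
        score_bi = self.ranker_from_bi(joint).squeeze(-1)
        return score_cross, score_bi


def trmd_loss(score_cross, score_bi, teacher_cross, teacher_bi, labels, alpha=0.5):
    """Ranking loss on the combined score plus distillation from both teachers."""
    distill = F.mse_loss(score_cross, teacher_cross) + F.mse_loss(score_bi, teacher_bi)
    ranking = F.binary_cross_entropy_with_logits(score_cross + score_bi, labels)
    return alpha * ranking + (1.0 - alpha) * distill


# Usage with random stand-ins for encoder outputs and teacher scores.
student = TwoRankerStudent()
q, d = torch.randn(8, 768), torch.randn(8, 768)
s_cross, s_bi = student(q, d)
loss = trmd_loss(
    s_cross, s_bi,
    teacher_cross=torch.randn(8),           # scores from the cross-encoder teacher (e.g. monoBERT)
    teacher_bi=torch.randn(8),              # scores from the bi-encoder teacher (e.g. ColBERT/TwinBERT)
    labels=torch.randint(0, 2, (8,)).float(),
)
loss.backward()
final_score = s_cross + s_bi                # at ranking time, combine the two rankers' scores
```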
Pages: 2192-2196
Number of pages: 5
References
14 in total
[1] Dehghani, Mostafa; Zamani, Hamed; Severyn, Aliaksei; Kamps, Jaap; Croft, W. Bruce. Neural Ranking Models with Weak Supervision. In SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 65-74.
[2] Fukuda, Takashi; Suzuki, Masayuki; Kurata, Gakuto; Thomas, Samuel; Cui, Jia; Ramabhadran, Bhuvana. Efficient Knowledge Distillation from an Ensemble of Teachers. In Proceedings of Interspeech 2017 (18th Annual Conference of the International Speech Communication Association), 2017, pp. 3697-3701.
[3] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[4] Hinton, Geoffrey. Distilling the Knowledge in a Neural Network. arXiv preprint, 2015.
[5] Hui, Kai; Yates, Andrew; Berberich, Klaus; de Melo, Gerard. Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval. In WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 279-287.
[6] Humeau, S. In Proceedings of the ICLR, 2020.
[7] Huston, S. Parameters learned in the comparison of retrieval models using term dependencies. 2014.
[8] Khattab, Omar; Zaharia, Matei. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), 2020, pp. 39-48.
[9] King, D. B. ACS Symposium Series, 2015, vol. 1214, p. 1. DOI: 10.1021/bk-2015-1214.ch001.
[10] Lu, Wenhao; Jiao, Jian; Zhang, Ruofei. TwinBERT: Distilling Knowledge to Twin-Structured Compressed BERT Models for Large-Scale Retrieval. In CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2645-2652.