A Simple yet Effective Framework for Active Learning to Rank

Cited by: 0
Authors
Qingzhong Wang
Haifang Li
Haoyi Xiong
Wen Wang
Jiang Bian
Yu Lu
Shuaiqiang Wang
Zhicong Cheng
Dejing Dou
Dawei Yin
Affiliations
[1] Baidu Incorporated
Source
Machine Intelligence Research | 2024, Vol. 21
Keywords
Search; information retrieval; learning to rank; active learning; query by committee
DOI
Not available
Abstract
While China has become the largest online market in the world, with approximately one billion internet users, Baidu runs the world’s largest Chinese search engine, serving hundreds of millions of daily active users and responding to billions of queries per day. To handle the diverse query requests from users at web scale, Baidu has made tremendous efforts in understanding users’ queries, retrieving relevant content from a pool of trillions of webpages, and ranking the most relevant webpages at the top of the results. Among the components used in Baidu search, learning to rank (LTR) plays a critical role, and an extremely large number of queries, together with their relevant webpages, must be labelled in a timely manner to train and update the online LTR models. To reduce the cost and time of query/webpage labelling, in this work we study the problem of active learning to rank (active LTR), which selects unlabeled queries for annotation and training. Specifically, we first investigate the criterion of ranking entropy (RE), which characterizes, via a query-by-committee (QBC) method, the entropy of relevant webpages under a query as scored by a sequence of online LTR models saved at different checkpoints. We then explore a new criterion, prediction variance (PV), which measures the variance of the prediction results over all relevant webpages under a query. Our empirical studies find that RE tends to favor low-frequency queries from the pool for labelling, while PV prioritizes high-frequency queries. Finally, we combine these two complementary criteria into a sample selection strategy for active learning. Extensive experiments with comparisons to baseline algorithms show that the proposed approach trains LTR models that achieve a higher discounted cumulative gain (a relative improvement of ΔDCG4 = 1.38%) with the same budgeted labelling effort.
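
The abstract describes the two selection criteria only at a high level. The sketch below shows, in Python/NumPy, one plausible way to compute a QBC-style ranking entropy and a prediction variance from the scores that a committee of LTR checkpoints assigns to a query's candidate webpages, and to combine the two for query selection. The top-1 vote entropy, the min-max normalization, and the mixing weight alpha are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def ranking_entropy(committee_scores):
    # committee_scores: (n_models, n_docs) array; each committee member
    # (an LTR checkpoint) scores every candidate webpage for one query.
    # Count how often each webpage is ranked first across the committee
    # and take the entropy of that vote distribution (an assumption;
    # the paper's RE may be defined over full rankings).
    n_models, n_docs = committee_scores.shape
    top1 = committee_scores.argmax(axis=1)
    votes = np.bincount(top1, minlength=n_docs) / n_models
    p = votes[votes > 0]
    return float(-(p * np.log(p)).sum())

def prediction_variance(committee_scores):
    # Mean per-webpage variance of the committee's scores for one query.
    return float(committee_scores.var(axis=0).mean())

def select_queries(per_query_scores, budget, alpha=0.5):
    # Rank unlabeled queries by a convex combination of min-max
    # normalized RE and PV; return the indices of the top `budget`
    # queries to send for annotation.
    re = np.array([ranking_entropy(s) for s in per_query_scores])
    pv = np.array([prediction_variance(s) for s in per_query_scores])
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    combined = alpha * norm(re) + (1.0 - alpha) * norm(pv)
    return np.argsort(-combined)[:budget]

Since RE favors low-frequency queries and PV favors high-frequency ones, the weight alpha (hypothetical here) would control the balance between the two populations in the selected batch.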
Pages: 169-183 (14 pages)