Reinforcement Learning to Rank with Pairwise Policy Gradient

Cited by: 27
Authors
Xu, Jun [1,2]
Wei, Zeng [3]
Xia, Long [4]
Lan, Yanyan [5]
Yin, Dawei [3]
Cheng, Xueqi [5]
Wen, Ji-Rong [1,2]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
[2] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[3] Baidu Inc, Beijing, Peoples R China
[4] York Univ, Sch Informat Technol, York, N Yorkshire, England
[5] Inst Comp Technol, CAS Key Lab Network Data Sci & Technol, Beijing, Peoples R China
Source
Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20) | 2020
Funding
National Natural Science Foundation of China;
Keywords
Learning to rank; reinforcement learning; policy gradient;
DOI
10.1145/3397271.3401148
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
This paper concerns reinforcement learning (RL) of document ranking models for information retrieval (IR). One branch of RL approaches to ranking formalizes the ranking process as a Markov decision process (MDP) and determines the model parameters with policy gradient. Despite preliminary successes, these approaches are still far from achieving their full potential. Existing policy gradient methods directly use the absolute performance scores (returns) of the sampled document lists in their gradient estimates, which causes two limitations: 1) they fail to reflect the relative goodness of documents within the same query, which is central to the nature of IR ranking; 2) they produce high-variance gradient estimates, resulting in slow learning and low ranking accuracy. To address these issues, we propose a novel policy gradient algorithm in which the gradients are determined using pairwise comparisons of two document lists sampled for the same query. The algorithm, referred to as Pairwise Policy Gradient (PPG), repeatedly samples pairs of document lists, estimates the gradients via pairwise comparisons, and updates the model parameters accordingly. Theoretical analysis shows that PPG produces unbiased, low-variance gradient estimates. Experimental results demonstrate performance gains over state-of-the-art baselines in search result diversification and text retrieval.
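The pairwise gradient estimation described in the abstract can be sketched with a toy Plackett-Luce ranking policy. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the linear scoring model, the DCG reward, and the 0.5 weighting of the return gap are illustrative choices. What the sketch does show is the core PPG idea, namely that two lists sampled for the same query are compared, so the gradient depends on their relative rather than absolute returns.

```python
import numpy as np

def sample_ranking(scores, rng):
    """Sample a document ordering from a Plackett-Luce policy over
    `scores`; also return d log p(ordering) / d scores (REINFORCE term)."""
    remaining = list(range(len(scores)))
    perm, grad = [], np.zeros(len(scores))
    while remaining:
        logits = scores[remaining]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        k = rng.choice(len(remaining), p=p)
        perm.append(remaining[k])
        grad[remaining[k]] += 1.0          # chosen doc gets 1 - p_k overall
        for j, d in enumerate(remaining):
            grad[d] -= p[j]                # softmax term for every candidate
        remaining.pop(k)
    return perm, grad

def dcg(perm, rel):
    """Discounted cumulative gain of a document ordering."""
    return sum(rel[d] / np.log2(i + 2) for i, d in enumerate(perm))

def ppg_gradient(w, X, rel, rng):
    """One PPG-style estimate: sample TWO lists for the same query and
    weight the log-prob gradient difference by their return gap, so any
    query-level baseline cancels and only relative goodness matters."""
    scores = X @ w
    perm_a, grad_a = sample_ranking(scores, rng)
    perm_b, grad_b = sample_ranking(scores, rng)
    delta = dcg(perm_a, rel) - dcg(perm_b, rel)
    return X.T @ (0.5 * delta * (grad_a - grad_b))

# Toy query: feature 0 equals the graded relevance, so a linear model
# can realize the ideal ranking.
rng = np.random.default_rng(0)
relevance = np.array([3.0, 2.0, 0.0, 1.0, 0.0])
X = np.column_stack([relevance, rng.normal(size=5), rng.normal(size=5)])
w = np.zeros(3)
for _ in range(2000):
    w += 0.05 * ppg_gradient(w, X, relevance, rng)

final_order = np.argsort(-(X @ w))  # greedy ranking under the learned model
```

The estimator stays unbiased because the two lists are sampled independently: the cross terms (return of one list times log-prob gradient of the other) vanish in expectation, while differencing the returns removes a shared per-query offset and so lowers variance, which mirrors the paper's motivation.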
Pages: 509-518
Page count: 10
Related Papers
44 in total
[1] Burges C.J.C., 2010, Technical Report, Microsoft Research
[2] Burges C., 2005, ICML, P89
[3] Cao Z., 2007, Proceedings of the 24th International Conference on Machine Learning, P129
[4] Carbonell J., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P335, DOI 10.1145/290941.291025
[5] Carmel D., 2010, Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), P911
[6] Clarke C.L.A., 2008, Proceedings of the 31st Annual International ACM SIGIR Conference, P659, DOI 10.1145/1390334.1390446
[7] Crammer K., 2002, Advances in Neural Information Processing Systems, V14, P641
[8] Dang V., 2012, Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), P65, DOI 10.1145/2348283.2348296
[9] Feng J., Li H., Huang M., Liu S., Ou W., Wang Z., Zhu X., 2018, Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning, Proceedings of the World Wide Web Conference (WWW 2018), P1939-1948
[10] Feng Y., Xu J., Lan Y., Guo J., Zeng W., Cheng X., 2018, From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks, Proceedings of SIGIR 2018, P125-134