A deep actor critic reinforcement learning framework for learning to rank

Cited by: 15
Authors
Padhye, Vaibhav [1 ]
Lakshmanan, Kailasam [1 ]
Affiliations
[1] IIT BHU, Dept Comp Sci, Varanasi, India
Keywords
Reinforcement learning; Learning to Rank; Deep reinforcement learning; Policy gradient
DOI
10.1016/j.neucom.2023.126314
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose a deep reinforcement learning based approach for the learning-to-rank task. Reinforcement learning has been applied to ranking with good success, but existing policy gradient based approaches suffer from noisy gradients and high variance, resulting in unstable learning. Monte Carlo policy gradient methods such as REINFORCE sample trajectories randomly, which leads to high variance. Moreover, as the action space grows, i.e., with a very large number of documents, traditional RL techniques lack the model complexity required to deal with so many items. To address these issues, we combine deep learning with the reinforcement learning framework: deep neural networks provide strong function approximation, so our approach can learn a complex ranking function. We use an actor-critic framework in which the critic network reduces variance through techniques such as clipped double Q-learning and delayed policy updates. Also, because of the enormous scale of the web, the most relevant results must be returned for the corresponding query from within a large action space. Policy gradient algorithms with deep neural networks have been applied effectively to problems with large action spaces (items), since they do not rely on estimating a value for each action (item) as value-based methods do. Further, we use an actor network with a CNN layer in the ranking process to capture sequential patterns among the documents. We train our reinforcement learning agent with the TD3 method and a listwise loss function; TD3 performs delayed policy updates, yielding value estimates with lower variance. To the best of our knowledge, this is the first deep reinforcement learning method applied to learning to rank for document retrieval. We performed experiments on several LETOR datasets and show that our method outperforms various state-of-the-art baselines. © 2023 Elsevier B.V. All rights reserved.
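The following is a minimal sketch (not the authors' code) of the TD3-style update the abstract describes: twin critics with clipped double Q-learning, target policy smoothing, and delayed policy updates. Network sizes, dimensions, and hyperparameters are illustrative assumptions; the paper's CNN actor and listwise loss are not reproduced here.

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 46, 8  # hypothetical sizes (e.g., a LETOR feature vector)

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

actor, actor_t = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM, ACTION_DIM)
critic1, critic2 = mlp(STATE_DIM + ACTION_DIM, 1), mlp(STATE_DIM + ACTION_DIM, 1)
critic1_t, critic2_t = mlp(STATE_DIM + ACTION_DIM, 1), mlp(STATE_DIM + ACTION_DIM, 1)
for tgt, src in ((actor_t, actor), (critic1_t, critic1), (critic2_t, critic2)):
    tgt.load_state_dict(src.state_dict())  # targets start as copies

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)
GAMMA, TAU, POLICY_DELAY, NOISE_STD, NOISE_CLIP = 0.99, 0.005, 2, 0.2, 0.5

def td3_update(step, state, action, reward, next_state, done):
    # One update on a batch of transitions (all arguments are 2-D tensors).
    with torch.no_grad():
        # Target policy smoothing: clipped noise on the target action.
        noise = (torch.randn_like(action) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
        next_action = (actor_t(next_state) + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: the smaller of the twin target values.
        sa_next = torch.cat([next_state, next_action], dim=1)
        q_next = torch.min(critic1_t(sa_next), critic2_t(sa_next))
        target = reward + GAMMA * (1.0 - done) * q_next

    sa = torch.cat([state, action], dim=1)
    critic_loss = (nn.functional.mse_loss(critic1(sa), target)
                   + nn.functional.mse_loss(critic2(sa), target))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update: refresh actor and targets every POLICY_DELAY steps.
    if step % POLICY_DELAY == 0:
        actor_loss = -critic1(torch.cat([state, actor(state)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        with torch.no_grad():  # Polyak averaging of the target networks
            for tgt, src in ((actor_t, actor), (critic1_t, critic1), (critic2_t, critic2)):
                for tp, sp in zip(tgt.parameters(), src.parameters()):
                    tp.mul_(1.0 - TAU).add_(TAU * sp)

Taking the minimum of the twin target critics counters the overestimation bias of a single Q-network, and updating the actor only every POLICY_DELAY steps lets the value estimates settle first, which is the variance-reduction effect the abstract attributes to delayed policy updates.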
Pages: 11