Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application

Cited by: 118
Authors
Hu, Yujing [1 ]
Da, Qing [1 ]
Zeng, Anxiang [1 ]
Yu, Yang [2 ]
Xu, Yinghui [3 ]
Affiliations
[1] Alibaba Group, Hangzhou, Zhejiang, China
[2] Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing, Jiangsu, China
[3] Zhejiang Cainiao Supply Chain Management Co., Ltd., Artificial Intelligence Department, Hangzhou, Zhejiang, China
Source
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining | 2018
Keywords
reinforcement learning; online learning to rank; policy gradient; bandits
DOI
10.1145/3219819.3219846
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In E-commerce platforms such as Amazon and TaoBao, ranking items within a search session is a typical multi-step decision-making problem. Learning-to-rank (LTR) methods have been widely applied to such ranking problems, but they typically treat the ranking steps within a session as independent, whereas in practice these steps may be highly correlated. To better exploit this correlation, we propose using reinforcement learning (RL) to learn an optimal ranking policy that maximizes the expected accumulative rewards in a search session. First, we formally define the search session Markov decision process (SSMDP) to formulate the multi-step ranking problem. Second, we analyze the properties of the SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Finally, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which addresses the high reward variance and unbalanced reward distribution of an SSMDP. Experiments conducted both in simulation and in the TaoBao search engine show that our algorithm significantly outperforms state-of-the-art LTR methods, with more than 40% and 30% growth in total transaction amount in the simulation and the real application, respectively.
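The abstract names the key ingredients (a search session MDP, accumulative session rewards, and a policy-gradient learner that must cope with high reward variance) without implementation detail. Below is a minimal illustrative sketch in Python/NumPy of a generic REINFORCE-style session ranker with a running baseline for variance reduction. It is not the paper's algorithm; the toy user simulator, feature dimensions, and hyperparameters are all assumptions made for illustration.

```python
import numpy as np

# Minimal REINFORCE-style sketch of multi-step ranking in a search session.
# NOT the paper's algorithm: the user model, dimensions, and hyperparameters
# below are illustrative assumptions only.

rng = np.random.default_rng(0)

N_FEATURES = 8   # item feature dimension (assumed)
N_ITEMS = 20     # candidate items per ranking step (assumed)
MAX_STEPS = 5    # ranking decisions per session (assumed)

theta = np.zeros(N_FEATURES)  # linear scoring weights = ranking policy


def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()


def run_session(theta):
    """Roll out one search session; return per-step (items, probs, action, reward)."""
    trajectory = []
    for _ in range(MAX_STEPS):
        items = rng.normal(size=(N_ITEMS, N_FEATURES))  # candidate item features
        probs = softmax(items @ theta)                  # stochastic ranking policy
        a = rng.choice(N_ITEMS, p=probs)                # pick item to show first
        # Toy user model: purchase probability is a sigmoid of the first feature.
        reward = float(rng.random() < 1.0 / (1.0 + np.exp(-items[a, 0])))
        trajectory.append((items, probs, a, reward))
        if reward > 0:  # a purchase ends the session (multi-step episode)
            break
    return trajectory


baseline, lr, beta = 0.0, 0.05, 0.99
for episode in range(2000):
    traj = run_session(theta)
    G = sum(r for *_, r in traj)                   # accumulative session reward
    baseline = beta * baseline + (1 - beta) * G    # running baseline cuts variance
    for items, probs, a, _ in traj:
        # Gradient of log softmax policy: phi(a) - E_pi[phi]
        grad_log_pi = items[a] - probs @ items
        theta += lr * (G - baseline) * grad_log_pi
```

The baseline subtraction is the standard variance-reduction trick for policy gradients; the paper motivates going further because session rewards in an SSMDP are both high-variance and unbalanced across session lengths.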
Pages: 368-377 (10 pages)