End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding

Cited by: 30
Authors
Liu, Feng [1 ,2 ,3 ]
Guo, Huifeng [2 ]
Li, Xutao [1 ,3 ]
Tang, Ruiming [2 ]
Ye, Yunming [1 ,3 ]
He, Xiuqiang [2 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen, Peoples R China
[2] Noahs Ark Lab, Huawei, Peoples R China
[3] Harbin Inst Technol, Shenzhen Key Lab Internet Informat Collaborat, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM '20) | 2020
Funding
National Key R&D Program of China;
Keywords
Recommendation; Reinforcement Learning; End-to-End; Supervised Embedding;
DOI
10.1145/3336191.3371858
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Research on reinforcement learning (RL) based recommendation has become a hot topic in the recommendation community, driven by recent advances in interactive recommender systems. Existing RL recommendation approaches can be summarized into a unified framework with three components, namely the embedding component (EC), the state representation component (SRC) and the policy component (PC). We find that the EC cannot be trained well jointly with the other two components. Previous studies bypass this obstacle with a pre-train-and-fix strategy, which prevents their approaches from being trained in a truly end-to-end fashion. More importantly, such a pre-trained and fixed EC suffers from two inherent drawbacks: (1) pre-trained and fixed embeddings cannot model the evolving preferences of users and item correlations in a dynamic environment; (2) pre-training is inconvenient in industrial applications. To address this problem, we propose an End-to-end Deep Reinforcement learning based Recommendation framework (EDRR), in which a supervised learning signal is carefully designed to smooth the update gradients propagated to the EC, and three ways of incorporating this signal are introduced and compared. To the best of our knowledge, we are the first to address the training compatibility among the three components in RL based recommendation. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that EDRR effectively achieves end-to-end training for both policy-based and value-based RL models, and delivers better performance than state-of-the-art methods.
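To make the EC/SRC/PC decomposition concrete, the following is a minimal PyTorch-style sketch of a jointly trained three-component model with an auxiliary supervised (click-prediction) head whose gradient also reaches the embedding component. The GRU-based SRC, the dimensions, the placeholder RL loss, and all names here are illustrative assumptions, not the paper's exact EDRR architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EDRRSketch(nn.Module):
    """Illustrative three-component model: EC -> SRC -> PC, plus a
    supervised head that lets a supervised signal update the EC."""

    def __init__(self, num_items, emb_dim=32, state_dim=64, num_actions=100):
        super().__init__()
        # Embedding component (EC): trained jointly, not pre-trained and fixed.
        self.ec = nn.Embedding(num_items, emb_dim)
        # State representation component (SRC): here a GRU over the embeddings
        # of recently interacted items (an assumption for this sketch).
        self.src = nn.GRU(emb_dim, state_dim, batch_first=True)
        # Policy component (PC): outputs action scores (policy- or value-based).
        self.pc = nn.Linear(state_dim, num_actions)
        # Supervised head: predicts e.g. click probability for a candidate item,
        # providing the smoothing signal that also updates the EC.
        self.supervised_head = nn.Linear(state_dim + emb_dim, 1)

    def forward(self, history_items, candidate_item):
        emb = self.ec(history_items)           # (B, T, emb_dim)
        _, h = self.src(emb)                   # h: (1, B, state_dim)
        state = h.squeeze(0)                   # (B, state_dim)
        action_scores = self.pc(state)         # RL policy/value output
        cand = self.ec(candidate_item)         # (B, emb_dim)
        click_logit = self.supervised_head(torch.cat([state, cand], dim=-1))
        return action_scores, click_logit.squeeze(-1)

# One illustrative joint update: placeholder RL loss + supervised loss.
model = EDRRSketch(num_items=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

history = torch.randint(0, 1000, (8, 5))      # batch of interaction histories
candidate = torch.randint(0, 1000, (8,))      # candidate items
clicks = torch.randint(0, 2, (8,)).float()    # observed click labels

action_scores, click_logit = model(history, candidate)
rl_loss = -action_scores.log_softmax(dim=-1).mean()   # stand-in for the RL objective
sup_loss = F.binary_cross_entropy_with_logits(click_logit, clicks)

opt.zero_grad()
(rl_loss + sup_loss).backward()               # supervised gradient also reaches the EC
opt.step()
```

The design point this sketch illustrates is that both loss terms backpropagate through the same embedding table, so the embeddings keep adapting during interaction rather than being frozen after pre-training.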
Pages: 384-392
Number of pages: 9