End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding

Cited by: 30
Authors
Liu, Feng [1 ,2 ,3 ]
Guo, Huifeng [2 ]
Li, Xutao [1 ,3 ]
Tang, Ruiming [2 ]
Ye, Yunming [1 ,3 ]
He, Xiuqiang [2 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen, Peoples R China
[2] Noahs Ark Lab, Huawei, Peoples R China
[3] Harbin Inst Technol, Shenzhen Key Lab Internet Informat Collaborat, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM '20) | 2020
Funding
National Key R&D Program of China;
Keywords
Recommendation; Reinforcement Learning; End-to-End; Supervised Embedding;
DOI
10.1145/3336191.3371858
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Research on reinforcement learning (RL) based recommendation has become a hot topic in the recommendation community, driven by recent advances in interactive recommender systems. Existing RL recommendation approaches can be summarized into a unified framework with three components, namely the embedding component (EC), the state representation component (SRC) and the policy component (PC). We find that the EC cannot be trained well jointly with the other two components. Previous studies bypass this obstacle with a pre-train-and-fix strategy, which prevents their approaches from being trained in a truly end-to-end fashion. More importantly, such a pre-trained and fixed EC suffers from two inherent drawbacks: (1) pre-trained and fixed embeddings cannot model the evolving preferences of users and item correlations in a dynamic environment; (2) pre-training is inconvenient in industrial applications. To address this problem, we propose an End-to-end Deep Reinforcement learning based Recommendation framework (EDRR), in which a supervised learning signal is carefully designed to smooth the update gradients propagated to the EC, and three ways of incorporating this signal are introduced and compared. To the best of our knowledge, we are the first to address the training compatibility among the three components in RL based recommendation. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that EDRR effectively achieves end-to-end training for both policy-based and value-based RL models, and delivers better performance than state-of-the-art methods.
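To make the EC/SRC/PC decomposition concrete, the following is a minimal PyTorch-style sketch of a jointly trained three-component model with an auxiliary supervised (click-prediction) head whose gradient also reaches the embedding component. The GRU-based SRC, the dimensions, the placeholder RL loss, and all names here are illustrative assumptions, not the paper's exact EDRR architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EDRRSketch(nn.Module):
    """Illustrative three-component model: EC -> SRC -> PC, plus a
    supervised head that lets a supervised signal update the EC."""

    def __init__(self, num_items, emb_dim=32, state_dim=64, num_actions=100):
        super().__init__()
        # Embedding component (EC): trained jointly, not pre-trained and fixed.
        self.ec = nn.Embedding(num_items, emb_dim)
        # State representation component (SRC): here a GRU over the embeddings
        # of recently interacted items (an assumption for this sketch).
        self.src = nn.GRU(emb_dim, state_dim, batch_first=True)
        # Policy component (PC): outputs action scores (policy- or value-based).
        self.pc = nn.Linear(state_dim, num_actions)
        # Supervised head: predicts e.g. click probability for a candidate item,
        # providing the smoothing signal that also updates the EC.
        self.supervised_head = nn.Linear(state_dim + emb_dim, 1)

    def forward(self, history_items, candidate_item):
        emb = self.ec(history_items)           # (B, T, emb_dim)
        _, h = self.src(emb)                   # h: (1, B, state_dim)
        state = h.squeeze(0)                   # (B, state_dim)
        action_scores = self.pc(state)         # RL policy/value output
        cand = self.ec(candidate_item)         # (B, emb_dim)
        click_logit = self.supervised_head(torch.cat([state, cand], dim=-1))
        return action_scores, click_logit.squeeze(-1)

# One illustrative joint update: placeholder RL loss + supervised loss.
model = EDRRSketch(num_items=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

history = torch.randint(0, 1000, (8, 5))      # batch of interaction histories
candidate = torch.randint(0, 1000, (8,))      # candidate items
clicks = torch.randint(0, 2, (8,)).float()    # observed click labels

action_scores, click_logit = model(history, candidate)
rl_loss = -action_scores.log_softmax(dim=-1).mean()   # stand-in for the RL objective
sup_loss = F.binary_cross_entropy_with_logits(click_logit, clicks)

opt.zero_grad()
(rl_loss + sup_loss).backward()               # supervised gradient also reaches the EC
opt.step()
```

The design point this sketch illustrates is that both loss terms backpropagate through the same embedding table, so the embeddings keep adapting during interaction rather than being frozen after pre-training.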
Pages: 384-392
Number of pages: 9