Model-free Reinforcement Learning with Stochastic Reward Stabilization for Recommender Systems

Cited by: 0
Authors
Cai, Tianchi [1]
Bao, Shenliao [1]
Jiang, Jiyan [2]
Zhou, Shiji [2]
Zhang, Wenpeng [1]
Gu, Lihong [1]
Gu, Jinjie [1]
Zhang, Guannan [1]
Affiliations
[1] Ant Grp, Hangzhou, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023 | 2023
Keywords
Recommender System; Reinforcement Learning;
DOI
10.1145/3539618.3592022
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Model-free RL-based recommender systems have recently received increasing research attention due to their capability to handle partial feedback and long-term rewards. However, most existing research has ignored a critical feature of recommender systems: a user's feedback on the same item at different times is random. This stochastic reward property differs essentially from classic RL scenarios with deterministic rewards, which makes RL-based recommender systems much more challenging. In this paper, we first demonstrate in a simulator environment that using direct stochastic feedback results in a significant drop in performance. Then, to handle the stochastic feedback more efficiently, we design two stochastic reward stabilization frameworks that replace the direct stochastic feedback with feedback learned by a supervised model. Both frameworks are model-agnostic, i.e., they can effectively utilize various supervised models. We demonstrate the superiority of the proposed frameworks over different RL-based recommendation baselines with extensive experiments on a recommendation simulator as well as an industrial-level recommender system.
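The general idea in the abstract, replacing a noisy observed reward with a supervised model's prediction before it reaches the RL learner, can be illustrated with a minimal sketch. This is not the paper's implementation: the class `MeanRewardModel`, the function `stabilize`, the click probability of 0.3, and the log format are all hypothetical, and a per-pair running mean stands in for whatever supervised model is actually used.

```python
import random
from collections import defaultdict


class MeanRewardModel:
    """Hypothetical stand-in for a supervised reward model.

    It predicts the empirical mean reward per (state, item) pair.
    """

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def fit(self, logs):
        # logs: iterable of (state, item, observed_reward) tuples
        for state, item, reward in logs:
            self.total[(state, item)] += reward
            self.count[(state, item)] += 1

    def predict(self, state, item):
        n = self.count[(state, item)]
        return self.total[(state, item)] / n if n else 0.0


def stabilize(logs, model):
    """Replace each observed stochastic reward with the model's prediction."""
    return [(s, a, model.predict(s, a)) for s, a, _ in logs]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
click_prob = 0.3  # assumed click probability for one (user, item) pair

# The same user sees the same item many times; each click is a Bernoulli draw.
logs = [("u1", "i1", float(rng.random() < click_prob)) for _ in range(1000)]

model = MeanRewardModel()
model.fit(logs)
stabilized = stabilize(logs, model)

raw_var = variance([r for _, _, r in logs])
stab_var = variance([r for _, _, r in stabilized])
print(f"raw reward variance:        {raw_var:.4f}")
print(f"stabilized reward variance: {stab_var:.4f}")
```

In this toy setting the stabilized rewards for a fixed (user, item) pair are constant, so an RL update that consumes them sees a far less noisy learning target than one that consumes the raw clicks.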
Pages: 2179-2183
Page count: 5