Off-Policy Actor-critic for Recommender Systems

Cited by: 22
Authors
Chen, Minmin [1]
Xu, Can [2]
Gatto, Vince [2]
Jain, Devanshu [2]
Kumar, Aviral [1]
Chi, Ed [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Source
PROCEEDINGS OF THE 16TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2022 | 2022
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME;
DOI
10.1145/3523227.3546758
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that lead users to enjoy their long-term experience on the platform. Reinforcement learning emerges naturally as an appealing approach for its promise in 1) combating the feedback loop effect resulting from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and pieces of content, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data, and the algorithm partially mitigates the distribution shift through one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates, through temporal-difference learning, the value of any state-action pair under the learned target policy. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents that employ a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
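To make the setup described in the abstract concrete, below is a minimal, illustrative sketch of the kind of update it outlines: a softmax actor trained with a one-step importance-weighted policy gradient against a logging (behavior) policy, plus a critic that estimates state-action values under the learned target policy via temporal-difference learning. This is not the paper's production implementation; the linear parameterizations, hyperparameters, and names such as OffPolicyActorCritic and behavior_prob are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the paper's implementation) of an
# off-policy actor-critic step on logged data: softmax actor with a
# one-step importance weight, linear critic Q(s, a) trained by TD(0).
import numpy as np


class OffPolicyActorCritic:
    def __init__(self, state_dim, n_actions, gamma=0.97, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Linear actor/critic for readability; production agents would
        # use sequence models over user interaction histories.
        self.theta = rng.normal(scale=0.01, size=(state_dim, n_actions))  # actor
        self.w = rng.normal(scale=0.01, size=(state_dim, n_actions))      # critic
        self.gamma, self.lr = gamma, lr

    def policy(self, s):
        # Softmax policy over the action (item) catalog.
        logits = s @ self.theta
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def update(self, s, a, r, s_next, behavior_prob):
        """One update from a logged tuple (s, a, r, s'), given the
        behavior policy's probability of the logged action a."""
        pi = self.policy(s)
        rho = pi[a] / max(behavior_prob, 1e-6)   # one-step importance weight
        q, q_next = s @ self.w, s_next @ self.w
        # Critic: TD(0) target, with the next-state value taken in
        # expectation under the *target* (learned) policy.
        td_target = r + self.gamma * float(self.policy(s_next) @ q_next)
        td_error = td_target - q[a]
        self.w[:, a] += self.lr * rho * td_error * s
        # Actor: REINFORCE-style gradient of log pi(a|s) for a softmax
        # over linear logits, with the critic's Q(s, a) as the return.
        grad_logp = -np.outer(s, pi)
        grad_logp[:, a] += s
        self.theta += self.lr * rho * q[a] * grad_logp


# Toy usage on one synthetic logged interaction.
agent = OffPolicyActorCritic(state_dim=8, n_actions=16)
rng = np.random.default_rng(1)
s, s_next = rng.normal(size=8), rng.normal(size=8)
agent.update(s, a=3, r=1.0, s_next=s_next, behavior_prob=0.1)
```

One possible intuition for the pessimism observation in the abstract: because softmax probabilities sum to one, raising the probability of logged, in-distribution actions necessarily drives down the probability of unseen ones, so OOD actions can be suppressed more than their estimated values warrant.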
Pages: 338-349
Number of pages: 12
Related Papers
50 records in total
  • [1] SOAC: Supervised Off-Policy Actor-Critic for Recommender Systems
    Wu, Shiqing
    Xu, Guandong
    Wang, Xianzhi
    23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, ICDM 2023, 2023, : 14121 - 14626
  • [2] Off-Policy Actor-Critic with Emphatic Weightings
    Graves, Eric
    Imani, Ehsan
    Kumaraswamy, Raksha
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2023, 24
  • [3] Meta attention for Off-Policy Actor-Critic
    Huang, Jiateng
    Huang, Wanrong
    Lan, Long
    Wu, Dan
    NEURAL NETWORKS, 2023, 163 : 86 - 96
  • [4] Noisy Importance Sampling Actor-Critic: An Off-Policy Actor-Critic With Experience Replay
    Tasfi, Norman
    Capretz, Miriam
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [5] Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm
    Diddigi, Raghuram Bharadwaj
    Jain, Prateek
    Prabuchandran, K. J.
    Bhatnagar, Shalabh
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [6] Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors
    Duan, Jingliang
    Guan, Yang
    Li, Shengbo Eben
    Ren, Yangang
    Sun, Qi
    Cheng, Bo
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (11) : 6584 - 6598
  • [7] Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples With On-Policy Experiences
    Banerjee, Chayan
    Chen, Zhiyong
    Noman, Nasimul
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (03) : 3121 - 3129
  • [8] Supervised Advantage Actor-Critic for Recommender Systems
    Xin, Xin
    Karatzoglou, Alexandros
    Arapakis, Ioannis
    Jose, Joemon M.
    WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2022, : 1186 - 1196
  • [9] Finite-Sample Analysis of Off-Policy Natural Actor-Critic With Linear Function Approximation
    Chen, Zaiwei
    Khodadadian, Sajad
    Maguluri, Siva Theja
    IEEE CONTROL SYSTEMS LETTERS, 2022, 6 : 2611 - 2616
  • [10] Multi-agent off-policy actor-critic algorithm for distributed multi-task reinforcement learning
    Stankovic, Milos S.
    Beko, Marko
    Ilic, Nemanja
    Stankovic, Srdjan S.
    EUROPEAN JOURNAL OF CONTROL, 2023, 74