Off-Policy Actor-critic for Recommender Systems

Cited by: 22
Authors
Chen, Minmin [1 ]
Xu, Can [2 ]
Gatto, Vince [2 ]
Jain, Devanshu [2 ]
Kumar, Aviral [1 ]
Chi, Ed [1 ]
Affiliations
[1] Google Research, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Source
PROCEEDINGS OF THE 16TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2022 | 2022
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME
DOI
10.1145/3523227.3546758
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that lead users to enjoy their long-term experience on the platform. Reinforcement learning (RL) naturally emerges as an appealing approach for its promise in 1) combating the feedback loop effect that results from myopic system behaviors, and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and content items, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data; the algorithm partially mitigates the distribution shift through one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the learned target policy through temporal-difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery in our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
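A hedged reconstruction of the two updates the abstract contrasts (the notation below is ours, not quoted from the paper): the REINFORCE agent of [3] learns from data logged under a behavior policy \beta, re-weighting each policy-gradient term with a one-step importance ratio, while the actor-critic extension replaces the empirical long-term reward R_t with a critic Q_\phi(s, a) fit by temporal-difference learning:

\nabla_\theta J \approx \mathbb{E}_{(s,a) \sim \beta}\!\left[ \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)} \, R_t \, \nabla_\theta \log \pi_\theta(a \mid s) \right],
\qquad
Q_\phi(s, a) \leftarrow r + \gamma \, \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s')}\!\left[ Q_\phi(s', a') \right].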
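A minimal PyTorch sketch of such an agent follows. It is an illustration under assumptions, not the authors' implementation: state_dim, num_items, the learning rates, the discount gamma, and the clipping cap on the importance ratio are made-up values, and the production system works with learned user-state representations over a far larger corpus.

```python
# Hedged sketch of an off-policy actor-critic update on logged data.
# All dimensions, names, and hyperparameters are illustrative assumptions,
# not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, num_items, gamma = 32, 1000, 0.97

# Actor: softmax policy over the item corpus, as in the REINFORCE agent [3].
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_items))
# Critic: Q(s, a) for every item, fit by temporal-difference learning.
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_items))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, beta_prob):
    """One batch update from logged tuples (s, a, r, s') collected by the
    behavior policy; beta_prob holds the logged propensities beta(a|s)."""
    a = a.unsqueeze(-1)

    # Critic: TD(0) target bootstraps with the *learned* policy's expected value.
    with torch.no_grad():
        pi_next = F.softmax(actor(s_next), dim=-1)
        v_next = (pi_next * critic(s_next)).sum(-1)        # E_{a'~pi}[Q(s',a')]
        td_target = r + gamma * v_next
    q_sa = critic(s).gather(-1, a).squeeze(-1)
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: one-step importance-weighted policy gradient, with the critic's
    # Q(s, a) in place of the empirical return used by REINFORCE.
    log_pi = F.log_softmax(actor(s), dim=-1).gather(-1, a).squeeze(-1)
    with torch.no_grad():
        rho = (log_pi.exp() / beta_prob).clamp(max=10.0)   # pi(a|s) / beta(a|s), clipped
        q_val = critic(s).gather(-1, a).squeeze(-1)
    actor_loss = -(rho * q_val * log_pi).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

A training loop would call update(s, a, r, s_next, beta_prob) on batches of logged tuples, with a of dtype torch.long. Clipping rho is a common variance-control choice; how optimistic or pessimistic the critic is about items unseen in the logs is exactly the OOD balance the abstract flags.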
Pages: 338-349
Number of pages: 12