Off-Policy Actor-critic for Recommender Systems

Cited by: 22
Authors
Chen, Minmin [1 ]
Xu, Can [2 ]
Gatto, Vince [2 ]
Jain, Devanshu [2 ]
Kumar, Aviral [1 ]
Chi, Ed [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google Inc, Mountain View, CA USA
Source
PROCEEDINGS OF THE 16TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2022 | 2022
Keywords
reinforcement learning; batch RL; off-policy actor-critic; pessimism; recommender systems; REINFORCEMENT; GO; GAME;
DOI
10.1145/3523227.3546758
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Industrial recommendation platforms are increasingly concerned with making recommendations that lead users to enjoy their long-term experience on the platform. Reinforcement learning emerged naturally as an appealing approach for its promise in 1) combating the feedback loop effect resulting from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and content items, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data; the algorithm partially mitigates the distribution shift through one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the learned target policy through temporal-difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
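The abstract's core recipe, a softmax actor updated with a one-step importance weight against the logged behavior policy, plus a critic trained by temporal-difference learning, can be illustrated with a minimal tabular sketch. This is a hedged toy example, not the paper's implementation: the state/action sizes, uniform behavior policy, and synthetic logged tuples are all hypothetical, and the actual agent uses neural networks over a large item corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

theta = np.zeros((n_states, n_actions))  # actor: softmax policy logits
Q = np.zeros((n_states, n_actions))      # critic: tabular Q(s, a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical logged data from a uniform behavior policy beta:
# tuples of (state, action, beta_prob, reward, next_state).
logged = [(rng.integers(n_states), rng.integers(n_actions),
           1.0 / n_actions, rng.random(), rng.integers(n_states))
          for _ in range(500)]

alpha_q, alpha_pi = 0.1, 0.05
for s, a, beta_prob, r, s2 in logged:
    pi = softmax(theta[s])
    # Critic: TD(0) target uses the expected next value under the
    # *target* policy, so Q tracks the learned policy, not beta.
    td_target = r + gamma * softmax(theta[s2]) @ Q[s2]
    Q[s, a] += alpha_q * (td_target - Q[s, a])
    # Actor: one-step importance weight corrects for sampling from beta.
    w = pi[a] / beta_prob
    advantage = Q[s, a] - pi @ Q[s]   # baseline V(s) under the target policy
    grad_log = -pi                    # d log pi(a|s) / d logits
    grad_log[a] += 1.0
    theta[s] += alpha_pi * w * advantage * grad_log
```

The critic replaces the Monte Carlo return used by the REINFORCE agent of [3] with a bootstrapped TD estimate, which is where the extra robustness to distribution shift described in the abstract comes from.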
Pages: 338-349 (12 pages)
Related Papers
50 items total
  • [31] SOFT ACTOR-CRITIC ALGORITHM WITH ADAPTIVE NORMALIZATION
    Gao, Xiaonan
    Wu, Ziyi
    Zhu, Xianchao
    Cai, Lei
    JOURNAL OF NONLINEAR FUNCTIONAL ANALYSIS, 2025, 2025
  • [32] Better Exploration with Optimistic Actor-Critic
    Ciosek, Kamil
    Quan Vuong
    Loftin, Robert
    Hofmann, Katja
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [33] Twin Delayed Hierarchical Actor-Critic
    Anca, Mihai
    Studley, Matthew
    2021 7TH INTERNATIONAL CONFERENCE ON AUTOMATION, ROBOTICS AND APPLICATIONS (ICARA 2021), 2021, : 221 - 225
  • [34] THE MINIMUM VALUE STATE PROBLEM IN ACTOR-CRITIC NETWORKS
    Velasquez, Alvaro
    Alkhouri, Ismail R.
    Bissey, Brett
    Barak, Lior
    Atia, George K.
    2022 IEEE 32ND INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2022,
  • [35] A Novel Hierarchical Soft Actor-Critic Algorithm for Multi-Logistics Robots Task Allocation
    Tang, Hengliang
    Wang, Anqi
    Xue, Fei
    Yang, Jiaxin
    Cao, Yang
    IEEE ACCESS, 2021, 9 : 42568 - 42582
  • [36] An actor-critic model of saccade adaptation
    Manabu Inaba
    Tadashi Yamazaki
    BMC Neuroscience, 14 (Suppl 1)
  • [37] Genetic Network Programming with Actor-Critic
    Hatakeyama, Hiroyuki
    Mabu, Shingo
    Hirasawa, Kotaro
    Hu, Jinglu
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2007, 11 (01) : 79 - 86
  • [38] An Adaptive Threshold for the Canny Edge With Actor-Critic Algorithm
    Choi, Keong-Hun
    Ha, Jong-Eun
    IEEE ACCESS, 2023, 11 : 67058 - 67069
  • [39] Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms
    Jia, Yanwei
    Zhou, Xun Yu
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [40] An Actor-Critic Reinforcement Learning Approach for Energy Harvesting Communications Systems
    Masadeh, Ala'eddin
    Wang, Zhengdao
    Kamal, Ahmed E.
    2019 28TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND NETWORKS (ICCCN), 2019,