Survey of Deep Reinforcement Learning Based on Value Function and Policy Gradient

Cited by: 0
Authors
Liu J.-W. [1 ]
Gao F. [1 ]
Luo X.-L. [1 ]
Affiliations
[1] Department of Automation, China University of Petroleum, Beijing
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2019 / Vol. 42 / No. 06
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; Deep reinforcement learning; Machine learning; Policy gradient; Reinforcement learning; Value function;
DOI
10.11897/SP.J.1016.2019.01406
CLC Number
Subject Classification Number
Abstract
As a hot research topic in the field of artificial intelligence, Deep Reinforcement Learning (DRL) has attracted increasing attention since it was proposed. At present, DRL can solve many problems that were previously difficult to solve, such as learning to play video games directly from raw pixels and learning control strategies for robotics problems. DRL builds an autonomous system with a higher-level understanding of the visual world through continuous optimization of the control strategy. Among DRL methods, those based on value functions and policy gradients are the core basic methods and the research focus. This paper systematically elaborates and summarizes these two classes of DRL methods, including their solution algorithms and network structures. Firstly, DRL methods based on value functions are summarized, including Deep Q-Network (DQN) and improved methods based on DQN. DQN is a pioneering work in the field of DRL; the model trains a Convolutional Neural Network (CNN) with a variant of Q-learning. Before the emergence of DQN, instability or even divergence arose when the action-value function in Reinforcement Learning (RL) was approximated by a neural network. To solve this problem, DQN uses two techniques: the experience replay mechanism and the target network. According to their different emphases, improved versions of DQN can be divided into four categories: improvements to the training algorithm, improvements to the neural network structure, introduction of new learning mechanisms, and improvements based on newly proposed RL algorithms. The research motivation, overall idea, advantages and disadvantages, application scope, and performance of each DQN improvement are elaborated in detail. Then the concept and common algorithms of the policy gradient are introduced. Policy gradient algorithms are widely used for RL problems in continuous spaces. Their main idea is to parameterize the policy, compute the gradient of the expected return with respect to the policy parameters, and adjust the parameters continuously along the direction of the gradient until the optimal policy is gradually obtained. Common policy gradient algorithms include the REINFORCE algorithm and the Actor-Critic algorithm. DRL methods based on the policy gradient are then summarized, including Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), Asynchronous Advantage Actor-Critic (A3C), and their improved variants. Drawing on DQN techniques, DDPG adopts the experience replay mechanism and a separate target network to reduce the correlation between data and to increase the stability and robustness of the algorithm. TRPO addresses the problem of selecting an appropriate step size by introducing a trust-region constraint defined by the Kullback-Leibler divergence, so as to ensure that each policy update moves in a direction that improves performance. A3C uses a conceptually simple and lightweight DRL framework and optimizes the deep neural network controller using asynchronous gradient descent. Next, AlphaGo and Alpha Zero, which represent advanced research achievements of DRL, are summarized, and the relationship between the latter and the two classes of DRL methods covered in this paper is analyzed. Some common experimental platforms for DRL algorithms are then introduced, including ALE, OpenAI Gym, RLLab, MuJoCo, and TORCS. Finally, future research directions of DRL are discussed. © 2019, Science Press. All rights reserved.
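The abstract credits DQN's stability to two techniques: experience replay and a separate target network. The minimal sketch below, written in PyTorch, illustrates how a single DQN update typically combines the two. It is not the implementation from the surveyed papers; the network architecture, buffer capacity, loss, and hyperparameters are illustrative assumptions.

```python
# Minimal DQN update sketch (illustrative only, not the surveyed papers' code).
# Network size, buffer capacity, and hyperparameters below are assumptions.
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = QNet(obs_dim, n_actions)                # online network, updated every step
target_net = QNet(obs_dim, n_actions)           # separate target network
target_net.load_state_dict(q_net.state_dict())  # periodically synced, never trained directly
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                   # experience replay buffer of (s, a, r, s', done)

def dqn_update(batch_size=32):
    if len(replay) < batch_size:
        return
    # Uniform sampling from the buffer breaks the temporal correlation of transitions.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = zip(*batch)
    s    = torch.tensor(s, dtype=torch.float32)
    a    = torch.tensor(a, dtype=torch.int64)
    r    = torch.tensor(r, dtype=torch.float32)
    s2   = torch.tensor(s2, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    # Q(s, a) from the online network.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Bootstrapped target uses the frozen target network for stability.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

On the policy-gradient side, the REINFORCE estimator referred to in the abstract ascends $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$ with the update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $G_t$ is the return from step $t$ and $\alpha$ is the learning rate; Actor-Critic methods replace $G_t$ with a learned value estimate to reduce variance.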
Pages: 1406-1438
Number of pages: 32