A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

Cited by: 2201

Authors:

Silver, David ^{[1,2]}

Hubert, Thomas ^{[1]}

Schrittwieser, Julian ^{[1]}

Antonoglou, Ioannis ^{[1]}

Lai, Matthew ^{[1]}

Guez, Arthur ^{[1]}

Lanctot, Marc ^{[1]}

Sifre, Laurent ^{[1]}

Kumaran, Dharshan ^{[1]}

Graepel, Thore ^{[1]}

Lillicrap, Timothy ^{[1]}

Simonyan, Karen ^{[1]}

Hassabis, Demis ^{[1]}

Affiliations:

[1] DeepMind, 6 Pancras Sq, London N1C 4AG, England

[2] UCL, Gower St, London WC1E 6BT, England

Source:

SCIENCE | 2018 / Vol. 362 / Issue 6419

Keywords:

GAME;

DOI:

10.1126/science.aar6404

Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.


Pages: 1140 / +

Page count: 30

