Dual Indicators to Analyze AI Benchmarks: Difficulty, Discrimination, Ability, and Generality

Cited by: 12

Authors:

Martinez-Plumed, Fernando [1]

Hernandez-Orallo, Jose [1]

Affiliations:

[1] Univ Politecn Valencia, Valencia 46022, Spain

Source:

IEEE TRANSACTIONS ON GAMES | 2020 / Vol. 12 / No. 2

Keywords:

Artificial intelligence; Games; Benchmark testing; Task analysis; Adaptation models; Guidelines; Indexes; Artificial intelligence (AI) benchmarks; AI evaluation; generality; item response theory (IRT); ITEM RESPONSE THEORY; GAME; COMPETITION

DOI:

10.1109/TG.2018.2883773

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the purpose of better analyzing the result of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. We illustrate how these key indicators give us more insight on the results of two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition, and we include some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.
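As a rough illustration of how the three IRT-derived indicators relate to each other, the sketch below uses the standard two-parameter logistic (2PL) item response function. The abstract does not say which IRT model the authors use, so the choice of 2PL, the function name `p_success`, and the sample values are assumptions made only for this example. Generality is not sketched, since the abstract does not give its formula.

```python
import math


def p_success(ability: float, difficulty: float, discrimination: float) -> float:
    """Probability that a system with the given ability solves an item.

    Standard 2PL item response function (an assumed model, not taken
    from the record): a logistic curve centred on the item's difficulty,
    with slope controlled by the item's discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # Hypothetical item: difficulty 0.0, discrimination 1.5.
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(p_success(theta, 0.0, 1.5), 3))
```

Under this model, a system whose ability equals the item's difficulty succeeds with probability 0.5, and a larger discrimination makes the curve separate weaker from stronger systems more sharply around that point.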

References

Pages: 121-131

Number of pages: 11

51 entries in total

[1] [Anonymous], 2012, Believable bots: Can computers play like people? Germany
[2] [Anonymous], 2015, J ARTIF INTELL RES
[3] [Anonymous], THESIS
[4] Ashlock Daniel, 2017, 2017 IEEE Conference on Computational Intelligence and Games (CIG), P17, DOI 10.1109/CIG.2017.8080410
[5] Bache K., 2013, UCI machine learning repository
[6] Balduzzi D., 2018, ARXIV180602643
[7] Birnbaum A., 1968, STAT THEORIES MENTAL
[8] Bontrager P., 2016, AIIDE
[9] Browne, Cameron B.; Powley, Edward; Whitehouse, Daniel; Lucas, Simon M.; Cowling, Peter I.; Rohlfshagen, Philipp; Tavener, Stephen; Perez, Diego; Samothrakis, Spyridon; Colton, Simon. A Survey of Monte Carlo Tree Search Methods [J]. IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, 2012, 4 (01): 1-43
[10] Campbell, M; Hoane, AJ; Hsu, FH. Deep blue [J]. ARTIFICIAL INTELLIGENCE, 2002, 134 (1-2): 57-83
