Dual Indicators to Analyze AI Benchmarks: Difficulty, Discrimination, Ability, and Generality

Cited by: 12
Authors
Martinez-Plumed, Fernando [1]
Hernandez-Orallo, Jose [1]
Affiliations
[1] Univ Politecn Valencia, Valencia 46022, Spain
Keywords
Artificial intelligence; Games; Benchmark testing; Task analysis; Adaptation models; Guidelines; Indexes; Artificial intelligence (AI) benchmarks; AI evaluation; generality; item response theory (IRT); ITEM RESPONSE THEORY; GAME; COMPETITION
DOI
10.1109/TG.2018.2883773
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the purpose of better analyzing the results of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. We illustrate how these key indicators give us more insight into the results of two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition, and we include some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.
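As a rough illustration of how difficulty, discrimination, and ability relate in IRT, the sketch below uses the standard two-parameter logistic (2PL) item characteristic curve. This is an assumption made for illustration: the abstract does not state which IRT model the paper fits, and the function name `p_success` and the example values are hypothetical.

```python
import math


def p_success(ability: float, difficulty: float, discrimination: float) -> float:
    """Standard 2PL item characteristic curve (illustrative assumption).

    Returns the modeled probability that a system with the given ability
    solves a problem with the given difficulty. Discrimination controls
    how sharply that probability changes around the difficulty point.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical values: one problem of difficulty 0.0, two systems.
weak, strong = -1.0, 1.0
low_disc = (p_success(weak, 0.0, 0.5), p_success(strong, 0.0, 0.5))
high_disc = (p_success(weak, 0.0, 3.0), p_success(strong, 0.0, 3.0))
```

Under this model, a system whose ability equals the problem's difficulty has a success probability of 0.5, and a higher discrimination value separates the weak and strong systems more clearly.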
Pages: 121-131
Page count: 11