Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement

Cited by: 71
Authors
Hernandez-Orallo, Jose [1 ]
Affiliations
[1] Univ Politecn Valencia, DSIC, Valencia, Spain
Keywords
AI evaluation; AI competitions; Machine intelligence; Cognitive abilities; Universal psychometrics; Turing test; INTERNATIONAL PLANNING COMPETITION; UNIVERSAL INTELLIGENCE; COGNITIVE-ABILITIES; COMPUTER-SCIENCE; REINFORCEMENT; BENCHMARKING; ITEM; ENVIRONMENT; SIMPLICITY; COMPLEXITY
DOI
10.1007/s10462-016-9505-7
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role of components and techniques in these systems. We first focus on the traditional task-oriented evaluation approach. We identify three kinds of evaluation: human discrimination, problem benchmarks and peer confrontation. We describe some of the limitations of the many evaluation schemes and competitions in these three categories, and follow the progression of some of these tests. We then focus on a less customary (and challenging) ability-oriented evaluation approach, where a system is characterised by its (cognitive) abilities, rather than by the tasks it is designed to solve. We discuss several possibilities: the adaptation of cognitive tests used for humans and animals, the development of tests derived from algorithmic information theory or more integrated approaches under the perspective of universal psychometrics. We analyse some evaluation tests from AI that are better positioned for an ability-oriented evaluation and discuss how their problems and limitations can possibly be addressed with some of the tools and ideas that appear within the paper. Finally, we enumerate a series of lessons learnt and generic guidelines to be used when an AI evaluation scheme is under consideration.
Pages: 397-447
Page count: 51