Mapping global dynamics of benchmark creation and saturation in artificial intelligence

Cited by: 7
Authors
Ott, Simon [1]
Barbosa-Silva, Adriano [1,2]
Blagec, Kathrin [1]
Brauner, Jan [3,4]
Samwald, Matthias [1]
Affiliations
[1] Med Univ Vienna, Inst Artificial Intelligence, Wahringerstr 25a, A-1090 Vienna, Austria
[2] ITTM SA Informat Technol Translat Med, L-4354 Esch Sur Alzette, Luxembourg
[3] Univ Oxford, Dept Comp Sci, Oxford Appl & Theoret Machine Learning OATML Grp, Oxford, England
[4] Univ Oxford, Future Humanity Inst, Oxford, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
DOI
10.1038/s41467-022-34591-0
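Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract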
Benchmarks are crucial to measuring and steering progress in artificial intelligence (AI). However, recent studies raised concerns over the state of AI benchmarking, reporting issues such as benchmark overfitting, benchmark saturation and increasing centralization of benchmark dataset creation. To facilitate monitoring of the health of the AI benchmarking ecosystem, we introduce methodologies for creating condensed maps of the global dynamics of benchmark creation and saturation. We curate data for 3765 benchmarks covering the entire domains of computer vision and natural language processing, and show that a large fraction of benchmarks quickly trends towards near-saturation, that many benchmarks fail to find widespread utilization, and that benchmark performance gains for different AI tasks are prone to unforeseen bursts. We analyze attributes associated with benchmark popularity, and conclude that future benchmarks should emphasize versatility, breadth and real-world utility.
Pages: 11