PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods

Cited by: 14
Authors
Romano, Joseph D. [1 ,2 ]
Le, Trang T. [1 ]
La Cava, William [1 ]
Gregg, John T. [1 ]
Goldberg, Daniel J. [3 ]
Chakraborty, Praneel [4 ,5 ]
Ray, Natasha L. [6 ]
Himmelstein, Daniel [7 ,8 ]
Fu, Weixuan [1 ]
Moore, Jason H. [1 ]
Affiliations
[1] Univ Penn, Inst Biomed Informat, Philadelphia, PA 19104 USA
[2] Univ Penn, Ctr Excellence Environm Toxicol, Philadelphia, PA 19104 USA
[3] Washington Univ, Dept Comp Sci & Engn, St Louis, MO 63130 USA
[4] Univ Penn, Sch Arts & Sci, Philadelphia, PA 19104 USA
[5] Univ Penn, Wharton Sch, Philadelphia, PA 19104 USA
[6] Princeton Day Sch, Princeton, NJ 08540 USA
[7] Related Sci, Denver, CO 80220 USA
[8] Univ Penn, Dept Syst Pharmacol & Translat Therapeut, Philadelphia, PA 19104 USA
Funding
US National Institutes of Health (NIH)
DOI
10.1093/bioinformatics/btab727
Chinese Library Classification
Q5 [Biochemistry]
Subject Classification Codes
071010; 081704
Abstract
Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
Results: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods, aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community.
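The "standardized, user-friendly interface" described in the abstract is, on the Python side, the pmlb package accompanying the paper. The sketch below shows the intended workflow of pulling a benchmark dataset straight into a scikit-learn experiment; the names fetch_data, return_X_y, and classification_dataset_names reflect the package's documented interface, but the exact signatures should be checked against the version you install.

```python
# Minimal sketch of using PMLB with scikit-learn (assumes `pip install pmlb`).
# fetch_data and the dataset-name lists are part of the pmlb package's
# documented interface; verify argument names against your installed version.
from pmlb import fetch_data, classification_dataset_names

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Datasets are fetched by name; with return_X_y=True the call returns
# feature and target arrays instead of a single DataFrame.
X, y = fetch_data('mushroom', return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"Held-out accuracy on 'mushroom': {clf.score(X_test, y_test):.3f}")

# The package also exposes lists of dataset names, which makes it easy to
# sweep a new method across the full benchmark suite.
print(len(classification_dataset_names), "classification datasets available")
```

According to the package documentation, fetch_data also accepts a local cache directory so repeated benchmark sweeps do not re-download the data.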
Pages: 878-880
Number of pages: 3
References
12 items in total
  • [1] Caruana R., 2006, Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, p. 161
  • [2] Cortes C., 1995, Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), p. 57
  • [3] Dua D., 2019, UCI Machine Learning Repository
  • [4] Hastie T., Tibshirani R., Friedman J., 2009, The Elements of Statistical Learning, 2nd edn, Springer Series in Statistics, DOI 10.1007/978-0-387-84858-7
  • [5] Mangul S., Martin L.S., Hill B.L., Lam A.K.-M., Distler M.G., Zelikovsky A., Eskin E., Flint J. Systematic benchmarking of omics computational tools. Nature Communications, 2019, 10(1)
  • [6] Mitchell K., Brito J.J., Mandric I., Wu Q., Knyazev S., Chang S., Martin L.S., Karlsberg A., Gerasimov E., Littman R., Hill B.L., Wu N.C., Yang H.T., Hsieh K., Chen L., Littman E., Shabani T., Enik G., Yao D., Sun R., Schroeder J., Eskin E., Zelikovsky A., Skums P., Pop M., Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biology, 2020, 21(1)
  • [7] Nicolucci A., Rossi M.C., Pellegrini F., Lucisano G., Pintaudi B., Gentile S., Marra G., Skovlund S.E., Vespasiani G. Benchmarking network for clinical and humanistic outcomes in diabetes (BENCH-D) study: protocol, tools, and population. SpringerPlus, 2014, 3, 1-9
  • [8] Olson R.S., La Cava W., Orzechowski P., Urbanowicz R.J., Moore J.H. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 2017, 10
  • [9] Pedregosa F. et al., 2011, Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825
  • [10] Pezoa F., Reutter J.L., Suarez F., Ugarte M., Vrgoc D. Foundations of JSON Schema. Proceedings of the 25th International Conference on World Wide Web (WWW'16), 2016, 263-273