Hands-on training about overfitting

Cited by: 43
Authors
Demsar, Janez [1]
Zupan, Blaz [1, 2]
Affiliations
[1] Univ Ljubljana, Fac Comp & Informat Sci, Ljubljana, Slovenia
[2] Baylor Coll Med, Dept Mol & Human Genet, Houston, TX 77030 USA
Keywords
All Open Access; Gold; Green;
DOI
10.1371/journal.pcbi.1008671
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Author summary: Every teacher strives for an a-ha moment, a sudden revelation by a student who gains a fundamental insight she will always remember. In recent years, the authors of this paper have been tailoring their machine learning courses to include material that could lead students to such discoveries. We aim to expose machine learning to practitioners: not only computer scientists but also molecular biologists and students of biomedicine, that is, the end-users of bioinformatics' computational approaches. In this article, we lay out a course that teaches overfitting, one of the key concepts in machine learning that needs to be understood, mastered, and avoided in data science applications. We propose a hands-on approach that uses an open-source, workflow-based data science toolbox that combines data visualization and machine learning. In the proposed training about overfitting, we first deceive the students, then expose the problem, and finally challenge them to find the solution. In the paper, we present three lessons on overfitting and the associated data analysis workflows, and we motivate the use of the introduced computational methods by relating them to concepts conveyed by instructors.
Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. Here we propose hands-on training for overfitting that is suitable for introductory-level courses and can be carried out on its own or embedded within any data science course. We use workflow-based design of machine learning pipelines, experimentation-based teaching, and a hands-on approach that focuses on concepts rather than the underlying mathematics. We detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning, and that is tailored for education in machine learning and explorative data analysis.
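The "deceive, then expose" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' Orange workflow: it is an assumed stand-in written in plain Python, with hypothetical helper names, that fits a 1-nearest-neighbor classifier to purely random labels. The model looks perfect when scored on its own training data, and only scoring on separate data exposes that it learned nothing.

```python
import random


def nearest_neighbor_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point`."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    return train_y[best]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the 1-NN model labels correctly."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)


def make_random_data(rng, n, n_features):
    """Random features with random binary labels: there is nothing to learn."""
    xs = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_random_data(rng, 100, 5)
test_x, test_y = make_random_data(rng, 200, 5)

# "Deceive": scoring on the training data rewards pure memorization.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# "Expose": scoring on unseen data shows performance near chance.
test_acc = accuracy(train_x, train_y, test_x, test_y)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {test_acc:.2f}")
```

Because each training point is its own nearest neighbor, the training accuracy is perfect by construction, while the accuracy on unseen data should hover around one half. Separating the data used for fitting from the data used for scoring is the general remedy this kind of exercise is meant to motivate.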
Pages: 19