CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research

Cited by: 44
Authors
Feltes, Bruno Cesar [1]
Chandelier, Eduardo Bassani [1]
Grisci, Bruno Iochins [1]
Dorn, Marcio [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, Porto Alegre, RS, Brazil
Keywords
benchmarking; cancer; classification; curation; machine learning; microarray; supervised learning; unsupervised learning; CLASSIFICATION; PREDICTION; PATTERNS;
DOI
10.1089/cmb.2018.0238
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
The use of machine learning (ML) approaches to extract gene expression information from microarray studies has increased in recent years, especially in cancer-related work. However, despite this continued interest in applying ML to cancer biomedical research, there are no curated repositories dedicated to providing quality data sets for benchmarking and testing such techniques. In this work, we therefore present the Curated Microarray Database (CuMiDa), a database of 78 handpicked Homo sapiens microarray data sets selected from more than 30,000 microarray experiments in the Gene Expression Omnibus using rigorous filtering criteria. Each data set was individually subjected to background correction, normalization, and sample quality analysis, and was manually edited to eliminate erroneous probes. All data sets were examined with principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to observe sample separation, and were additionally tested with various ML approaches to provide a baseline accuracy for the major techniques applied to microarray data sets. CuMiDa was created solely for benchmarking and testing ML approaches applied to cancer research.
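
To make the idea of a "baseline accuracy" concrete, the following is a minimal sketch of how one might benchmark a simple classifier on a samples-by-genes expression matrix. It is not the authors' pipeline: the data are synthetic (generated by the hypothetical helper `make_synthetic_dataset`), the nearest-centroid classifier and leave-one-out evaluation are illustrative choices that the abstract does not name, and all function names and parameters are assumptions introduced here.

```python
import random
from statistics import fmean


def make_synthetic_dataset(n_per_class=20, n_genes=50, n_informative=5,
                           shift=2.0, seed=0):
    """Hypothetical stand-in for an expression matrix (rows: samples, columns: genes)."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in ("normal", "tumor"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == "tumor":
                # Shift the first few genes to mimic differential expression.
                for g in range(n_informative):
                    row[g] += shift
            samples.append(row)
            labels.append(label)
    return samples, labels


def centroid(rows):
    """Per-gene mean over a list of samples."""
    return [fmean(column) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, query):
    """Return the label whose class centroid is closest to the query sample."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in sorted(set(train_y))
    }
    return min(centroids, key=lambda label: squared_distance(centroids[label], query))


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and report the hit rate."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


samples, labels = make_synthetic_dataset()
baseline_accuracy = leave_one_out_accuracy(samples, labels)
print(f"Leave-one-out baseline accuracy: {baseline_accuracy:.2f}")
```

With a real data set, the synthetic generator would be replaced by code that loads the expression matrix and class labels; the evaluation loop stays the same, which is what makes a fixed, curated collection useful for comparing methods.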
Pages: 376-386
Page count: 11