Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors

Cited by: 25

Authors:

Okun, Oleg [1]

Priisalu, Helen [2]

Affiliations:

[1] Univ Oulu, Elect & Informat Engn Dept, Oulu 90014, Finland

[2] Tallinn Univ Technol, Inst Cybernet, EE-12618 Tallinn, Estonia

Source:

ARTIFICIAL INTELLIGENCE IN MEDICINE | 2009 / Vol. 45 / Issue 2-3

Keywords:

Pattern recognition; Gene expression; Cancer classification; k-nearest neighbors; Ensemble of classifiers; FEATURE-SELECTION; MICROARRAY DATA; DNA; PREDICTION; CLASSIFIERS; TUMOR;

DOI:

10.1016/j.artmed.2008.08.004

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Objective: We explore the link between dataset complexity, which determines how difficult a dataset is for classification, and classification performance defined by the low-variance and low-biased bolstered resubstitution error made by k-nearest neighbor classifiers. Methods and material: Gene expression based cancer classification is used as the task in this study. Six gene expression datasets containing different types of cancer constitute the test data. Results: Through extensive simulation coupled with the copula method for analysis of association in bivariate data, we show that dataset complexity and bolstered resubstitution error are associated in terms of dependence. As a result, we propose a new scheme for generating ensembles of classifiers that selects subsets of features of low complexity for ensemble members, which constitute the accurate members according to the found dependence relation. Conclusion: Experiments with six gene expression datasets demonstrate that our ensemble generating scheme based on the dependence of dataset complexity and classification error is superior to a single best classifier in the ensemble and to the traditional ensemble construction scheme that is ignorant of dataset complexity. (c) 2008 Elsevier B.V. All rights reserved.
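
Illustrative sketch (not from the paper): the abstract describes an ensemble scheme that prefers low-complexity feature subsets for its k-nearest neighbor members. The Python below is a minimal sketch of that general idea under stated assumptions. The record does not say which complexity measure the authors use, so a Fisher-ratio-based proxy is assumed here as a stand-in. Bolstered resubstitution error is not implemented. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter


def fisher_complexity(X, y, features):
    """Assumed complexity proxy: 1 / (1 + max Fisher ratio) over `features`.

    Lower values mean the two classes (labels 0 and 1) are easier to separate.
    This is a stand-in, not necessarily the measure used in the paper.
    """
    best = 0.0
    for f in features:
        a = [row[f] for row, label in zip(X, y) if label == 0]
        b = [row[f] for row, label in zip(X, y) if label == 1]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        var_a = sum((v - mean_a) ** 2 for v in a) / len(a)
        var_b = sum((v - mean_b) ** 2 for v in b) / len(b)
        ratio = (mean_a - mean_b) ** 2 / (var_a + var_b + 1e-12)
        best = max(best, ratio)
    return 1.0 / (1.0 + best)


def knn_predict(X_train, y_train, x, features, k=3):
    """Plain k-NN vote using squared Euclidean distance on `features` only."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def build_ensemble(X, y, n_candidates=50, subset_size=2, n_members=5, seed=0):
    """Draw random feature subsets and keep the n_members of lowest complexity."""
    rng = random.Random(seed)
    n_features = len(X[0])
    candidates = sorted({
        tuple(sorted(rng.sample(range(n_features), subset_size)))
        for _ in range(n_candidates)
    })
    candidates.sort(key=lambda feats: fisher_complexity(X, y, feats))
    return candidates[:n_members]


def ensemble_predict(X_train, y_train, x, ensemble, k=3):
    """Majority vote over one k-NN classifier per selected feature subset."""
    votes = Counter(knn_predict(X_train, y_train, x, feats, k) for feats in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_data(n_per_class=20, n_features=6, seed=1):
    """Synthetic two-class data: feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] = 4.0 * label + rng.gauss(0.0, 0.5)
            X.append(row)
            y.append(label)
    return X, y
```

A call such as `build_ensemble(*make_toy_data())` returns a list of feature-index tuples, which `ensemble_predict` then uses for voting. The "traditional" scheme mentioned in the abstract would correspond to skipping the complexity-based sort and keeping random subsets as drawn.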

Pages: 151-162

Number of pages: 12


← 1 2 3 4 →