The Effect of Number of Iterations on Ensemble Gene Selection

Cited by: 3

Authors:

Awada, Wael [1]

Khoshgoftaar, Taghi [1]

Dittman, David [1]

Wald, Randall [1]

Affiliation:

[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA

Source:

2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2 | 2012

Keywords:

Classification; DNA Microarray; Ensemble Feature Selection; Identification; Bioinformatics

DOI:

10.1109/ICMLA.2012.224

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Dimensionality-reducing techniques such as gene selection have become commonplace in order to reduce the high dimensionality found within bioinformatics datasets such as DNA microarray datasets. The degree of dimensionality is reduced by identifying and removing redundant and irrelevant features or genes and leaving only an optimum subset of features for subsequent analysis. However, a number of feature selection techniques show poor stability (resistance to change in the underlying data). One approach for increasing the stability of feature subsets is ensemble feature selection. This is performed first by generating multiple ranked gene lists and then aggregating the results using an aggregation function. While research has been performed on ensemble feature selection and its effect on gene list stability, there has been little research on an important choice made in the process of ensemble feature selection: the number of iterations (or repetitions) of feature selection. The computation time of ensemble feature selection is greatly affected by the number of ranked lists generated: the higher the number of iterations, the more computation time is required. To study this, we evaluate the similarity among feature subsets generated from two different approaches to ensemble feature selection (data diversity and hybrid approach). We calculate the similarity between the final ranked lists generated using 10, 20 and 50 iterations, using the mean aggregation function. Our results show that the similarity between 20 and 50 iterations is high enough for us to recommend using 20 iterations instead of 50 and thus saving the large amount of computation time required for 50 iterations.
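The procedure described in the abstract (generate several ranked gene lists, aggregate them with the mean, then compare the final lists obtained with 10, 20 and 50 iterations) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy data, the ranker (absolute difference of class means), the use of stratified bootstrap resampling as the source of data diversity, and top-k overlap as the similarity measure. All function names are hypothetical.

```python
import random
from statistics import mean


def make_toy_data(rng, n_samples=40, n_features=30, n_informative=5):
    """Synthetic two-class data; only the first few features carry signal."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def rank_features(X, y):
    """Stand-in ranker (assumption): score = |mean(class 1) - mean(class 0)|.

    Returns the rank position of every feature (0 = best).
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    order = sorted(range(n_features), key=lambda j: -scores[j])
    ranks = [0] * n_features
    for position, j in enumerate(order):
        ranks[j] = position
    return ranks


def ensemble_ranking(X, y, n_iterations, rng):
    """Rank features on resampled data n_iterations times, then aggregate
    the rank positions with the mean. Returns feature indices, best first."""
    n_features = len(X[0])
    idx_by_class = {c: [i for i, label in enumerate(y) if label == c]
                    for c in (0, 1)}
    rank_sums = [0] * n_features
    for _ in range(n_iterations):
        # Stratified bootstrap (assumption): resample within each class
        # so that both classes are always present.
        sample = [rng.choice(idx_by_class[c])
                  for c in (0, 1) for _ in idx_by_class[c]]
        Xs = [X[i] for i in sample]
        ys = [y[i] for i in sample]
        for j, r in enumerate(rank_features(Xs, ys)):
            rank_sums[j] += r
    mean_ranks = [s / n_iterations for s in rank_sums]
    return sorted(range(n_features), key=lambda j: mean_ranks[j])


def top_k_overlap(list_a, list_b, k):
    """Fraction of the top-k features shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k


X, y = make_toy_data(random.Random(0))
rankings = {n: ensemble_ranking(X, y, n, random.Random(n))
            for n in (10, 20, 50)}
for a, b in [(10, 20), (10, 50), (20, 50)]:
    overlap = top_k_overlap(rankings[a], rankings[b], k=5)
    print(f"{a} vs {b} iterations: top-5 overlap = {overlap:.2f}")
```

The cost of `ensemble_ranking` grows linearly with `n_iterations`, which is the trade-off the abstract refers to: if the lists from 20 and 50 iterations are nearly the same, the extra 30 ranking passes add little.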

Pages: 198-203

Page count: 6

