Estimation of Distribution Algorithms as Logistic Regression Regularizers of Microarray Classifiers

Cited by: 9
Authors
Bielza, C. [1 ]
Robles, V. [2 ]
Larranaga, P. [1 ]
Affiliations
[1] Univ Politecn Madrid, Dept Inteligencia Artificial, E-28040 Madrid, Spain
[2] Univ Politecn Madrid, Dept Arquitectura & Tecnol Sistemas Informat, E-28040 Madrid, Spain
Keywords
Logistic regression; regularization; estimation of distribution algorithms; DNA microarrays; GENE-EXPRESSION; MOLECULAR DIAGNOSIS; SELECTION; CLASSIFICATION; CANCER; DISCRIMINATION; TUMORS
DOI
10.3414/ME9223
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objectives: The "large k (genes), small N (samples)" phenomenon complicates microarray classification with logistic regression. Indeterminacy of the maximum likelihood solutions, multicollinearity of the predictor variables and over-fitting of the data cause unstable parameter estimates. Moreover, computational problems arise from the large number of predictor variables (genes). Regularized logistic regression is an effective solution. However, it involves an objective function that is hard to optimize from a mathematical viewpoint and regularization parameters that require careful tuning. Methods: These difficulties are tackled by introducing a new way of regularizing logistic regression. Estimation of distribution algorithms (EDAs), a kind of evolutionary algorithm, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function via our EDA, without the likelihood having to be penalized. Likelihood penalties add a number of difficulties to the resulting optimization problems, which vanish in our case. New estimates are simulated during the evolutionary process of the EDA in a way that guarantees their shrinkage while maintaining the learnt probabilistic dependence relationships. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are the best markers for the classification. Results: The consistency with the literature and the excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material. Conclusions: We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. Empirical results on microarray data sets yield sparse models that contain confirmed genes and classify better than competing regularized methods.
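The Methods part of the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a univariate Gaussian EDA (so it ignores the dependence relationships the paper's model learns), and it uses clipping of sampled coefficients to a fixed interval as a hypothetical stand-in for the paper's shrinkage-preserving simulation step. The names `log_likelihood`, `eda_logistic`, `bound`, `pop_size` and `n_select` are illustrative only. What it does share with the abstract is the core idea: the unpenalized logistic likelihood is maximized, and any shrinkage comes from how new candidates are simulated.

```python
import math
import random


def log_likelihood(beta, X, y):
    """Unpenalized logistic log-likelihood; beta[0] is the intercept."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
        # log(1 + exp(z)) evaluated in a numerically stable form
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += yi * z - log1pexp
    return total


def eda_logistic(X, y, pop_size=60, n_select=20, generations=40,
                 bound=3.0, seed=0):
    """Univariate Gaussian EDA maximizing the logistic log-likelihood.

    Assumption: clipping every sampled coefficient to [-bound, bound]
    is a simplified placeholder for shrinkage, not the paper's mechanism.
    """
    rng = random.Random(seed)
    dim = len(X[0]) + 1
    mean = [0.0] * dim
    std = [1.0] * dim
    best, best_ll = list(mean), log_likelihood(mean, X, y)
    for _ in range(generations):
        # Simulate a new population from the current probabilistic model
        pop = [
            [max(-bound, min(bound, rng.gauss(m, s)))
             for m, s in zip(mean, std)]
            for _ in range(pop_size)
        ]
        # Select the candidates with the highest likelihood
        pop.sort(key=lambda b: log_likelihood(b, X, y), reverse=True)
        elite = pop[:n_select]
        elite_ll = log_likelihood(elite[0], X, y)
        if elite_ll > best_ll:
            best, best_ll = elite[0], elite_ll
        # Re-estimate the model (per-coefficient mean and std) from the elite
        mean = [sum(b[j] for b in elite) / n_select for j in range(dim)]
        std = [
            max(1e-3, math.sqrt(
                sum((b[j] - mean[j]) ** 2 for b in elite) / n_select))
            for j in range(dim)
        ]
    return best, best_ll


if __name__ == "__main__":
    # Toy data: 6 samples, 2 hypothetical "genes"
    X = [[-2.0, 0.3], [-1.0, -0.5], [-0.5, 0.8],
         [0.5, -0.2], [1.0, 0.4], [2.0, -0.7]]
    y = [0, 0, 1, 0, 1, 1]
    beta, ll = eda_logistic(X, y)
    print(beta, ll)
```

In a gene-selection setting such as the one the abstract describes, a routine like this would be called repeatedly inside a feature elimination loop, dropping the genes with the smallest coefficients each round; that outer loop is omitted here.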
Pages: 236-241
Number of pages: 6
Related papers
41 items in total
[1]   Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression [J].
Abba, MC ;
Drake, JA ;
Hawkins, KA ;
Hu, YH ;
Sun, HX ;
Notcovich, C ;
Gaddis, S ;
Sahin, A ;
Baggerly, K ;
Aldaz, CM .
BREAST CANCER RESEARCH, 2004, 6 (05) :R499-R513
[2]   Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays [J].
Alon, U ;
Barkai, N ;
Notterman, DA ;
Gish, K ;
Ybarra, S ;
Mack, D ;
Levine, AJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1999, 96 (12) :6745-6750
[3]  
[Anonymous], 2006, NEW EVOLUTIONARY COM
[4]   Is cross-validation valid for small-sample microarray classification? [J].
Braga-Neto, UM ;
Dougherty, ER .
BIOINFORMATICS, 2004, 20 (03) :374-380
[5]   Gene selection in cancer classification using sparse logistic regression with Bayesian regularization [J].
Cawley, Gavin C. ;
Talbot, Nicola L. C. .
BIOINFORMATICS, 2006, 22 (19) :2348-2355
[6]   HMGA1 protein overexpression in human breast carcinomas: Correlation with ErbB2 expression [J].
Chiappetta, G ;
Botti, G ;
Monaco, M ;
Pasquinelli, R ;
Pentimalli, F ;
Di Bonito, M ;
D'Aiuto, G ;
Fedele, M ;
Iuliano, R ;
Palmieri, EA ;
Pierantoni, GM ;
Giancotti, V ;
Fusco, A .
CLINICAL CANCER RESEARCH, 2004, 10 (22) :7637-7644
[7]   Comparison of discrimination methods for the classification of tumors using gene expression data [J].
Dudoit, S ;
Fridlyand, J ;
Speed, TP .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (457) :77-87
[8]  
Dugas M, 2006, METHOD INFORM MED, V45, P146
[9]  
EILERS P, 2000, P SOC PHOTO-OPT INS, V4266, P187
[10]   Classification using partial least squares with penalized logistic regression [J].
Fort, G ;
Lambert-Lacroix, S .
BIOINFORMATICS, 2005, 21 (07) :1104-1111