Evaporative cooling feature selection for genotypic data involving interactions

Cited by: 29
Authors
McKinney, B. A. [1]
Reif, D. M.
White, B. C.
Crowe, J. E., Jr.
Moore, J. H.
Affiliations
[1] Univ Alabama Birmingham, Sch Med, Dept Genet, Birmingham, AL 35294 USA
[2] US EPA, Natl Ctr Computat Toxicol, Res Triangle Pk, NC 27711 USA
[3] Dartmouth Med Sch, Dept Genet, Computat Genet Lab, Lebanon, NH 03756 USA
[4] Vanderbilt Univ, Med Ctr, Dept Microbiol, Nashville, TN 37232 USA
[5] Vanderbilt Univ, Med Ctr, Dept Pediat, Nashville, TN 37232 USA
DOI
10.1093/bioinformatics/btm317
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: The development of genome-wide capabilities for genotyping has led to the practical problem of identifying the minimum subset of genetic variants relevant to the classification of a phenotype. This challenge is especially difficult in the presence of attribute interactions, noise and small sample size.
Methods: Analogous to the physical mechanism of evaporation, we introduce an evaporative cooling (EC) feature selection algorithm that seeks to obtain a subset of attributes with the optimum information temperature (i.e. the least noise). EC uses an attribute quality measure analogous to thermodynamic free energy that combines Relief-F and mutual information to evaporate (i.e. remove) noise features, leaving behind a subset of attributes that contain DNA sequence variations associated with a given phenotype.
Results: EC is able to identify functional sequence variations that involve interactions (epistasis) between other sequence variations that influence their association with the phenotype. This ability is demonstrated on simulated genotypic data with attribute interactions and on real genotypic data from individuals who experienced adverse events following smallpox vaccination. The EC formalism allows us to combine information entropy, energy and temperature into a single information free energy attribute quality measure that balances interaction and main effects.
Availability: Open source software, written in Java, is freely available upon request.
Contact: brett.mckinney@gmail.com
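The Methods summary can be made concrete with a small sketch. The abstract does not give the free energy formula, so the additive combination below (Relief weight plus `temperature` times mutual information) is an assumption made only for illustration, as are the function names `mutual_information`, `relief_scores` and `evaporate`. The sketch also substitutes basic Relief (one nearest hit and one nearest miss per sample) for Relief-F, and it is not the authors' Java implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def relief_scores(X, y):
    """Basic Relief weights using Hamming distance.

    Simplification of Relief-F: one nearest hit and one nearest miss
    per sample. Assumes every class has at least two samples.
    """
    n, m = len(X), len(X[0])

    def dist(a, b):
        return sum(u != v for u, v in zip(a, b))

    w = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for a in range(m):
            # Reward attributes that differ on the nearest miss,
            # penalise attributes that differ on the nearest hit.
            w[a] += ((X[i][a] != X[miss][a]) - (X[i][a] != X[hit][a])) / n
    return w


def evaporate(X, y, keep, temperature=1.0):
    """Remove the lowest-quality attribute one at a time until `keep` remain.

    Quality = Relief weight + temperature * mutual information
    (an assumed combination, not the formula from the paper).
    Returns the column indices of the surviving attributes.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > keep:
        sub = [[row[a] for a in remaining] for row in X]
        rf = relief_scores(sub, y)
        mi = [mutual_information([row[k] for row in sub], y)
              for k in range(len(remaining))]
        quality = [r + temperature * i for r, i in zip(rf, mi)]
        worst = min(range(len(remaining)), key=quality.__getitem__)
        del remaining[worst]  # "evaporate" the noisiest attribute
    return remaining
```

Recomputing both scores after each removal reflects the iterative character of evaporation: the Relief neighbourhoods change as noise attributes leave the data, so the weights of the survivors can shift between rounds.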
Pages: 2113-2120
Page count: 8