FAST FEATURE SUBSET SELECTION IN BIOLOGICAL SEQUENCE ANALYSIS

Cited by: 2
Authors
Pudimat, Rainer [1]
Backofen, Rolf [1]
Schukat-Talamazzini, Ernst G. [2]
Affiliations
[1] Univ Freiburg, Inst Informat, D-79110 Freiburg, Germany
[2] Univ Jena, Inst Informat, D-07743 Jena, Germany
Keywords
Computational biology; transcription factor binding sites; feature selection; combinatorial optimization; linear predictors; combinatorial regression; kernel methods; classification; algorithms; ranking
DOI
10.1142/S0218001409007107
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Biological research produces a wealth of measured data. This makes it hard for biologists to postulate hypotheses about the behavior or structure of the observed entity, because the relevant properties are lost in the ocean of measurements. For the same reason, it is hard to design machine learning algorithms that classify or cluster the data items. Algorithms that automatically select a highly predictive subset of the measured features can help to overcome these difficulties. We present an efficient feature selection strategy that can be applied to arbitrary feature selection problems. The core technique is a new method for estimating the quality of subsets from previously calculated qualities of smaller subsets by minimizing the mean standard error of the estimated values with an approach common to support vector machines. This method can be integrated into many feature subset search algorithms. We have applied it with sequential search algorithms and reduced the number of quality calculations needed to find accurate feature subsets by about 70%. We demonstrate these improvements by applying our approach to the problem of finding highly predictive feature subsets for transcription factor binding sites.
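The general idea in the abstract, estimating the quality of a larger subset from qualities already computed for smaller ones so that fewer exact evaluations are needed, can be illustrated with a minimal sketch. It is not the authors' estimator: the paper fits the estimate with an approach common to support vector machines, whereas this sketch swaps in a plain average of previously observed gains. All names (`estimate_quality`, `forward_select`, `top_k`, the toy `quality` function and its weights) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Tuple

Subset = FrozenSet[str]


def estimate_quality(base: Subset, feature: str,
                     known: Dict[Subset, float]) -> Optional[float]:
    """Estimate q(base | {feature}) from exact qualities of smaller subsets.

    Stand-in estimator (an assumption, not the paper's method): average the
    gain that `feature` produced in every known pair (T, T | {feature}).
    """
    gains = [known[t | {feature}] - q
             for t, q in known.items()
             if feature not in t and (t | {feature}) in known]
    if not gains:
        return None  # nothing to extrapolate from yet
    return known[base] + sum(gains) / len(gains)


def forward_select(features: Sequence[str],
                   quality: Callable[[Subset], float],
                   n_select: int, top_k: int) -> Tuple[Subset, int]:
    """Sequential forward search that evaluates only the top_k candidates
    (ranked by estimated quality) exactly in each round."""
    known: Dict[Subset, float] = {}
    calls = 0

    def exact(s: Subset) -> float:
        nonlocal calls
        if s not in known:
            known[s] = quality(s)
            calls += 1
        return known[s]

    selected: Subset = frozenset()
    exact(selected)  # baseline quality of the empty subset
    for _ in range(n_select):
        must_eval: List[Subset] = []
        ranked: List[Tuple[float, Subset]] = []
        for f in features:
            if f in selected:
                continue
            est = estimate_quality(selected, f, known)
            if est is None:
                must_eval.append(selected | {f})
            else:
                ranked.append((est, selected | {f}))
        ranked.sort(key=lambda pair: pair[0], reverse=True)
        to_eval = must_eval + [s for _, s in ranked[:top_k]]
        selected = max(to_eval, key=exact)
    return selected, calls


# Toy quality function (hypothetical): additive feature weights.
WEIGHTS = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}


def quality(subset: Subset) -> float:
    return sum(WEIGHTS[f] for f in subset)


selected, calls = forward_select(list(WEIGHTS), quality, n_select=3, top_k=2)
print(sorted(selected), calls)
```

In this toy run the first round has no smaller subsets to extrapolate from, so every single feature is evaluated exactly; later rounds evaluate only the two best-ranked candidates. The savings on this additive toy problem say nothing about the roughly 70% reduction reported in the abstract, which refers to the authors' own estimator and data.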
Pages: 191-207
Number of pages: 17