Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features

Cited by: 78

Authors:

Demir-Kavuk, Ozgur [1]

Kamada, Mayumi [2]

Akutsu, Tatsuya [2]

Knapp, Ernst-Walter [1]

Affiliations:

[1] Free Univ Berlin, Inst Chem & Biochem, D-14195 Berlin, Germany

[2] Kyoto Univ, Inst Chem Res, Bioinformat Ctr, Kyoto 6110011, Japan

Source:

BMC BIOINFORMATICS | 2011 / Volume 12

Keywords:

CLASSIFICATION; DESCRIPTORS; REGRESSION

DOI:

10.1186/1471-2105-12-412

Chinese Library Classification number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Background: Machine learning methods are now used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. Building a prediction model requires a so-called training data set of molecules with measured target properties. For many such problems the size of the training data set is limited because measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) are suitable for establishing a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than a hundred) molecules are available for training.

Results: The CoEPrA contest provides four data sets that are typical of biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features, reducing the initial set of more than 6,000 features to about 50. In the second stage, we used only the features selected in the first stage and applied a milder L2 regularization, which generally yielded a further improvement in prediction performance. Our linear model employed a soft loss function, which minimizes the influence of outliers.

Conclusions: The proposed two-step method showed good results on all four CoEPrA regression tasks. It may therefore be useful for many other biological prediction problems in which only a small number of molecules, each described by thousands of descriptors, are available for training.
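The two-stage procedure described in the abstract (L1 regularization to select features, then a milder L2 regularization on the selected features only) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain squared loss in place of the paper's soft loss function, uses a simple coordinate-descent solver, and runs on synthetic data. The function names (`soft_threshold`, `fit_penalized`, `two_step_fit`, `make_toy_data`) and the penalty values are hypothetical choices made only for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; this step produces exact zeros under an L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_penalized(X, y, l1=0.0, l2=0.0, n_sweeps=100):
    """Coordinate descent on (1/2n)*||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||_2^2.

    Assumption: squared loss stands in for the paper's soft loss function.
    """
    n = len(X)
    p = len(X[0]) if n else 0
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j itself.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def two_step_fit(X, y, l1=0.5, l2=0.01):
    """Stage 1: L1 fit to pick features. Stage 2: milder L2 refit on those features."""
    w_l1 = fit_penalized(X, y, l1=l1)
    selected = [j for j, wj in enumerate(w_l1) if wj != 0.0]
    X_selected = [[row[j] for j in selected] for row in X]
    w_l2 = fit_penalized(X_selected, y, l2=l2)
    return selected, w_l2


def make_toy_data(n_samples=40, n_features=200, seed=0):
    """Synthetic data (hypothetical): only features 0 and 1 influence the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [5.0 * row[0] - 4.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


X, y = make_toy_data()
selected, weights = two_step_fit(X, y)
print(f"kept {len(selected)} of {len(X[0])} features:", selected)
```

In this sketch the first stage only decides which features survive; the second stage re-estimates their weights under a much weaker penalty, so the retained coefficients are less shrunken than in the L1 fit. In practice both penalty strengths would be tuned, for example by cross-validation, rather than fixed as they are here.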


Pages: 10
