Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction

Cited by: 14
Authors
Antelo-Collado, Aurelio [1]
Carrasco-Velar, Ramon [1]
Garcia-Pedrajas, Nicolas [2]
Cerruela-Garcia, Gonzalo [2]
Affiliations
[1] Univ Informat Sci, Cheminformat Grp, Havana 19370, Cuba
[2] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
Keywords
THROUGHPUT SCREENING DATA; DATA SETS; DESCRIPTORS
DOI
10.1021/acs.jcim.0c00908
Chinese Library Classification
R914 [Medicinal Chemistry]
Subject classification code
100701
Abstract
During drug development, toxicity tests and adverse-effect studies are routinely carried out; they are essential to guarantee patient safety and the success of the research. In silico quantitative structure-activity relationship (QSAR) approaches to this task involve processing large amounts of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem, and it may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is likewise usually degraded by the class-imbalanced nature of the datasets involved. This paper proposes an FS method designed to deal with class-imbalance problems. The method is based on FS ensembles constructed by boosting, using two well-known FS methods: fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
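The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of a boosted feature-selection ensemble, under stated assumptions. The base selector simply ranks features by weighted symmetrical uncertainty (the relevance measure used by the fast correlation-based filter) and omits the redundancy-removal step of the real filter. The AdaBoost-style reweighting, the one-feature stump, the alpha-weighted voting, and every function name here are hypothetical illustrations, not the authors' procedure.

```python
import math
from collections import defaultdict


def entropy(weight_by_value):
    """Shannon entropy (bits) of a weighted discrete distribution."""
    total = sum(weight_by_value.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weight_by_value.values() if w > 0)


def symmetrical_uncertainty(column, labels, weights):
    """Weighted SU(feature, class) = 2 * I(X; Y) / (H(X) + H(Y))."""
    fx, fy, fxy = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, y, w in zip(column, labels, weights):
        fx[x] += w
        fy[y] += w
        fxy[(x, y)] += w
    hx, hy, hxy = entropy(fx), entropy(fy), entropy(fxy)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def select_top_k(X, y, weights, k):
    """Simplified base selector: the k features with the highest weighted SU."""
    n_features = len(X[0])
    scores = [symmetrical_uncertainty([row[j] for row in X], y, weights)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]


def stump_predict(X, y, weights, feature):
    """Predict the weighted-majority class observed for each feature value."""
    votes = defaultdict(lambda: defaultdict(float))
    for row, label, w in zip(X, y, weights):
        votes[row[feature]][label] += w
    majority = {v: max(c, key=c.get) for v, c in votes.items()}
    return [majority[row[feature]] for row in X]


def boosted_feature_selection(X, y, k=1, rounds=5):
    """Assumed scheme: reweight samples AdaBoost-style, vote on features."""
    n = len(X)
    weights = [1.0 / n] * n
    feature_votes = defaultdict(float)
    for _ in range(rounds):
        selected = select_top_k(X, y, weights, k)
        preds = stump_predict(X, y, weights, selected[0])
        err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        for j in selected:
            feature_votes[j] += max(alpha, 0.0)
        # Misclassified samples (often the minority class) gain weight.
        weights = [w * math.exp(alpha if p != t else -alpha)
                   for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return sorted(feature_votes, key=lambda j: -feature_votes[j])[:k]


# Toy imbalanced data: 2 actives, 8 inactives. Feature 0 tracks the label,
# feature 1 is constant, feature 2 is unrelated to the label.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X = [[label, 0, i % 2] for i, label in enumerate(y)]
print(boosted_feature_selection(X, y, k=1))
```

On this toy dataset the informative feature is the only one selected. A real pipeline would operate on molecular descriptors, discretize continuous values before computing entropies, and evaluate the selected subset with a classifier under an imbalance-aware metric.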
Pages: 76-94
Page count: 19