Identifying Biological Pathway Interrupting Toxins Using Multi-Tree Ensembles

Cited by: 13
Authors
Barta, Gergo [1]
Affiliations
[1] Budapest Univ Technol & Econ, Data Min Grp, Data & Content Technol Lab, Dept Telecommun & Media Informat, Budapest, Hungary
Keywords
Classification; random forest; toxicity; Tox21; challenge; competition
DOI
10.3389/fenvs.2016.00052
Chinese Library Classification number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
The pharmaceutical industry constantly seeks to improve the methods scientists use to evaluate environmental chemicals and develop new medicines. Much of this process is automated, because manually testing hundreds of thousands of chemicals would be infeasible. Our research effort and the Toxicology in the 21st Century (Tox21) Data Challenge focused on cost-effective automation of toxicological testing, a chemical screening process that looks for possible toxic effects caused by the interruption of biological pathways. The computational models we propose in this paper combine various publicly available substance fingerprinting tools with advanced machine learning techniques. Because the structural analyzers generate a large number of features for each substance, we explore the significance and utility of assorted feature selection methods. Machine learning models were selected and evaluated on their ability to cope with high-dimensional, high-variety data, and multi-tree ensemble methods performed best. Techniques such as random forests and extra trees combine numerous simple tree models; they produced reliable predictions of toxic activity while being nearly non-parametric and insensitive to dimensionality extremes. The Tox21 Data Challenge offered a platform for comparing a wide range of solutions in a controlled and orderly manner. The results demonstrate that the generic approach presented in this paper is comparable to advanced deep learning and domain-specific solutions, even surpassing the competition in some nuclear receptor signaling and stress pathway assays and achieving an accuracy of up to 94 percent.
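The workflow the abstract outlines (feature selection followed by multi-tree ensembles) might be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the paper's actual pipeline: the synthetic data stands in for real fingerprint features, and the function name `evaluate_tree_ensembles`, the median importance threshold, the hyperparameters, and the choice of ROC AUC as the score are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate_tree_ensembles(seed=0):
    """Fit two multi-tree ensembles on synthetic data and return test ROC AUC.

    Hypothetical helper: the data, settings, and metric are illustrative
    assumptions, not values taken from the paper.
    """
    # Synthetic stand-in for fingerprint output: many columns, few of them
    # informative, and an imbalanced "toxic" class (about 10% positives).
    X, y = make_classification(
        n_samples=600,
        n_features=200,
        n_informative=15,
        n_redundant=10,
        weights=[0.9, 0.1],
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    models = {
        "random_forest": RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        ),
        "extra_trees": ExtraTreesClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        ),
    }

    scores = {}
    for name, model in models.items():
        # Assumed selection step: keep features whose tree-based importance
        # is at or above the median importance.
        selector = SelectFromModel(
            ExtraTreesClassifier(n_estimators=100, random_state=seed),
            threshold="median",
        )
        pipeline = make_pipeline(selector, model)
        pipeline.fit(X_train, y_train)
        # Score with the predicted probability of the positive class.
        positive_proba = pipeline.predict_proba(X_test)[:, 1]
        scores[name] = roc_auc_score(y_test, positive_proba)
    return scores


if __name__ == "__main__":
    for model_name, auc in evaluate_tree_ensembles().items():
        print(f"{model_name}: ROC AUC = {auc:.3f}")
```

Wrapping the selector and the classifier in one pipeline keeps the selection step fitted on training data only, which avoids leaking test information into the feature ranking.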
Pages: 12