Comparative Study of Disease Classification Using Multiple Machine Learning Models Based on Landmark and Non-Landmark Gene Expression Data

Cited by: 3
Authors
Huang, Xiaoqin [1]
Sun, Jian [2]
Srinivasan, Satish Mahadevan [1]
Sangwan, Raghvinder S. [1]
Affiliations
[1] Penn State Univ, Engn Dept, 30 Swedesford Rd, Malvern, PA 19355 USA
[2] German Ctr Neurodegenerat Dis DZNE, Otfried Muller Str 23, D-72076 Tubingen, Germany
Source
BIG DATA, IOT, AND AI FOR A SMARTER FUTURE | 2021 / Vol. 185
Keywords
Landmark Gene; Disease Classification; Machine Learning; Artificial Neural Network; Gene Expression Analysis
DOI
10.1016/j.procs.2021.05.028
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study compares disease classification based on landmark and non-landmark gene expression data and clinical variables, using multiple machine-learning models. The influence of the number of principal components (PCs) and of the number of genes was also investigated. The results indicate that the ANN model has the best accuracy for disease type prediction among all the models, that models using 95 PCs are more accurate than models using 25 PCs, and that the more genes are used, the higher the prediction accuracy. Models using landmark genes were more accurate than models using non-landmark genes, especially with 95 PCs, for all model types except decision trees. The optimal model used landmark genes with 95 PCs as features for an ANN classifier. On the test set, its AUC values were 0.98, 0.98, 1, and 0.96 for the Autoimmune, Bacteremia, Cancer, and Healthy classes, respectively, and the accuracies for the respective classes were 97.56%, 95.65%, 95.65%, and 58.82%. The ANN model distinguished well between true positives and false positives and achieved high prediction accuracy for the three disease classes (Autoimmune, Bacteremia, Cancer), but it misclassified some Healthy instances as Autoimmune or Bacteremia, likely because gene expression levels vary widely within the Healthy class. (c) 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0). Peer-review under responsibility of the scientific committee of the Complex Adaptive Systems Conference, June 2021.
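The abstract reports a high AUC (0.96) but a low accuracy (58.82%) for the Healthy class. The sketch below illustrates how these two per-class metrics can diverge. It is a minimal illustration under stated assumptions, not the authors' evaluation code: the abstract does not define "accuracy for the respective classes", so it is assumed here to mean the fraction of each class's test samples that were labeled correctly, and AUC is assumed to be computed one-vs-rest. The function names, labels, and scores are all hypothetical toy values, not data from the study.

```python
from typing import Dict, Sequence

CLASSES = ["Autoimmune", "Bacteremia", "Cancer", "Healthy"]


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class's samples that were predicted correctly.

    Assumption: this is what "accuracy per class" means in the abstract.
    """
    result = {}
    for cls in CLASSES:
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        if preds:
            result[cls] = sum(p == cls for p in preds) / len(preds)
    return result


def one_vs_rest_auc(y_true: Sequence[str], scores: Sequence[float], cls: str) -> float:
    """AUC for `cls` against all other classes, via pairwise comparison.

    Equals the probability that a random sample of `cls` receives a higher
    score than a random sample of any other class (ties count as one half).
    """
    pos = [s for t, s in zip(y_true, scores) if t == cls]
    neg = [s for t, s in zip(y_true, scores) if t != cls]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up labels and scores (NOT data from the study).
y_true = ["Healthy", "Healthy", "Autoimmune", "Cancer", "Bacteremia", "Healthy"]
y_pred = ["Healthy", "Autoimmune", "Autoimmune", "Cancer", "Bacteremia", "Bacteremia"]
# Hypothetical classifier scores for the Healthy class only.
healthy_scores = [0.9, 0.4, 0.3, 0.1, 0.2, 0.35]

acc = per_class_accuracy(y_true, y_pred)
auc_healthy = one_vs_rest_auc(y_true, healthy_scores, "Healthy")
print(acc)
print(auc_healthy)
```

In this toy data every Healthy sample scores higher on the Healthy output than every non-Healthy sample, so the one-vs-rest AUC is perfect, yet two of the three Healthy samples receive a different final label because another class scores higher still. AUC measures ranking quality for one class in isolation, while per-class accuracy depends on the final label chosen across all classes.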
Pages: 264-273
Page count: 10