Comparison of Supervised Models in Hepatocellular Carcinoma Tumor Classification Based on Expression Data Using Principal Component Analysis (PCA)

Cited by: 1
Authors
Siregar, Anggrainy Togi Marito [1]
Siswantining, Titin [1]
Bustamam, Alhadi [1]
Sarwinda, Devvi [1]
Affiliations
[1] Univ Indonesia, Fac Math & Nat Sci, Dept Math, Lab Bioinformat & Adv Comp, Kampus Baru UI, Depok 16424, Indonesia
Source
SYMPOSIUM ON BIOMATHEMATICS 2019 (SYMOMATH 2019) | 2020 / Vol. 2264
Keywords
Dimension Reduction; Gene Expression; Hepatocellular Carcinoma; Support Vector Classifier; PCA; REDUCTION;
DOI
10.1063/5.0023931
Chinese Library Classification
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Hepatocellular Carcinoma is one of the cancers with a high mortality rate. Whether a person has a Hepatocellular Carcinoma tumor can be determined by observing gene expression. Gene expression data are obtained from a microarray, a laboratory tool that produces gene probes. In this case, there are 54675 gene expression features for 40 samples (Homo sapiens). With so many expressed genes, it is difficult to classify a person as affected or not affected by a Hepatocellular Carcinoma tumor, so the number of features must be reduced without losing the information in the data. One tool for dimension reduction in machine learning is Principal Component Analysis (PCA). PCA is a multivariate analysis that transforms correlated original features into new features that are uncorrelated with each other, reducing the number of features so that the data have smaller dimensions while still explaining most of the variance of the original features. The objective of this research is to find the best percentage of variance to retain in the features generated by PCA and then to fit several models to those features. The models used are a Logistic Regression Classifier, a Support Vector Machine (SVM) Classifier, and a Random Forest Classifier. The Logistic Regression model reaches the best accuracy, 0.875, starting from 40% of the variance retained by PCA, while the Random Forest Classifier and the Support Vector Machine reach an accuracy of 0.875 when the retained variance is above 60%. These results can inform the choice of the best percentage when using PCA for dimension reduction, especially for gene expression in microarray data.
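The procedure described in the abstract (retain a chosen share of variance with PCA, then fit each classifier on the reduced features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins for the microarray matrix, and the use of scikit-learn, the 80/20 hold-out split, the standardization step, the linear SVM kernel, and all hyperparameters are assumptions not given in the record.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (NOT the paper's data):
# 40 samples, 2000 features, with a class shift on the first 50 features.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 2000
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[y == 1, :50] += 1.5

# Assumed hold-out scheme: 80% train, 20% test, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Factories so each pipeline gets a fresh, unfitted model.
# Hyperparameters here are illustrative assumptions.
models = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "svm": lambda: SVC(kernel="linear"),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for variance in (0.2, 0.4, 0.6, 0.8):
    for name, make_model in models.items():
        # A float n_components (with svd_solver="full") keeps the fewest
        # components whose cumulative explained variance exceeds that fraction.
        pipe = make_pipeline(
            StandardScaler(),
            PCA(n_components=variance, svd_solver="full"),
            make_model(),
        )
        pipe.fit(X_train, y_train)
        n_kept = pipe.named_steps["pca"].n_components_
        results[(variance, name)] = (n_kept, pipe.score(X_test, y_test))

for (variance, name), (n_kept, acc) in results.items():
    print(f"variance={variance:.0%} model={name} components={n_kept} accuracy={acc:.3f}")
```

Fitting PCA inside the pipeline means the components are estimated from the training samples only, which keeps the held-out accuracy from being influenced by the test set. The accuracies printed for the synthetic data say nothing about the values reported in the abstract.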
Pages: 8