An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data

Cited by: 66
Authors
Nidheesh, N. [1]
Nazeer, K. A. Abdul [2]
Ameer, P. M. [1]
Affiliations
[1] Natl Inst Technol Calicut, Dept Elect & Commun Engn, Calicut 673601, Kerala, India
[2] Natl Inst Technol Calicut, Dept Comp Sci & Engn, Calicut 673601, Kerala, India
Keywords
K-Means; Clustering; Cancer subtype prediction; Centroid initialization; Density based; Gene expression data; Class discovery
DOI
10.1016/j.compbiomed.2017.10.014
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Background: Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data, because it is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-determinism of K-Means is due to its random selection of data points as initial centroids.
Method: We propose an improved, density-based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select, as the initial centroids, data points that belong to dense regions and that are adequately separated in feature space.
Results: We compared the proposed algorithm with eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm used for cancer data classification, based on their performance on a set of datasets that includes ten cancer gene expression datasets. The proposed algorithm showed better overall performance than the others.
Conclusion: There is a pressing need in the biomedical domain for simple, easy-to-use and more accurate machine learning tools for cancer subtype prediction. The proposed algorithm is simple, easy to use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data.
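The abstract states the key idea (initial centroids taken from dense, well-separated regions) but does not give the density estimate or the separation criterion. The following is therefore only a minimal sketch of one deterministic, density-based initialization, not the authors' algorithm. The names `density_based_init` and `deterministic_kmeans`, the k-nearest-neighbour density proxy, the `n_neighbors` parameter, and the "density times distance to nearest chosen centroid" score are all assumptions made for illustration.

```python
import numpy as np


def density_based_init(X, k, n_neighbors=5):
    """Deterministically pick k initial centroids from X (n_samples x n_features).

    Assumed heuristic (not taken from the paper): prefer points that are
    dense and far from the centroids already chosen.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Density proxy: inverse of the mean distance to the nearest neighbours
    # (column 0 of the sorted distances is the point itself, so skip it).
    m = max(1, min(n_neighbors, n - 1))
    knn = np.sort(dist, axis=1)[:, 1:m + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-12)
    # First centroid: the densest point.
    chosen = [int(np.argmax(density))]
    # Remaining centroids: maximise density * distance to nearest chosen point.
    while len(chosen) < k:
        nearest = dist[:, chosen].min(axis=1)
        score = density * nearest
        score[chosen] = -np.inf  # never pick the same point twice
        chosen.append(int(np.argmax(score)))
    return X[chosen].copy()


def deterministic_kmeans(X, k, n_iter=100):
    """Standard Lloyd iterations started from the deterministic seeds above."""
    centroids = density_based_init(X, k)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute means; keep the old centroid if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because no step draws random numbers, calling `deterministic_kmeans(X, k)` twice on the same matrix returns the same labels, which is the stability property the abstract emphasizes. The full pairwise distance matrix costs memory quadratic in the number of samples; that is usually acceptable for gene expression datasets with a few hundred samples, but it is a design choice of this sketch rather than something stated in the source.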
Pages: 213-221
Page count: 9