Distributed Clustering Approach by Apache Pyspark Based on SEER for Clinical Data

Cited by: 0
Authors
Ramesh, R. [1 ]
Judy, M. V. [1 ]
Affiliations
[1] Cochin Univ Science & Technol CUSAT Cochin, Dept Comp Applicat, Kochi 682022, Kerala, India
Keywords
Computational epidemiology; big data analytics; spark partitions; Apache spark; Hadoop clustering; NEW-YORK;
DOI
10.1142/S0218001422400067
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data clustering is a thoroughly studied problem in data mining. As the volume of data to be analyzed grows exponentially, clustering large clinical datasets such as the Surveillance, Epidemiology, and End Results (SEER) cancer data sets raises several problems. Traditional clustering methods are severely constrained in speed, productivity, and adaptability. This paper summarizes the most recent distributed clustering algorithms, organized by the computing platforms used to process large volumes of data. The purpose of this work is to offer an optimized distributed clustering strategy that reduces the algorithm's total execution time. We obtained, preprocessed, and analyzed clinical SEER data on liver cancer, respiratory cancer, human immunodeficiency virus (HIV)-related lymphoma, and lung cancer for large-scale clustering analysis. The paper makes three main contributions. First, three existing PySpark distributed clustering algorithms were evaluated on SEER clinical data using a simulated New York cancer dataset. Second, systemic inflammatory response syndrome (SIRS) model inference was carried out and described using three SEER cancer datasets. Third, using lung cancer data, we proposed an optimized distributed bisecting k-means method. We report the results of the proposed optimized distributed clustering technique, which demonstrate the performance improvement.
Pages: 23