Euclidean distance stratified random sampling based clustering model for big data mining

Cited by: 1
Authors
Pandey, Kamlesh Kumar [1]
Shukla, Diwakar [1]
Affiliations
[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, Madhya Pradesh, India
Keywords
big data mining; big data sampling; big data clustering; Euclidean distance based stratum; random sampling; sample extension; SSK-Means; stratified sampling; FRAMEWORK; ALGORITHM;
DOI
10.1002/cmm4.1206
Chinese Library Classification number
O29 [Applied Mathematics];
Discipline classification code
070104;
Abstract
Big data mining concerns large-scale data analysis and faces computational cost challenges driven by the exponential growth of digital technologies. Classical data mining algorithms struggle with computational efficiency, memory utilization, resource optimization, scale-up, and speed-up when applied to big data. Sampling is one of the most effective data reduction techniques: it reduces computational cost and improves the scalability and speed of any data mining algorithm in both single-machine and multiple-machine execution environments. This study proposes a Euclidean distance-based method for stratum creation and a stratified random sampling-based big data mining model that uses the K-Means clustering algorithm (SSK-Means) in a single-machine execution environment. SSK-Means achieves better cluster quality, speed-up, scale-up, and memory utilization than random sampling-based K-Means and classical K-Means, as evaluated with the silhouette coefficient, Davies-Bouldin index, and Calinski-Harabasz index internal measures together with execution time and speedup ratio.
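The pipeline described in the abstract (build strata from Euclidean distances, draw a random sample from each stratum, cluster the sample with K-Means, then extend the result to the full data set) can be illustrated with a minimal sketch. The abstract does not give the exact stratum rule, allocation scheme, or extension step, so everything below is an assumption made for illustration: points are ranked by distance to the global mean and cut into near-equal strata, each stratum is sampled at the same fraction, and sample extension assigns every point to its nearest learned centroid. The function names (`make_strata`, `stratified_sample`, `ssk_means_sketch`) and parameters (`n_strata`, `fraction`) are hypothetical and do not come from the paper.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def make_strata(data, n_strata):
    """Assumed rule: rank points by distance to the global mean, then cut the
    ranking into at most n_strata contiguous, near-equal groups of indices."""
    center = mean_point(data)
    order = sorted(range(len(data)), key=lambda i: euclidean(data[i], center))
    size = math.ceil(len(order) / n_strata)
    return [order[s:s + size] for s in range(0, len(order), size)]


def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample (without replacement) from every stratum."""
    sample = []
    for stratum in strata:
        k = max(1, round(fraction * len(stratum)))
        sample.extend(rng.sample(stratum, k))
    return sample


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: euclidean(point, centroids[c]))


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's K-Means on a list of points; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(g) if g else centroids[c] for c, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def ssk_means_sketch(data, k, n_strata=4, fraction=0.2, seed=0):
    """Stratify, sample, cluster the sample, then label the full data set."""
    rng = random.Random(seed)
    strata = make_strata(data, n_strata)
    sample_idx = stratified_sample(strata, fraction, rng)
    centroids = kmeans([data[i] for i in sample_idx], k, rng)
    # Assumed sample extension: label every point by its nearest centroid.
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels, sample_idx


if __name__ == "__main__":
    gen = random.Random(1)
    data = [(gen.gauss(0, 0.5), gen.gauss(0, 0.5)) for _ in range(200)]
    data += [(gen.gauss(10, 0.5), gen.gauss(10, 0.5)) for _ in range(200)]
    centroids, labels, sample_idx = ssk_means_sketch(data, k=2)
    print(len(sample_idx), "sampled points;", len(set(labels)), "clusters")
```

In this sketch, K-Means runs only on the sampled points, which is where the reduction in computational cost would come from; the full data set is touched once to build the strata and once to assign labels.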
Pages: 14