Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Cited by: 4
Authors
Yu, Teng [1]
Zhao, Wenlai [2, 3]
Liu, Pan [2, 3]
Janjic, Vladimir [1]
Yan, Xiaohan [4]
Wang, Shicai [5]
Fu, Haohuan [2, 3]
Yang, Guangwen [2, 3]
Thomson, John [1]
Affiliations
[1] Univ St Andrews, St Andrews KY16 9AJ, Fife, Scotland
[2] Tsinghua Univ, Beijing 100084, Peoples R China
[3] Natl Supercomp Ctr, Wuxi 214072, Jiangsu, Peoples R China
[4] Univ Calif Berkeley, Berkeley, CA 94720 USA
[5] Wellcome Trust Sanger Inst, Saffron Walden CB10 1SA, Essex, England
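Funding
UK Engineering and Physical Sciences Research Council (EPSRC); National Key R&D Program of China; China Postdoctoral Science Foundation
Keywords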
Supercomputer; heterogeneous many-core processor; data partitioning; clustering; scheduling; AutoML; ALGORITHM;
DOI
10.1109/TPDS.2019.2955467
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid, but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering, which automatically generates and executes the clustering tasks with a set of candidate hyper-parameters, and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the scope of the k-means algorithm to new application areas.
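The two ideas in the abstract can be illustrated with a small, sequential Python sketch. It is not the authors' implementation and says nothing about the Sunway architecture. It only shows (a) that a squared Euclidean distance can be accumulated from partial sums over slices of the dimensions, which is what makes partitioning by dimension possible, and (b) a loop that runs k-means for several candidate values of k and picks one. All function names are hypothetical, the farthest-first initialisation is an assumption made to keep the run deterministic, and the selection rule in pick_k (a simple relative-gain "elbow" test) is a stand-in, since the abstract does not describe the paper's evaluation method.

```python
def dim_slices(n_dims, n_parts):
    """Split range(n_dims) into n_parts contiguous [lo, hi) slices."""
    return [(i * n_dims // n_parts, (i + 1) * n_dims // n_parts)
            for i in range(n_parts)]

def sq_dist(p, c, lo=0, hi=None):
    """Squared Euclidean distance restricted to dimensions lo..hi-1."""
    hi = len(p) if hi is None else hi
    return sum((p[d] - c[d]) ** 2 for d in range(lo, hi))

def assign(points, centroids, n_dim_parts=2):
    """Nearest-centroid labels, built from per-slice partial distances."""
    slices = dim_slices(len(points[0]), n_dim_parts)
    labels = []
    for p in points:
        dists = [0.0] * len(centroids)
        # Each slice is independent, so in a parallel setting it could be
        # handled by a different worker and the partial sums reduced after.
        for lo, hi in slices:
            for j, c in enumerate(centroids):
                dists[j] += sq_dist(p, c, lo, hi)
        labels.append(min(range(len(centroids)), key=dists.__getitem__))
    return labels

def farthest_first(points, k):
    """Deterministic initialisation (an assumption, not from the paper)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, n_iters=20):
    """Plain Lloyd iterations; returns (centroids, labels, inertia)."""
    centroids = farthest_first(points, k)
    for _ in range(n_iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

def pick_k(points, candidates, min_gain=0.5):
    """Stand-in criterion: return the first k for which moving to the next
    candidate lowers the inertia by less than the fraction min_gain."""
    results = [(k, kmeans(points, k)[2]) for k in candidates]
    for (k, inertia), (_, nxt) in zip(results, results[1:]):
        if inertia == 0 or (inertia - nxt) / inertia < min_gain:
            return k
    return results[-1][0]

# Three well-separated groups of four 2-D points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]
print(pick_k(points, [1, 2, 3, 4, 5]))  # prints 3
```

Because the partial sums over slices add up to the full squared distance, assign returns the same labels for any number of slices. The candidate loop in pick_k runs each k independently, so in principle the candidate runs could also be scheduled in parallel.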
Pages: 997-1008
Number of pages: 12