Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Cited by: 4
Authors
Yu, Teng [1 ]
Zhao, Wenlai [2 ,3 ]
Liu, Pan [2 ,3 ]
Janjic, Vladimir [1 ]
Yan, Xiaohan [4 ]
Wang, Shicai [5 ]
Fu, Haohuan [2 ,3 ]
Yang, Guangwen [2 ,3 ]
Thomson, John [1 ]
Affiliations
[1] Univ St Andrews, St Andrews KY16 9AJ, Fife, Scotland
[2] Tsinghua Univ, Beijing 100084, Peoples R China
[3] Natl Supercomp Ctr, Wuxi 214072, Jiangsu, Peoples R China
[4] Univ Calif Berkeley, Berkeley, CA 94720 USA
[5] Wellcome Trust Sanger Inst, Saffron Walden CB10 1SA, Essex, England
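Funding
Engineering and Physical Sciences Research Council (UK); National Key Research and Development Program of China; China Postdoctoral Science Foundation;
Keywords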
Supercomputer; heterogeneous many-core processor; data partitioning; clustering; scheduling; AutoML; ALGORITHM;
DOI
10.1109/TPDS.2019.2955467
Chinese Library Classification number
TP301 [Theory and methods];
Discipline classification code
081202 ;
Abstract
This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering, which automatically generates and executes the clustering tasks with a set of candidate hyper-parameters and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution can not only achieve high performance and scalability for problems with massive high-dimensional data, but also support clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the scope of the k-means algorithm to new application areas.
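The idea of partitioning by centroid and by dimension can be illustrated with a small, single-process sketch. It is not the authors' implementation and does not model the Sunway TaihuLight hardware. It only shows that a squared Euclidean distance can be computed as a sum of partial distances over dimension blocks, and that the nearest centroid can be found by scanning centroid groups. The names `split_ranges`, `partial_sq_dist`, `assign`, `dim_parts`, and `centroid_parts` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def split_ranges(n: int, parts: int) -> List[range]:
    """Split range(n) into `parts` contiguous, nearly equal index ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def partial_sq_dist(x: Sequence[float], c: Sequence[float], dims: range) -> float:
    """Squared distance restricted to one block of dimensions.

    In a parallel setting each block could be handled by a different
    worker (an assumption of this sketch, not a detail from the abstract).
    """
    return sum((x[j] - c[j]) ** 2 for j in dims)


def assign(x: Sequence[float], centroids: Sequence[Sequence[float]],
           dim_parts: int, centroid_parts: int) -> int:
    """Return the index of the nearest centroid to sample x."""
    dim_blocks = split_ranges(len(x), dim_parts)
    best_idx, best_dist = -1, float("inf")
    # Outer loop: one group of centroids at a time (centroid partition).
    for group in split_ranges(len(centroids), centroid_parts):
        for k in group:
            # Inner sum: combine the partial distances of all dimension blocks.
            d = sum(partial_sq_dist(x, centroids[k], dims) for dims in dim_blocks)
            if d < best_dist:
                best_idx, best_dist = k, d
    return best_idx


centroids = [
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0],
    [0.0, 10.0, 0.0, 10.0],
]
print(assign([9.0, 9.5, 10.0, 11.0], centroids, dim_parts=2, centroid_parts=2))
```

Because the partial sums are independent, the number of dimension blocks does not change the result; it only changes how the work could be distributed.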
Pages: 997 - 1008
Page count: 12
Related papers
50 in total
  • [21] Large-scale k-means clustering with user-centric privacy-preservation
    Sakuma, Jun
    Kobayashi, Shigenobu
    KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 25 (02) : 253 - 279
  • [22] Hierarchical K-means Method for Clustering Large-Scale Advanced Metering Infrastructure Data
    Xu, Tian-Shi
    Chiang, Hsiao-Dong
    Liu, Guang-Yi
    Tan, Chin-Woo
    IEEE TRANSACTIONS ON POWER DELIVERY, 2017, 32 (02) : 609 - 616
  • [23] Large-scale Parallel Design for Cryo-EM Structure Determination on Heterogeneous Many-core Architectures
    Qiao, Liang
    Yu, Hongkun
    Wang, Kunpeng
    Sun, Ruixin
    Zhao, Wenlai
    Fu, Haohuan
    Yang, Guangwen
    2019 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2019: 711 - 716
  • [24] An OpenCL Compiler for the Homegrown Heterogeneous Many-Core Processor on the Sunway TaihuLight Supercomputer
    Wu M.-C.
    Huang L.
    Liu Y.
    He X.-B.
    Feng X.-B.
    Liu, Ying (liuying2007@ict.ac.cn), 2018, Science Press (41): 2236 - 2250
  • [25] A Semantic Partition Algorithm Based on Improved K-Means Clustering for Large-Scale Indoor Areas
    Shi, Kegong
    Yan, Jinjin
    Yang, Jinquan
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2024, 13 (02)
  • [26] K-means Clustering Algorithm for Large-scale Chinese Commodity Information Web Based on Hadoop
    Geng Yushui
    Zhang Lishuo
    14TH INTERNATIONAL SYMPOSIUM ON DISTRIBUTED COMPUTING AND APPLICATIONS FOR BUSINESS, ENGINEERING AND SCIENCE (DCABES 2015), 2015: 256 - 259
  • [27] Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means
    Hamid Hadipour
    Chengyou Liu
    Rebecca Davis
    Silvia T. Cardona
    Pingzhao Hu
    BMC Bioinformatics, 23
  • [28] Optimal Operation of Large-scale Electric Vehicles Based on Improved K-means Clustering Algorithm
    Liu, Jian
    Xu, Weifeng
    Liu, Zhijun
    Fu, Guanhua
    Jiang, Yunpeng
    Zhao, Ergang
    PROCEEDINGS OF 2022 5TH INTERNATIONAL CONFERENCE ON ROBOT SYSTEMS AND APPLICATIONS, ICRSA2022, 2022: 23 - 28
  • [29] Efficient adaptive large-scale text clustering method based on genetic K-means algorithm
    Dai, Wenhua
    Jiao, Cuizhen
    He, Tingting
    RECENT ADVANCE OF CHINESE COMPUTING TECHNOLOGIES, 2007: 281 - 285
  • [30] A MapReduce-based parallel K-means clustering for large-scale CIM data verification
    Deng, Chuang
    Liu, Yang
    Xu, Lixiong
    Yang, Jie
    Liu, Junyong
    Li, Siguang
    Li, Maozhen
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (11): 3096 - 3114