Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Cited by: 4
Authors
Yu, Teng [1]
Zhao, Wenlai [2, 3]
Liu, Pan [2, 3]
Janjic, Vladimir [1]
Yan, Xiaohan [4]
Wang, Shicai [5]
Fu, Haohuan [2, 3]
Yang, Guangwen [2, 3]
Thomson, John [1]
Affiliations
[1] Univ St Andrews, St Andrews KY16 9AJ, Fife, Scotland
[2] Tsinghua Univ, Beijing 100084, Peoples R China
[3] Natl Supercomp Ctr, Wuxi 214072, Jiangsu, Peoples R China
[4] Univ Calif Berkeley, Berkeley, CA 94720 USA
[5] Wellcome Trust Sanger Inst, Saffron Walden CB10 1SA, Essex, England
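Funding
UK Engineering and Physical Sciences Research Council; National Key Research and Development Program of China; China Postdoctoral Science Foundation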
Keywords
Supercomputer; heterogeneous many-core processor; data partitioning; clustering; scheduling; AutoML; ALGORITHM
DOI
10.1109/TPDS.2019.2955467
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design can process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering: it automatically generates and executes the clustering tasks with a set of candidate hyper-parameters, and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the scope of the k-means algorithm to new application areas.
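The two ideas in the abstract, partitioning the distance computation by dimension and sweeping candidate values of k, can be illustrated with a small single-machine sketch. It is not the authors' implementation: every function name is hypothetical, the dimension blocks are processed sequentially rather than on separate compute units, and the Calinski-Harabasz index is used only as a stand-in because the abstract does not specify the proposed evaluation method.

```python
import random


def sq_dist_by_dim_blocks(x, c, n_blocks=2):
    """Squared Euclidean distance, summed from per-dimension-block partials."""
    step = max(1, -(-len(x) // n_blocks))  # ceil(d / n_blocks)
    partials = [
        sum((a - b) ** 2 for a, b in zip(x[s:s + step], c[s:s + step]))
        for s in range(0, len(x), step)
    ]
    return sum(partials)  # stands in for a reduction across compute units


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(
            sq_dist_by_dim_blocks(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans(points, k, n_blocks=2, max_iter=100):
    """Plain Lloyd iterations using the block-wise distance above."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist_by_dim_blocks(
                p, centroids[j], n_blocks))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, labels


def calinski_harabasz(points, centroids, labels):
    """Stand-in evaluation score (higher is better); requires k >= 2."""
    n, k = len(points), len(centroids)
    overall = mean(points)
    within = sum(sq_dist_by_dim_blocks(p, centroids[lab])
                 for p, lab in zip(points, labels))
    between = sum(labels.count(j) * sq_dist_by_dim_blocks(centroids[j], overall)
                  for j in range(k))
    return (between / (k - 1)) / (within / (n - k))


def auto_kmeans(points, candidate_ks, n_blocks=2):
    """Run k-means for each candidate k and keep the best-scoring one."""
    scores = {}
    for k in candidate_ks:
        centroids, labels = kmeans(points, k, n_blocks)
        scores[k] = calinski_harabasz(points, centroids, labels)
    return max(scores, key=scores.get), scores


def make_blobs(seed=0, per_blob=40):
    """Hypothetical toy data: three well-separated 4-dimensional blobs."""
    rng = random.Random(seed)
    blob_centers = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 0, 0)]
    return [[c + rng.gauss(0, 0.5) for c in b]
            for b in blob_centers for _ in range(per_blob)]


if __name__ == "__main__":
    best_k, scores = auto_kmeans(make_blobs(), [2, 3, 4, 5])
    print("selected k:", best_k)
```

Splitting the distance into per-block partial sums is what would let each block of dimensions live on a different compute unit, with only the partial sums exchanged; the candidate sweep in `auto_kmeans` is independent per k, so the clustering tasks could likewise be scheduled in parallel.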
Pages: 997-1008
Page count: 12