Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Cited by: 4

Authors:

Yu, Teng ^{[1]}

Zhao, Wenlai ^{[2,3]}

Liu, Pan ^{[2,3]}

Janjic, Vladimir ^{[1]}

Yan, Xiaohan ^{[4]}

Wang, Shicai ^{[5]}

Fu, Haohuan ^{[2,3]}

Yang, Guangwen ^{[2,3]}

Thomson, John ^{[1]}

Affiliations:

[1] Univ St Andrews, St Andrews KY16 9AJ, Fife, Scotland

[2] Tsinghua Univ, Beijing 100084, Peoples R China

[3] Natl Supercomp Ctr, Wuxi 214072, Jiangsu, Peoples R China

[4] Univ Calif Berkeley, Berkeley, CA 94720 USA

[5] Wellcome Trust Sanger Inst, Saffron Walden CB10 1SA, Essex, England

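Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2020 / Vol. 31 / No. 05

Funding:

UK Engineering and Physical Sciences Research Council; National Key Research and Development Program of China; China Postdoctoral Science Foundation;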

Keywords:

Supercomputer; heterogeneous many-core processor; data partitioning; clustering; scheduling; AutoML; ALGORITHM;

DOI:

10.1109/TPDS.2019.2955467

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering: it automatically generates and executes the clustering tasks with a set of candidate hyper-parameters, and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the scope of the k-means algorithm to new application areas.
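The two ideas in the abstract can be illustrated with a small, sequential sketch. It is not the authors' implementation and involves no Sunway-specific code. Several parts are assumptions made only for illustration: the function names (`sq_dist_by_dimension`, `kmeans`, `silhouette`, `auto_kmeans`), the farthest-point initialization, and the use of the mean silhouette coefficient as the evaluation method, since the abstract does not say which evaluation method the paper proposes. Partitioning by dimension is mimicked by splitting each distance computation into per-slice partial sums, which is the work that separate cores could perform in a real parallel design.

```python
import math


def sq_dist_by_dimension(x, c, n_parts=2):
    """Squared Euclidean distance as a sum of per-slice partial sums.

    Each slice of dimensions stands in for work a separate core could do.
    """
    d = len(x)
    step = max(1, math.ceil(d / n_parts))
    partials = [
        sum((x[j] - c[j]) ** 2 for j in range(start, min(start + step, d)))
        for start in range(0, d, step)
    ]
    return sum(partials)


def assign(points, centroids, n_parts):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda i: sq_dist_by_dimension(p, centroids[i], n_parts))
        for p in points
    ]


def kmeans(points, k, n_parts=2, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialization
    (an illustrative choice, not taken from the paper)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(
            sq_dist_by_dimension(p, c, n_parts) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = assign(points, centroids, n_parts)
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids, n_parts)


def silhouette(points, labels):
    """Mean silhouette coefficient (assumed stand-in evaluation method)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute a score of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def auto_kmeans(points, candidate_ks, n_parts=2):
    """Run k-means for each candidate k and keep the best-scoring one."""
    scores = {}
    for k in candidate_ks:
        _, labels = kmeans(points, k, n_parts)
        scores[k] = silhouette(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

A call such as `auto_kmeans(points, [2, 3, 4, 5])` returns the selected k together with the score of every candidate, so the caller does not need to know the number of clusters in advance. The candidate runs are independent of one another, which is what makes this kind of search a natural fit for task-level scheduling on a large machine.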


Pages: 997-1008

Page count: 12

