Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Cited by: 4

Authors:

Yu, Teng [1]

Zhao, Wenlai [2,3]

Liu, Pan [2,3]

Janjic, Vladimir [1]

Yan, Xiaohan [4]

Wang, Shicai [5]

Fu, Haohuan [2,3]

Yang, Guangwen [2,3]

Thomson, John [1]

Affiliations:

[1] Univ St Andrews, St Andrews KY16 9AJ, Fife, Scotland

[2] Tsinghua Univ, Beijing 100084, Peoples R China

[3] Natl Supercomp Ctr, Wuxi 214072, Jiangsu, Peoples R China

[4] Univ Calif Berkeley, Berkeley, CA 94720 USA

[5] Wellcome Trust Sanger Inst, Saffron Walden CB10 1SA, Essex, England

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2020, Vol. 31, No. 05

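Funding:

UK Engineering and Physical Sciences Research Council; National Key Research and Development Program of China; China Postdoctoral Science Foundation;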

Keywords:

Supercomputer; heterogeneous many-core processor; data partitioning; clustering; scheduling; AutoML; ALGORITHM;

DOI:

10.1109/TPDS.2019.2955467

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the hierarchical parallelism of both the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design can process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids while maintaining high performance and scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering: it automatically generates and executes clustering tasks for a set of candidate hyper-parameters, and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the k-means algorithm to new application areas.
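As a rough illustration of why partitioning by centroid and by dimension is possible at all, the sketch below computes a nearest-centroid assignment over (centroid group, dimension slice) tiles. It is not the paper's implementation: the function names `split_ranges`, `partial_sq_dist`, and `assign` are hypothetical, the tiles are processed sequentially rather than on separate cores or nodes, and the dataflow (sample) level of partitioning is omitted. It relies only on the fact that a squared Euclidean distance is a sum of per-slice partial distances.

```python
from typing import List, Sequence


def split_ranges(n: int, parts: int) -> List[range]:
    """Split indices 0..n-1 into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def partial_sq_dist(x: Sequence[float], c: Sequence[float], dims: range) -> float:
    """Squared distance restricted to one slice of dimensions."""
    return sum((x[d] - c[d]) ** 2 for d in dims)


def assign(x: Sequence[float], centroids: Sequence[Sequence[float]],
           dim_parts: int, centroid_parts: int) -> int:
    """Index of the nearest centroid, computed tile by tile.

    Assumption: in a parallel setting each (centroid group, dimension slice)
    tile would go to a different worker; here they simply run in a loop.
    """
    dim_slices = split_ranges(len(x), dim_parts)
    best_idx, best_dist = -1, float("inf")
    for group in split_ranges(len(centroids), centroid_parts):
        for j in group:
            # Reduce the per-slice partial sums into the full squared distance.
            dist = sum(partial_sq_dist(x, centroids[j], s) for s in dim_slices)
            if dist < best_dist:
                best_idx, best_dist = j, dist
    return best_idx
```

Because the partial sums add up to the full squared distance, the result does not depend on how many dimension slices or centroid groups are used; only the amount of work per tile changes.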

Pages: 997-1008

Page count: 12
