Distributed out-of-memory NMF on CPU/GPU architectures

Cited by: 0
Authors
Ismael Boureima
Manish Bhattarai
Maksim Eren
Erik Skau
Philip Romero
Stephan Eidenbenz
Boian Alexandrov
Affiliations
[1] Los Alamos National Laboratory, Theoretical Division
[2] Los Alamos National Laboratory, Computer, Computational, and Statistical Science Division
[3] Los Alamos National Laboratory, HPC Division
Source
The Journal of Supercomputing | 2024 / Vol. 80
Keywords
NMF; Out-of-memory; Latent features; Model selection; Distributed processing; Parallel programming; Big data; Heterogeneous computing; GPU; CUDA; NCCL; CuPy
DOI
Not available
Chinese Library Classification (CLC) number
Subject classification code
Abstract
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density $10^{-6}$.
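The batching/tiling idea mentioned in the abstract can be illustrated with a minimal, CPU-only NumPy sketch. It is not the authors' implementation: it assumes Frobenius-norm multiplicative updates and row-wise tiles of the input, and the function name batched_nmf and all of its parameters are hypothetical. Only one tile of X is touched at a time, and the small k-by-n and k-by-k products needed for the H update are accumulated across tiles. GPU transfers, CUDA streams, and NCCL communication are deliberately left out.

```python
import numpy as np


def batched_nmf(X, k, batch_rows, n_iter=200, eps=1e-9, seed=0):
    """Approximate X ~ W @ H with multiplicative updates, visiting X in row tiles.

    Hypothetical illustration only: assumes the Frobenius-norm objective and
    row-wise tiling; eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        HHt = H @ H.T                 # k x k, small enough to keep resident
        WtX = np.zeros((k, n))        # accumulates sum over tiles of W_b^T X_b
        WtW = np.zeros((k, k))        # accumulates sum over tiles of W_b^T W_b
        for start in range(0, m, batch_rows):
            rows = slice(start, start + batch_rows)
            Xb = X[rows]              # the only tile of X needed in this step
            Wb = W[rows]              # basic slicing returns a view into W
            Wb *= (Xb @ H.T) / (Wb @ HHt + eps)   # in-place W update for this tile
            WtX += Wb.T @ Xb
            WtW += Wb.T @ Wb
        H *= WtX / (WtW @ H + eps)    # H update from the accumulated products
    return W, H


def relative_error(X, W, H):
    """Relative Frobenius reconstruction error ||X - W H|| / ||X||."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because each row tile of W depends only on its own tile of X and on H, the tile size changes the peak memory footprint but not the mathematical result, which is what makes this kind of update amenable to out-of-memory processing.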
Pages: 3970-3999
Page count: 29