BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs

被引：7

作者：

Chen, Jing ^{[1
]}

Fang, Jianbin ^{[1
]}

Liu, Weifeng ^{[2
]}

Yang, Canqun ^{[1
]}

机构：

[1] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China

[2] China Univ Petr, Dept Comp Sci & Technol, Super Sci Software Lab, Beijing 102249, Peoples R China

来源：

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2021年 / 32卷 / 09期

基金：

中国国家自然科学基金; 国家重点研发计划;

关键词：

Sparse matrices; Motion pictures; Two dimensional displays; Graphics processing units; Artificial neural networks; Matrix decomposition; Logic gates; Matrix factorization; alternating least squares; data reuse; data reordering; performance evaluation; GPGPUs;

D O I：

10.1109/TPDS.2021.3064942

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

Matrix factorization on sparse matrices has been proven to be an effective approach for data mining and machine learning. However, the prior parallel implementations for matrix factorization fail to capture the internal social property embedded in real-world use cases. This article presents an efficient implementation of the alternative least squares (ALS) algorithm called BALSbuilt on top of a new sparse matrix format for parallel matrix factorization. The BALS storage format organizes the sparse matrix into 2D tiles to avoid repeated data loads and improve data reuses. We further propose a data reordering technique to sort sparse matrices according to nonzeros. The experimental results show that BALS can yield a superior performance than state-of-the-art implementations, i.e., our BALS generally runs faster than Gates' implementation over different latent feature sizes, with a speedup of up to 2.08x on K20C, 3.72x on TITAN X and 3.13x on TITAN RTX. When compared with alternative matrix factorization algorithms, our BALS consistently outperforms CDMF, cuMF_CCD, and cuMF_SGD over various latent feature sizes and datasets. The reordering technique can provide an extra improvement of up to 23.68 percent on K20C, 19.87 percent on TITAN X and 20.38 percent on TITAN RTX.

引用

页码：2291 / 2302

页数：12

共 35 条

[1]

Agarwal Alekh, 2011, Advances in neural information processing systems, V24

[2]

[Anonymous], 2016, P 25 ACM INT S HIGH

[3]

[Anonymous], 2011, ADV NEURAL INFORM PR

[4]

[Anonymous], 2014, ARXIV14110602

[5] Efficient and Portable ALS Matrix Factorization for Recommender Systems [J].

Chen, Jing ;

Fang, Jianbin ;

Liu, Weifeng ;

Tang, Tao ;

Chen, Xuhao ;

Yang, Canqun .

2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, :409-418

[6] Accelerated parallel and distributed algorithm using limited internal memory for nonnegative matrix factorization [J].

Duy Khuong Nguyen ;

Tu Bao Ho .

JOURNAL OF GLOBAL OPTIMIZATION, 2017, 68 (02) :307-328

[7]

Gates M, 2015, PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, P667, DOI 10.1109/BigData.2015.7363811

[8]

Gemulla R, 2011, P 17 ACM SIGKDD INT, P69

[9]

Jin Z., 2021, P IEEE INT PAR DISTR

[10]

Kaleem Rashid., 2015, Proceedings of the 8thWorkshop on General Purpose Processing Using GPUs, GPGPU 2015, P81, DOI DOI 10.1145/2716282.2716289

← 1 2 3 4 →