Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA

Cited by: 10
Authors
Zhou, Shijie [1]
Kannan, Rajgopal [2]
Prasanna, Viktor K. [3]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] US Army, Res Lab, Los Angeles, CA 90094 USA
[3] Univ Southern Calif, Dept Elect & Comp Engn, Los Angeles, CA 90089 USA
Funding
U.S. National Science Foundation
Keywords
Field programmable gate arrays; Acceleration; Training; System-on-chip; Optimization; Partitioning algorithms; Bipartite graph; Machine learning; sparse matrix factorization; training acceleration; bipartite graph representation; FPGA accelerator; EFFICIENT;
DOI
10.1109/TPDS.2020.2974744
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Matrix Factorization (MF) based on Stochastic Gradient Descent (SGD) is a powerful machine learning technique for deriving the hidden features of objects from observations. In this article, we design a highly parallel architecture based on a Field-Programmable Gate Array (FPGA) to accelerate the training process of the SGD-based MF algorithm. We identify the key challenges to acceleration and propose novel algorithmic optimizations to overcome them. By transforming the SGD-based MF algorithm into a bipartite graph processing problem, we propose a 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to achieve significant speedup. First, we develop a fast heuristic graph partitioning approach to partition the bipartite graph into induced subgraphs; this enables efficient use of the FPGA's on-chip memory resources for data reuse and completely hides the data communication between the FPGA and external memory. Second, we partition all the edges of each subgraph into non-overlapping matchings to extract the maximum parallelism. Third, we propose a batching algorithm to schedule the execution of the edges inside each matching to reduce memory access conflicts on the FPGA's on-chip RAMs. Compared with non-optimized FPGA-based baseline designs, the proposed optimizations result in up to 60x data dependency reduction, 4.2x bank conflict reduction, and 15.4x speedup. We evaluate the performance of our design using a state-of-the-art FPGA device. Experimental results show that our FPGA accelerator sustains a high computing throughput of up to 217 billion floating-point operations per second (GFLOPS) for training very large real-life sparse matrices. Compared with highly optimized GPU-based accelerators, our FPGA accelerator achieves up to 12.7x speedup. Based on our optimization methodology, we also implement a software-based design on a multi-core platform, which achieves a 1.3x speedup over the state-of-the-art multi-core implementation.
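The abstract describes viewing SGD-based MF as bipartite-graph edge processing, where each observed rating is an edge and edges are grouped into vertex-disjoint matchings so they can be updated without data-dependency conflicts. The following minimal Python sketch illustrates only that high-level idea in software; the function names, latent dimension k, learning rate, and regularization values are illustrative assumptions, and the sketch does not reproduce the paper's FPGA kernel, its 3-level hierarchical partitioning, or its batching algorithm.

```python
# Sketch: SGD-based matrix factorization as bipartite-graph edge processing,
# with greedy extraction of vertex-disjoint matchings (conflict-free groups).
# All names and hyperparameters below are illustrative assumptions.
import numpy as np

def extract_matchings(edges):
    """Greedily split edges (u, i, rating) into groups in which no two edges
    share a row vertex u or a column vertex i; edges inside one group have
    no data dependencies and could be processed in parallel."""
    remaining = list(edges)
    matchings = []
    while remaining:
        used_u, used_i, group, rest = set(), set(), [], []
        for u, i, r in remaining:
            if u not in used_u and i not in used_i:
                group.append((u, i, r))
                used_u.add(u)
                used_i.add(i)
            else:
                rest.append((u, i, r))
        matchings.append(group)
        remaining = rest
    return matchings

def sgd_mf(edges, num_rows, num_cols, k=16, lr=0.01, reg=0.05, epochs=10):
    """Plain SGD over the observed entries (edges of the bipartite graph)."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((num_rows, k))   # row (e.g., user) factors
    Q = 0.1 * rng.standard_normal((num_cols, k))   # column (e.g., item) factors
    matchings = extract_matchings(edges)
    for _ in range(epochs):
        for group in matchings:          # groups are conflict-free
            for u, i, r in group:        # this loop could run in parallel
                err = r - P[u] @ Q[i]
                pu = P[u].copy()         # keep old value for the Q update
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Tiny usage example on a toy 3x3 rating matrix with 5 observed entries.
edges = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
P, Q = sgd_mf(edges, num_rows=3, num_cols=3, k=8, epochs=200)
print(np.round(P @ Q.T, 2))
```

Because no two edges in a matching touch the same row or column factor, the inner loop over a matching has no read-after-write hazards; on hardware, that is what allows multiple processing elements to update distinct on-chip RAM banks concurrently.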
Pages: 1897-1911
Number of pages: 15
Related Papers
50 records in total
  • [1] Matrix Factorization Based Collaborative Filtering with Resilient Stochastic Gradient Descent
    Abdelbar, Ashraf M.
    Elnabarawy, Islam
    Salama, Khalid M.
    Wunsch, Donald C., II
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [2] Parallelizing Stochastic Gradient Descent with Hardware Transactional Memory for Matrix Factorization
    Wu, Zhenwei
    Luo, Yingqi
    Lu, Kai
    Wang, Xiaoping
    2018 3RD INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS ENGINEERING (ICISE), 2018, : 118 - 121
  • [3] Efficient Parallel Stochastic Gradient Descent for Matrix Factorization Using GPU
    Nassar, Mohamed A.
    El-Sayed, Layla A. A.
    Taha, Yousry
    2016 11TH INTERNATIONAL CONFERENCE FOR INTERNET TECHNOLOGY AND SECURED TRANSACTIONS (ICITST), 2016, : 63 - 68
  • [4] GPUSGD: A GPU-accelerated stochastic gradient descent algorithm for matrix factorization
    Jin, Jing
    Lai, Siyan
    Hu, Su
    Lin, Jing
    Lin, Xiaola
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (14): : 3844 - 3865
  • [5] CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs
    Xie, Xiaolong
    Tan, Wei
    Fong, Liana L.
    Liang, Yun
    HPDC'17: PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2017, : 79 - 92
  • [6] Comparison Between Stochastic Gradient Descent and VLE Metaheuristic for Optimizing Matrix Factorization
    Gomez-Pulido, Juan A.
    Cortes-Toro, Enrique
    Duran-Dominguez, Arturo
    Lanza-Gutierrez, Jose M.
    Crawford, Broderick
    Soto, Ricardo
    OPTIMIZATION AND LEARNING, 2020, 1173 : 153 - 164
  • [7] Convergence of Alternating Gradient Descent for Matrix Factorization
    Ward, Rachel
    Kolda, Tamara G.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] An Efficient Approach of GPU-accelerated Stochastic Gradient Descent Method for Matrix Factorization
    Li, Feng
    Ye, Yunming
    Li, Xutao
    JOURNAL OF INTERNET TECHNOLOGY, 2019, 20 (04): : 1087 - 1097
  • [9] A NEW APPROACH OF GPU-ACCELERATED STOCHASTIC GRADIENT DESCENT METHOD FOR MATRIX FACTORIZATION
    Li, Feng
    Ye, Yunming
    Li, Xutao
    Lu, Jiajie
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2019, 15 (02): : 697 - 711
  • [10] Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
    Bogoychev, Nikolay
    Junczys-Dowmunt, Marcin
    Heafield, Kenneth
    Aji, Alham Fikri
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 2991 - 2996