Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA

Cited by: 10
Authors
Zhou, Shijie [1]
Kannan, Rajgopal [2]
Prasanna, Viktor K. [3]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] US Army, Res Lab, Los Angeles, CA 90094 USA
[3] Univ Southern Calif, Dept Elect & Comp Engn, Los Angeles, CA 90089 USA
Funding
U.S. National Science Foundation
Keywords
Field programmable gate arrays; Acceleration; Training; System-on-chip; Optimization; Partitioning algorithms; Bipartite graph; Machine learning; sparse matrix factorization; training acceleration; bipartite graph representation; FPGA accelerator; EFFICIENT;
DOI
10.1109/TPDS.2020.2974744
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Matrix Factorization (MF) based on Stochastic Gradient Descent (SGD) is a powerful machine learning technique for deriving the hidden features of objects from observations. In this article, we design a highly parallel architecture based on a Field-Programmable Gate Array (FPGA) to accelerate the training process of the SGD-based MF algorithm. We identify the key challenges to acceleration and propose novel algorithmic optimizations to overcome them. By transforming the SGD-based MF algorithm into a bipartite graph processing problem, we propose a 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to achieve significant speedup. First, we develop a fast heuristic graph partitioning approach to partition the bipartite graph into induced subgraphs; this enables efficient use of the FPGA's on-chip memory resources for data reuse and completely hides the data communication between the FPGA and external memory. Second, we partition all the edges of each subgraph into non-overlapping matchings to extract the maximum parallelism. Third, we propose a batching algorithm to schedule the execution of the edges inside each matching to reduce memory access conflicts on the FPGA's on-chip RAMs. Compared with non-optimized FPGA-based baseline designs, the proposed optimizations result in up to 60x data dependency reduction, 4.2x bank conflict reduction, and 15.4x speedup. We evaluate the performance of our design using a state-of-the-art FPGA device. Experimental results show that our FPGA accelerator sustains a high computing throughput of up to 217 billion floating-point operations per second (GFLOPS) for training very large real-life sparse matrices. Compared with highly optimized GPU-based accelerators, our FPGA accelerator achieves up to 12.7x speedup. Based on our optimization methodology, we also implement a software-based design on a multi-core platform, which achieves a 1.3x speedup over the state-of-the-art multi-core implementation.
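The abstract describes viewing SGD-based MF as bipartite-graph edge processing, where each observed rating is an edge and edges are grouped into vertex-disjoint matchings so they can be updated without data-dependency conflicts. The following minimal Python sketch illustrates only that high-level idea in software; the function names, latent dimension k, learning rate, and regularization values are illustrative assumptions, and the sketch does not reproduce the paper's FPGA kernel, its 3-level hierarchical partitioning, or its batching algorithm.

```python
# Sketch: SGD-based matrix factorization as bipartite-graph edge processing,
# with greedy extraction of vertex-disjoint matchings (conflict-free groups).
# All names and hyperparameters below are illustrative assumptions.
import numpy as np

def extract_matchings(edges):
    """Greedily split edges (u, i, rating) into groups in which no two edges
    share a row vertex u or a column vertex i; edges inside one group have
    no data dependencies and could be processed in parallel."""
    remaining = list(edges)
    matchings = []
    while remaining:
        used_u, used_i, group, rest = set(), set(), [], []
        for u, i, r in remaining:
            if u not in used_u and i not in used_i:
                group.append((u, i, r))
                used_u.add(u)
                used_i.add(i)
            else:
                rest.append((u, i, r))
        matchings.append(group)
        remaining = rest
    return matchings

def sgd_mf(edges, num_rows, num_cols, k=16, lr=0.01, reg=0.05, epochs=10):
    """Plain SGD over the observed entries (edges of the bipartite graph)."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((num_rows, k))   # row (e.g., user) factors
    Q = 0.1 * rng.standard_normal((num_cols, k))   # column (e.g., item) factors
    matchings = extract_matchings(edges)
    for _ in range(epochs):
        for group in matchings:          # groups are conflict-free
            for u, i, r in group:        # this loop could run in parallel
                err = r - P[u] @ Q[i]
                pu = P[u].copy()         # keep old value for the Q update
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Tiny usage example on a toy 3x3 rating matrix with 5 observed entries.
edges = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
P, Q = sgd_mf(edges, num_rows=3, num_cols=3, k=8, epochs=200)
print(np.round(P @ Q.T, 2))
```

Because no two edges in a matching touch the same row or column factor, the inner loop over a matching has no read-after-write hazards; on hardware, that is what allows multiple processing elements to update distinct on-chip RAM banks concurrently.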
Pages: 1897-1911
Number of pages: 15
Related Papers
50 records in total
  • [1] Matrix Factorization Based Collaborative Filtering with Resilient Stochastic Gradient Descent
    Abdelbar, Ashraf M.
    Elnabarawy, Islam
    Salama, Khalid M.
    Wunsch, Donald C., II
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [2] Parallelizing Stochastic Gradient Descent with Hardware Transactional Memory for Matrix Factorization
    Wu, Zhenwei
    Luo, Yingqi
    Lu, Kai
    Wang, Xiaoping
    2018 3RD INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS ENGINEERING (ICISE), 2018, : 118 - 121
  • [3] Efficient Parallel Stochastic Gradient Descent for Matrix Factorization Using GPU
    Nassar, Mohamed A.
    El-Sayed, Layla A. A.
    Taha, Yousry
    2016 11TH INTERNATIONAL CONFERENCE FOR INTERNET TECHNOLOGY AND SECURED TRANSACTIONS (ICITST), 2016, : 63 - 68
  • [4] GPUSGD: A GPU-accelerated stochastic gradient descent algorithm for matrix factorization
    Jin, Jing
    Lai, Siyan
    Hu, Su
    Lin, Jing
    Lin, Xiaola
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (14): : 3844 - 3865
  • [5] CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs
    Xie, Xiaolong
    Tan, Wei
    Fong, Liana L.
    Liang, Yun
    HPDC'17: PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2017, : 79 - 92
  • [6] Comparison Between Stochastic Gradient Descent and VLE Metaheuristic for Optimizing Matrix Factorization
    Gomez-Pulido, Juan A.
    Cortes-Toro, Enrique
    Duran-Dominguez, Arturo
    Lanza-Gutierrez, Jose M.
    Crawford, Broderick
    Soto, Ricardo
    OPTIMIZATION AND LEARNING, 2020, 1173 : 153 - 164
  • [7] Convergence of Alternating Gradient Descent for Matrix Factorization
    Ward, Rachel
    Kolda, Tamara G.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] An Efficient Approach of GPU-accelerated Stochastic Gradient Descent Method for Matrix Factorization
    Li, Feng
    Ye, Yunming
    Li, Xutao
    JOURNAL OF INTERNET TECHNOLOGY, 2019, 20 (04): : 1087 - 1097
  • [9] A NEW APPROACH OF GPU-ACCELERATED STOCHASTIC GRADIENT DESCENT METHOD FOR MATRIX FACTORIZATION
    Li, Feng
    Ye, Yunming
    Li, Xutao
    Lu, Jiajie
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2019, 15 (02): : 697 - 711
  • [10] Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
    Bogoychev, Nikolay
    Junczys-Dowmunt, Marcin
    Heafield, Kenneth
    Aji, Alham Fikri
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 2991 - 2996