FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering

Cited by: 11
Authors
Zhou, Shijie [1 ]
Kannan, Rajgopal [2 ]
Min, Yu [1 ]
Prasanna, Viktor K. [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90089 USA
[2] US Army, Res Lab, Los Angeles, CA 90094 USA
Source
PROCEEDINGS OF THE 2018 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'18) | 2018
Funding
National Science Foundation (USA);
Keywords
Sparse matrix factorization; Training process; Bipartite graph representation; MATRIX FACTORIZATION;
DOI
10.1145/3174243.3174252
CLC Classification
TP301 [Theory, Methods];
Subject Classification
081202;
Abstract
Sparse matrix factorization using Stochastic Gradient Descent (SGD) is a popular technique for deriving latent features from observations. SGD is widely used for Collaborative Filtering (CF), itself a well-known machine learning technique for recommender systems. In this paper, we develop an FPGA-based accelerator, FASTCF, to accelerate the SGD-based CF algorithm. FASTCF consists of parallel, pipelined processing units which concurrently process distinct user ratings by accessing a shared on-chip buffer. We design FASTCF through a holistic analysis of the specific design challenges for the acceleration of SGD-based CF on FPGA. Based on our analysis of these design challenges, we develop a bipartite graph processing approach with a novel 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of on-chip feature vector data to significantly accelerate the processing of this bipartite graph. First, we develop a fast heuristic to partition the input graph into induced subgraphs; this enables FASTCF to efficiently buffer vertex data for reuse and completely hide communication overhead. Second, we partition all the edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching to reduce concurrent memory access conflicts to the shared on-chip buffer. Compared with non-optimized baseline designs, the hierarchical partitioning approach results in up to 60x data dependency reduction, 4.2x bank conflict reduction, and 15.4x speedup. We implement FASTCF on a state-of-the-art FPGA and evaluate its performance using three large real-life datasets. Experimental results show that FASTCF sustains a high throughput of up to 217 billion floating-point operations per second (GFLOPS). Compared with state-of-the-art multi-core and GPU implementations, FASTCF demonstrates 13.3x and 12.7x speedup, respectively.
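The workload FASTCF accelerates can be illustrated in a few lines. Below is a minimal software sketch (not the paper's hardware design) of SGD-based matrix factorization for CF: each observed rating is an edge of the user-item bipartite graph, and processing it updates the latent feature vectors of its two endpoint vertices. Two edges sharing a user or an item touch the same vector, which is exactly the data dependency the paper's matching-based partitioning avoids. All hyperparameters (rank, learning rate, regularization, epochs) are illustrative choices, not values from the paper.

```python
# Sketch of SGD-based matrix factorization for collaborative filtering.
# The rating matrix R is approximated as P @ Q.T, where row P[u] is the
# latent feature vector of user u and row Q[i] that of item i.
import numpy as np

def sgd_mf(ratings, num_users, num_items, rank=8, lr=0.05, reg=0.02, epochs=200):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(num_users, rank))  # user feature vectors
    Q = rng.normal(scale=0.1, size=(num_items, rank))  # item feature vectors
    for _ in range(epochs):
        for u, i, r in ratings:            # one edge of the bipartite graph
            err = r - P[u] @ Q[i]          # prediction error for rating (u, i)
            # Gradient steps on both endpoint vertices; edges sharing a
            # vertex must not run concurrently (hence FASTCF's matchings).
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Toy usage: 2 users, 2 items, 3 observed ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]
P, Q = sgd_mf(ratings, num_users=2, num_items=2)
```

After training, `P[u] @ Q[i]` predicts user u's rating of item i; the same per-edge update is what FASTCF's pipelined processing units execute against the shared on-chip feature-vector buffer.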
Pages: 259-268
Page count: 10