Effective High-Level Synthesis for High-Performance Graph Processing

Cited by: 0

Authors:

Tang J. ^{[1,2,3,4]}

Zheng L. ^{[1,2,3,4]}

Liao X. ^{[1,2,3,4]}

Jin H. ^{[1,2,3,4]}

Affiliations:

[1] College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan

[2] National Engineering Research Center for Big Data Technology and System, Huazhong University of Science and Technology, Wuhan

[3] Key Laboratory of Services Computing Technology and System, Huazhong University of Science and Technology, Ministry of Education, Wuhan

[4] Key Laboratory of Cluster and Grid Computing, Huazhong University of Science and Technology, Wuhan

Source:

Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2021 / Vol. 58 / No. 03

Funding:

National Natural Science Foundation of China;

Keywords:

Dataflow architecture; FPGA; Graph processing; High level synthesis; Intermediate representation;

DOI:

10.7544/issn1000-1239.2021.20190679

Abstract:

Graph processing has become one of the mainstream big data applications. For graph applications such as biological networks, social networks, and Web graphs, traditional CPU and GPU architectures suffer from poor performance and high power consumption because of the irregular characteristics of graph algorithms. Specialized hardware acceleration has been shown to significantly improve the performance and energy efficiency of graph processing. However, writing and verifying correct hardware-level code is tedious and time-consuming. General-purpose high-level synthesis (HLS) systems allow users to write applications in high-level languages such as C and automatically translate them into the underlying hardware code, but for irregular graph applications these systems still lack effective support for massive parallelism and memory access, potentially leading to significantly lower performance. In this paper, we propose an effective HLS system for high-performance graph processing. We adopt a dataflow architecture to achieve efficient parallel pipelining while ensuring load balancing. Through the programming primitives we develop, users can quickly customize a vertex-centric graph algorithm and translate it into a modular intermediate representation (IR), which in turn maps to a parameterized hardware template. We build our HLS system on a Xilinx Virtex UltraScale+ XCVU9P. Results on a variety of graph algorithms and datasets show that our HLS system outperforms the state-of-the-art Spatial by 7.9-30.6x. © 2021, Science Press. All rights reserved.

Pages: 467-478

Page count: 11
