Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing

被引：17

作者：

Balaji, Vignesh ^{[1
]}

Lucia, Brandon ^{[1
]}

机构：

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

来源：

HPDC'19: PROCEEDINGS OF THE 28TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING | 2019年

基金：

美国国家科学基金会;

关键词：

Graph Processing; Atomics; Locality; Power-law; Data Duplication; FRAMEWORK;

D O I：

10.1145/3307681.3326609

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

Performance of single-machine, shared memory graph processing is affected by expensive atomic updates and poor cache locality. Data duplication, a popular approach to eliminate atomic updates by creating thread-local copies of shared data, incurs extreme memory overheads due to the large sizes of typical input graphs. Even memory-efficient duplication strategies that exploit the power-law structure common to many graphs (by duplicating only the highly-connected "hub" vertices) suffer from overheads for having to dynamically identify the hub vertices. Degree Sorting, a popular graph reordering technique that re-assigns hub vertices consecutive IDs in a bid to improve spatial locality, is effective for single-threaded graph applications but suffers from increased false sharing in parallel executions. The main insight of this work is that the combination of data duplication and Degree Sorting eliminates the overheads of each optimization. Degree Sorting improves the efficiency of data duplication by assigning hub vertices consecutive IDs which enables easy identification of the hub vertices. Additionally, duplicating the hub vertex data eliminates false sharing in Degree Sorting since each thread updates its local copy of the hub vertex data. We evaluate this mutually-enabling combination of power-law-specific data duplication and Degree Sorting in a system called RADAR. RADAR improves performance by eliminating atomic updates for hub vertices and improving the cache locality of graph applications, providing speedups of up to 165x (1.88x on average) across different graph applications and input graphs.

引用

页码：133 / 144

页数：12

共 36 条

[1]

[Anonymous], 26 INT S HIGH PERF P

[2] Rabbit Order: Just-in-time Parallel Reordering for Fast Graph Analysis [J].

Arai, Junya ;

Shiokawa, Hiroaki ;

Yamamuro, Takeshi ;

Onizuka, Makoto ;

Iwamura, Sotetsu .

2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016, :22-31

[3]

Balaji V, 2018, I S WORKL CHAR PROC, P203, DOI 10.1109/IISWC.2018.8573478

[4] Reducing Pagerank Communication via Propagation Blocking [J].

Beamer, Scott ;

Asanovic, Krste ;

Patterson, David .

2017 31ST IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2017, :820-831

[5] Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server [J].

Beamer, Scott ;

Asanovic, Krste ;

Patterson, David .

2015 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC), 2015, :56-65

[6]

Besta Maciej, 2015, P 24 INT S HIGH PERF, P161, DOI [DOI 10.1145/2749246.2749263, 10.1145/2749246.2749263]

[7]

Boldi P., 2011, P 20 INT C WORLD WID, P587, DOI 10.1145/1963405.1963488.

[8]

BROOKS G, 1992, SIGPLAN NOTICES, V27, P1, DOI [10.13334/j.0258-8013.pcsee.213043, 10.1145/143103.143108]

[9]

Buono D., 2016, Proceedings of the 2016 International Conference on Supercomputing, P37

[10]

Chen R., 2015, P 10 EUR C COMP SYST, P1, DOI 10.1145/2741948.2741970

← 1 2 3 4 →