Exploiting Parallelism and Vectorisation in Breadth-First Search for the Intel Xeon Phi

Cited by: 2

Authors:

Paredes, Mireya [1]

Riley, Graham [1]

Lujan, Mikel [1]

Affiliations:

[1] Univ Manchester, Sch Comp Sci, Kilburn Bldg, Oxford Rd, Manchester M13 9PL, Lancs, England

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2020 / Vol. 31 / Issue 01

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

Breadth-first search; graph algorithms; hybrid BFS; vectorisation; parallel architecture; Graph 500; Xeon Phi;

DOI:

10.1109/TPDS.2019.2927451

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Subject classification code:

081202;

Abstract:

Modern applications generate massive amounts of data that is challenging to process or analyse. Graph algorithms have emerged as a solution for the analysis of such data because they can represent the entities participating in the generation of large-scale datasets in terms of vertices and their relationships in terms of edges. Graph analysis algorithms are used for finding patterns within these relationships, aiming to extract information to be further analysed. The breadth-first search (BFS) is one of the main graph search algorithms used for graph analysis and its optimisation has been widely researched using different parallel computers. However, the parallelisation of BFS has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. This paper investigates the optimisation of the BFS on the Xeon Phi (Knights Corner), a modern parallel architecture provided with an advanced vector processor supporting the AVX-512 instruction set, using a bespoke development framework integrated with the Graph 500 benchmark. In addition, to demonstrate portability, we show results for a direct port of the algorithms to a more recent version of the Xeon Phi (Knights Landing) and to a Skylake CPU which supports most of the AVX-512 instruction set. Optimised parallel versions of two high-level algorithms for BFS were created using vectorisation, starting with the conventional top-down BFS algorithm and, building on this, a hybrid BFS algorithm. On the KNC our best implementations result in speedups of 1.37x (top-down) and 1.37x (hybrid), for a one million vertices graph, compared to the state-of-the-art. On the KNL and Skylake, the performance is higher than on KNC. In addition, we show results of our best hybrid algorithm on real-world graphs from the SNAP datasets with speedups up to 1.3x on KNC. Performance on KNL and Skylake is again higher, demonstrating the robustness and portability of our algorithm. The hybrid BFS algorithm can be further used to speed up other graph analysis algorithms and the lessons learned from vectorisation can be applied to other algorithms targeting existing and future models of the Xeon Phi and other advanced vector architectures.


Pages: 111-128

Page count: 18

Related literature (50 items in total)

[1] Breadth First Search Vectorization on the Intel Xeon Phi
Paredes, Mireya
Riley, Graham
Lujan, Mikel
PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS (CF'16), 2016, : 1 - 10
[2] FlexBFS: A Parallelism-aware Implementation of Breadth-First Search on GPU
Liu, Gu
An, Hong
Han, Wenting
Li, Xiaoqiang
Sun, Tao
Zhou, Wei
Wei, Xuechao
Tang, Xulong
ACM SIGPLAN NOTICES, 2012, 47 (08) : 279 - 280
[3] Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU
Toedling, Dominik
Winter, Martin
Steinberger, Markus
2019 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2019,
[4] Breadth-First Search with A Multi-Core Computer
Belova, Maryia
Ouyang, Ming
2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 579 - 587
[5] Direction-optimizing breadth-first search
Beamer, Scott
Asanovic, Krste
Patterson, David
SCIENTIFIC PROGRAMMING, 2013, 21 (3-4) : 137 - 148
[6] Efficient Breadth-First Reduct Search
Boonjing, Veera
Chanvarasuth, Pisit
MATHEMATICS, 2020, 8 (05)
[7] Load-Balanced Breadth-First Search on GPUs
Zhu, Zhe
Li, Jianjun
Li, Guohui
WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 435 - 447
[8] Parallelizability of the stack breadth-first search problem
Nakashima, T
Fujiwara, A
PDPTA'2001: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, 2001, : 722 - 727
[9] A parallel algorithm for the stack breadth-first search
Nakashima, T
Fujiwara, A
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2002, E85D (12) : 1955 - 1958
[10] Efficient distributed breadth-first search algorithm
Makki, SAM
COMPUTER COMMUNICATIONS, 1996, 19 (08) : 628 - 636
