CVR: Efficient Vectorization of SpMV on X86 Processors

被引：53

作者：

Xie, Biwei ^{[1
]}

Zhan, Jianfeng ^{[1
]}

Liu, Xu ^{[2
]}

Gao, Wanling ^{[3
]}

Jia, Zhen ^{[4
]}

He, Xiwen ^{[5
]}

Zhang, Lixin ^{[5
]}

机构：

[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing, Peoples R China

[2] Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23187 USA

[3] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[4] Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA

[5] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

来源：

PROCEEDINGS OF THE 2018 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO'18) | 2018年

基金：

国家自然科学基金重大项目; 美国国家科学基金会;

关键词：

SpMV; Xeon Phi; SIMD; Vectorization;

D O I：

10.1145/3168818

中图分类号：

TP31 [计算机软件];

学科分类号：

081202 ; 0835 ;

摘要：

Sparse Matrix-vector Multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits SpMV's parallelism with vectorization operations. Existing work achieves limited locality and vectorization efficiency with large preprocessing overheads. To address this issue, we present the Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. The CVR simultaneously processes multiple rows within the input matrix to increase cache efficiency and separates them into multiple SIMD lanes so as to take the advantage of vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of SpMV, and thus able to deal with various scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches using 58 scale-free and HPC sparse matrices. Experimental results show that CVR can achieve a speedup up to 1.70 x (1.33x on average) and a speedup up to 1.57x (1.10x on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead compared with state-of-the-art approaches.

引用

页码：149 / 162

页数：14

共 50 条

[1] X86 PROCESSORS
Dipert, Brian
EDN, 2010, 55 (03) : 19 - +
[2] Efficient SIMD Optimization of HEVC Encoder over X86 Processors
Chen, Keji
Duan, Yizhou
Yan, Leju
Sun, Jun
Guo, Zongming
2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2012,
[3] Fast Concurrent Queues for x86 Processors
Morrison, Adam
Afek, Yehuda
ACM SIGPLAN NOTICES, 2013, 48 (08) : 103 - 112
[4] A SUCCESSFUL DESIGN METHODOLOGY FOR X86 PROCESSORS
不详
ELECTRONIC ENGINEERING, 1993, 65 (801): : S39 - S41
[5] Optimizing precision overhead for x86 processors
Ogasawara, T
Komatsu, H
Nakatani, T
SOFTWARE-PRACTICE & EXPERIENCE, 2004, 34 (09): : 875 - 893
[6] MPTLsim: A Simulator for X86 Multicore Processors
Zeng, Hui
Yourst, Matt
Ghose, Kanad
Ponomarev, Dmitry
DAC: 2009 46TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2, 2009, : 226 - 231
[7] Optimizing precision overhead for x86 processors
Ogasawara, T
Komatsu, H
Nakatani, T
USENIX ASSOCIATION PROCEEDINGS OF THE 2ND JAVA(TM) VIRTUAL MACHINE RESEARCH AND TECHNOLOGY SYMPOSIUM, 2002, : 41 - 50
[8] A Statistical Approach to Power Estimation for x86 Processors
Chadha, Mohak
Ilsche, Thomas
Bielert, Mario
Nagel, Wolfgang E.
2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 1012 - 1019
[9] Investigating Cache Parameters of x86 Family Processors
Babka, Vlastimil
Tuma, Petr
COMPUTER PERFORMANCE EVALUATION AND BENCHMARKING, PROCEEDINGS, 2009, 5419 : 77 - 96
[10] DCUIP Poisoning Attack in Intel x86 Processors
Shin, Youngjoo
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (08) : 1386 - 1390

← 1 2 3 4 5 →