Performance Modeling of Atomic Additions on GPU Scratchpad Memory

Cited by: 24

Authors:

Gomez-Luna, Juan [1]

Maria Gonzalez-Linares, Jose [2]

Benavides Benitez, Jose Ignacio [1]

Guil Mata, Nicolas [2]

Affiliations:

[1] Univ Cordoba, Dept Comp Architecture & Elect, E-14071 Cordoba, Spain

[2] Univ Malaga, ETSI Informat, Dept Arquitectura Comp, E-29071 Malaga, Spain

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2013 / Vol. 24 / No. 11

Keywords:

Performance model; atomic operations; shared memory; K-means; histogram; CUDA; GPU

DOI:

10.1109/TPDS.2012.319

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

GPU application implementations using scatter approaches will fall into write contention due to atomic updates of output elements, if these result from more than one input element. Colliding threads will be serialized, seriously harming performance. Dealing with these issues requires a proper understanding of the behavior of the scratchpad or shared memory under conflicting accesses caused by concurrent threads. Thus, this paper presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput. This analysis has led us to discover the lock mechanism that enables atomic updates to shared memory and to propose a performance model to estimate the latency penalties due to collisions by position or bank conflicts. Then, we have derived experiments from this model that show us the way to optimize applications using atomic operations. Position and bank conflicts can be diminished by replication and padding, respectively. The benefits of such techniques are illustrated with the optimization of two widely used voting processes: the centroid updating step in k-means clustering, and histogram calculation.


Pages: 2273-2282

Page count: 10
