Performance portability in a real world application: PHAST applied to Caffe

Cited by: 2
Authors
Antonio Martinez, Pablo [1]
Peccerillo, Biagio [2]
Bartolini, Sandro [2]
Garcia, Jose M. [1]
Bernabe, Gregorio [1]
Affiliations
[1] Univ Murcia, Comp Engn Dept, Murcia, Spain
[2] Univ Siena, Dept Informat Engn & Math, Siena, Italy
Keywords
High-performance computing; performance portability; heterogeneous computing; machine learning
DOI
10.1177/10943420221077107
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This work covers the application of the PHAST Library, a hardware-agnostic programming library, to a real-world application: the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another to run on GPUs. With PHAST, we aim to develop a single-source implementation capable of running efficiently on both CPU and GPU. In this paper, we start by analyzing the performance of a basic implementation of Caffe using PHAST. Then, we detail possible performance improvements. We find that the overall performance is dominated by a few 'heavy' layers. In refining the inefficient parts of this version, we identify two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translate into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an amount of time equivalent to that of native code on both CPU and GPU. Furthermore, with the CIFAR-10 database, PHAST achieves speedups of 51% and 49% over native code on CPU and GPU, respectively. These results open a new horizon for software development in the upcoming heterogeneous computing era.
Pages: 419-439
Number of pages: 21