Compiler-Assisted Compaction/Restoration of SIMD Instructions

Cited by: 1
Authors
Cebrian, Juan M. [1]
Balem, Thibaud [2]
Barredo, Adrian [3]
Casas, Marc [3]
Moreto, Miquel [3]
Ros, Alberto [1]
Jimborean, Alexandra [1]
Affiliations
[1] Univ Murcia, Comp Engn Dept, E-30100 Murcia, Spain
[2] ENS Rennes, F-35170 Rennes, France
[3] Barcelona Supercomp Ctr, Barcelona 08034, Spain
Funding
European Research Council; European Union Seventh Framework Programme
Keywords
Registers; Parallel processing; Hardware; Computer architecture; Out of order; Delays; Energy consumption; SIMD; predication; LLVM; density-time performance;
DOI
10.1109/TPDS.2021.3091015
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Vector processors (e.g., SIMD or GPUs) are ubiquitous in high-performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution pose significant challenges. Among these challenges, control flow divergence is one of the main performance-limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption of predicated codes are usually insensitive to the number of active elements in a predicate mask. As vector register sizes continue to grow, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction in consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After the dense instructions execute, their results are restored to the original instructions. However, CR incurs a significant performance and energy penalty when it fails to find active elements, either due to a lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements in scenarios where it would fail to form dense instructions.
Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications.
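The compaction and restoration steps summarized in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `compact` and `execute_and_restore`, the list-based vectors, and the lane width are hypothetical and do not come from the paper, which describes a hardware mechanism guided by compiler-inserted instructions. The sketch also flushes a partly filled final group immediately, so it does not model the waiting penalty of baseline CR or the compiler hints of CACR.

```python
from typing import Callable, List, Tuple

Origin = Tuple[int, int]  # (instruction instance index, lane index)


def compact(vectors: List[List[int]], masks: List[List[bool]],
            width: int) -> List[List[Tuple[int, Origin]]]:
    """Gather active lanes from several predicated instances of the same
    instruction into groups of at most `width` elements (hypothetical helper)."""
    active = [
        (vec[lane], (inst, lane))
        for inst, (vec, mask) in enumerate(zip(vectors, masks))
        for lane in range(width)
        if mask[lane]
    ]
    # Each slice stands for one dense instruction; the last one may be partial.
    return [active[k:k + width] for k in range(0, len(active), width)]


def execute_and_restore(vectors: List[List[int]], masks: List[List[bool]],
                        width: int, op: Callable[[int], int]):
    """Apply `op` to the compacted elements, then scatter each result back to
    the instance and lane it came from. Inactive lanes keep their old value."""
    results = [list(vec) for vec in vectors]
    dense_ops = 0
    for group in compact(vectors, masks, width):
        dense_ops += 1
        for value, (inst, lane) in group:
            results[inst][lane] = op(value)
    return results, dense_ops


# Four instances of one predicated instruction from consecutive iterations.
vectors = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
masks = [[True, False, False, False],
         [False, True, False, True],
         [False, False, False, False],
         [True, False, True, False]]
results, dense_ops = execute_and_restore(vectors, masks, 4, lambda x: x * 10)
```

In this example five lanes are active across four sparse instances, so the model issues two compacted operations instead of four predicated ones, and the restored results match what per-instance predicated execution would have produced.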
Pages: 779-791
Page count: 13