Optimizing the LU Factorization for Energy Efficiency on a Many-Core Architecture

Cited by: 4
Authors
Garcia, Elkin [1]
Arteaga, Jaime [1]
Pavel, Robert [1]
Gao, Guang R. [1]
Affiliation
[1] Univ Delaware, Dept Elect & Comp Engn, CAPSL, Newark, DE 19716 USA
Source
LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2013 | 2014 / Vol. 8664
Keywords
OPTIMIZATION; MODEL
DOI
10.1007/978-3-319-09967-5_14
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Power consumption and energy efficiency have become a major bottleneck in the design of new systems for high-performance computing. The path to exascale computing requires new strategies that decrease the energy consumption of modern many-core architectures without sacrificing scalability or performance. Developing these strategies demands scalable models of energy consumption and a reorientation of optimization techniques toward energy efficiency, with their trade-offs against performance evaluated explicitly. In this paper, we investigate several optimization techniques for reducing the energy consumption of many-core architectures with a software-managed memory hierarchy. We study the impact of these techniques on the Static Energy and the Dynamic Energy of the LU factorization benchmark using a scalable energy consumption model. The main contributions of this paper are: (1) the modeling and analysis of energy consumption and energy efficiency for LU factorization; (2) the study and design of instruction-level and task-level optimizations for the reduction of the Static and Dynamic Energy; (3) the design and implementation of an energy-aware tiling that decreases the Dynamic Energy of power-hungry instructions in the LU factorization benchmark; and (4) the experimental evaluation of the scalability and improvement in energy consumption and power efficiency of the proposed optimizations on the IBM Cyclops-64 many-core architecture. We study the trade-offs between performance and power efficiency for the proposed optimizations. Our results for the LU factorization benchmark, using 156 hardware thread units, show an improvement in power efficiency of between 1.68X and 4.87X for different matrix sizes. In addition, we point out examples of optimizations that scale in performance but not necessarily in power efficiency.
Pages: 237-251
Number of pages: 15