A Specialized Low-Cost Vectorized Loop Buffer for Embedded Processors

Cited by: 0
Authors
Huang, Libo [1]
Wang, Zhiying [1]
Shen, Li [1]
Lu, Hongyi [1]
Xiao, Nong [1]
Liu, Cong [1]
Affiliations
[1] Natl Univ Def Technol, Sch Comp, Changsha 400073, Hunan, Peoples R China
Source
2011 DESIGN, AUTOMATION & TEST IN EUROPE (DATE) | 2011
Funding
National Natural Science Foundation of China
Keywords
PERFORMANCE; CACHE; POWER;
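DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract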
Loop buffers have so far been explored mainly as an effective architectural technique for low-power execution in embedded processors. Another avenue, however, is to exploit the loop buffer for its performance benefit. In this paper, we propose an application-specific loop buffer organization for vectorized processing kernels to achieve both low-power and high-performance goals. The vectorized loop buffer (VLB) is simplified to support a single loop for SIMD devices. Since using SIMD capabilities requires significant data rearrangement overhead, the VLB is specialized for zero-overhead implicit data permutation. We extend the baseline ISA with several instructions for programming the VLB and integrate it into an embedded processor for evaluation. Our results show that the VLB significantly improves performance and power measures compared to conventional SIMD devices.
Pages: 1200-1203
Page count: 4
Related papers
12 in total
  • [1] Effective hardware-based two-way loop cache for high performance low power processors
    Anderson, T
    Agarwala, S
    [J]. 2000 IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN: VLSI IN COMPUTERS & PROCESSORS, PROCEEDINGS, 2000, : 403 - 407
  • [2] Instruction buffering to reduce power in processors for signal processing
    Bajwa, RS
    Hiraki, M
    Kojima, H
    Gorny, DJ
    Nitta, K
    Shridhar, A
    Seki, K
    Sasaki, K
    [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 1997, 5 (04) : 417 - 424
  • [3] Vectorization for SIMD Architectures with alignment constraints
    Eichenberger, AE
    Wu, P
    O'Brien, K
    [J]. ACM SIGPLAN NOTICES, 2004, 39 (06) : 82 - 93
  • [4] Synergistic processing in Cell's multicore architecture
    Gschwind, M
    Hofstee, HP
    Flachs, B
    Watanabe, Y
    Yamazaki, T
    [J]. IEEE MICRO, 2006, 26 (02) : 10 - 24
  • [5] Hajj NBI, 1998, 1998 INTERNATIONAL SYMPOSIUM ON LOW POWER ELECTRONICS AND DESIGN - PROCEEDINGS, P70, DOI 10.1109/LPE.1998.708158
  • [6] Huang LB, 2010, INT S HIGH PERF COMP, P355
  • [7] The filter cache: An energy efficient memory structure
    Kin, J
    Gupta, M
    Mangione-Smith, WH
    [J]. THIRTIETH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, PROCEEDINGS, 1997, : 184 - 193
  • [8] Naishlos Dorit., 2003, PROC INT C COMPILERS, P2
  • [9] Measuring the performance of multimedia instruction sets
    Slingerland, N
    Smith, AJ
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2002, 51 (11) : 1317 - 1332
  • [10] A simple video format for mobile applications
    Smith, JR
    Miao, ZR
    Li, CS
    [J]. IMAGE AND VIDEO COMMUNICATIONS AND PROCESSING 2000, 2000, 3974 : 260 - 269