Sharing-aware Efficient Private Caching in Many-core Server Processors

Cited by: 2
Authors
Shukla, Sudhanshu [1]
Chaudhuri, Mainak [1]
Affiliations
[1] Indian Inst Technol, Dept Comp Sci & Engn, Kanpur, Uttar Pradesh, India
Keywords
Many-core server processors; Private victim caches; Sharing-aware private caching;
DOI
10.1109/ICCD.2017.85
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
General-purpose cache-coherent many-core server processors are usually designed with a per-core private cache hierarchy and a large shared multi-banked last-level cache (LLC). The round-trip latency and the volume of traffic through the on-die interconnect between the per-core private cache hierarchy and the shared LLC banks can be large. As a result, optimized private caching is important in such architectures. Traditionally, the private cache hierarchy in these processors treats private and shared blocks equally. We, however, observe that eliminating all non-compulsory non-coherence core cache misses to a small subset of shared code and data blocks can save a large fraction of the core requests to the LLC, indicating a large potential for reducing the interconnect traffic in such architectures. We architect a specialized exclusive per-core private L2 cache which serves as a victim cache for the per-core private L1 cache. The proposed victim cache selectively captures a subset of the L1 cache victims. Our best selective victim caching proposal is driven by an online partitioning of the L1 cache victims based on two distinct features, namely, an estimate of sharing degree and an indirect simple estimate of reuse distance. Our proposal learns the collective reuse probability of the blocks in each partition on-the-fly and decides the victim caching candidates based on these probability estimates. Detailed simulation results on a 128-core system running a selected set of multi-threaded commercial and scientific computing applications show that our best victim cache design proposal at 64 KB capacity, on average, saves 44.1% of the core cache miss requests sent to the LLC and 10.6% of the execution cycles compared to a baseline system that has no private L2 cache.
In contrast, a traditional 128 KB non-inclusive LRU L2 cache saves 42.2% of the core cache misses sent to the LLC compared to the same baseline while performing slightly worse than the proposed 64 KB victim cache. In summary, our proposal outperforms the traditional design and enjoys lower interconnect traffic while halving the space investment for the per-core private L2 cache. Further, the savings in core cache misses achieved due to the introduction of the proposed victim cache are observed to be only 8% less than those of an optimal victim cache design at the 32 KB and 64 KB capacity points.
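The selection idea in the abstract (partition the L1 victims by two features, learn a reuse probability per partition, and admit only victims from partitions that look worth caching) can be illustrated with a small sketch. This is not the authors' design: the class name `SelectiveVictimFilter`, the bucket boundaries, the use of an L1 hit count as the indirect reuse-distance estimate, and the 0.25 admission threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


class SelectiveVictimFilter:
    """Hypothetical per-partition reuse-probability estimator for L1 victims."""

    def __init__(self, threshold=0.25):
        # Assumed admission threshold; the abstract does not give one.
        self.threshold = threshold
        # Per-partition counters: victims placed in the victim cache,
        # and how many of those were hit again before leaving it.
        self.fills = defaultdict(int)
        self.hits = defaultdict(int)

    @staticmethod
    def partition(sharer_count, l1_hits):
        """Map a victim to a partition from two coarse features (assumed buckets)."""
        sharing_bucket = min(sharer_count, 3)  # 1, 2, or 3+ sharers
        reuse_bucket = min(l1_hits, 2)         # 0, 1, or 2+ hits while in L1
        return (sharing_bucket, reuse_bucket)

    def reuse_probability(self, part):
        """Fraction of this partition's cached victims that were reused."""
        fills = self.fills[part]
        # Unobserved partitions are treated optimistically so they get a chance.
        return self.hits[part] / fills if fills else 1.0

    def should_cache(self, part):
        """Admit a victim only if its partition's reuse estimate is high enough."""
        return self.reuse_probability(part) >= self.threshold

    def record_fill(self, part):
        self.fills[part] += 1

    def record_hit(self, part):
        self.hits[part] += 1


f = SelectiveVictimFilter(threshold=0.25)
shared_hot = f.partition(sharer_count=4, l1_hits=3)    # partition (3, 2)
private_cold = f.partition(sharer_count=1, l1_hits=0)  # partition (1, 0)
for _ in range(10):
    f.record_fill(shared_hot)
    f.record_fill(private_cold)
for _ in range(6):
    f.record_hit(shared_hot)
f.record_hit(private_cold)
print(f.should_cache(shared_hot), f.should_cache(private_cold))  # prints: True False
```

One limitation of this sketch is that a partition that falls below the threshold stops being filled and so stops being observed; a practical design would presumably need some form of sampling or counter decay to keep the estimates current, which is omitted here.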
Pages: 485-492
Page count: 8