FP-NUCA: A Fast NOC Layer for Implementing Large NUCA Caches

Cited by: 12

Authors:

Arora, Anuj [1]

Harne, Mayur [3]

Sultan, Hameedah [2]

Bagaria, Akriti [1]

Sarangi, Smruti R. [1]

Affiliations:

[1] Indian Inst Technol, Dept Comp Sci & Engn, New Delhi 110016, India

[2] Indian Inst Technol, Dept Elect Engn, New Delhi 110016, India

[3] NVIDIA Inc, Panchshil Tech Pk, Pune 411005, Maharashtra, India

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2015 / Vol. 26 / Issue 09

Keywords:

NUCA caches; freeze router; bank prediction; EXPRESS VIRTUAL CHANNELS; REPLICATION; MODEL;

DOI:

10.1109/TPDS.2014.2358231

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

NUCA caches have traditionally been proposed as a solution for mitigating wire delays, and delays introduced due to complex networks on chip. Traditional approaches have reported significant performance gains with intelligent block placement, location, replication, and migration schemes. In this paper, we propose a novel approach in this space, called FP-NUCA. It differs from conventional approaches, and relies on a novel method of co-designing the last level cache and the network on chip. We artificially constrain the communication pattern in the NUCA cache such that all the messages travel along a few predefined paths (fast paths) for each set of banks. We leverage this communication pattern by designing a new type of NOC router called the Freeze router, which augments a regular router by adding a layer of circuitry that gates the clock of the regular router when there is a fast path message waiting to be transmitted. Messages along the fast path do not require buffering, switching, or routing. We incorporate a bank predictor with our novel NOC for reducing the number of messages, and resultant energy consumption. We compare our performance with state-of-the-art protocols, and report speedups of up to 31 percent (mean: 6.3 percent), and ED^2 reduction of up to 46 percent (mean: 10.4 percent) for a suite of Splash and Parsec benchmarks. We implement the Freeze router in VHDL and show that the additional fast path logic has minimal area and timing overheads.
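The clock-gating idea in the abstract can be illustrated with a toy, cycle-level sketch. This is not the authors' VHDL design: the class name `FreezeRouterSketch`, the methods `enqueue` and `step`, and the one-message-per-cycle behavior are all hypothetical choices made for illustration. The only property taken from the abstract is that a waiting fast path message freezes the regular router and passes through without buffering, switching, or routing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FreezeRouterSketch:
    """Toy model (hypothetical): a regular router wrapped by a fast path layer."""

    buffer: deque = field(default_factory=deque)  # regular router's input buffer
    frozen_cycles: int = 0                        # cycles with the regular clock gated

    def enqueue(self, msg):
        """Place a regular (non fast path) message in the input buffer."""
        self.buffer.append(msg)

    def step(self, fast_path_msg=None):
        """Advance one cycle; return (forwarded message, path used)."""
        if fast_path_msg is not None:
            # Assumption: the fast path layer gates the regular router's clock,
            # so buffered messages make no progress during this cycle.
            self.frozen_cycles += 1
            return fast_path_msg, "fast"
        if self.buffer:
            # No fast path traffic: the regular router forwards one message.
            return self.buffer.popleft(), "regular"
        return None, "idle"


router = FreezeRouterSketch()
router.enqueue("req-A")
router.enqueue("req-B")
trace = [router.step(fast_path_msg="fp-1"), router.step(), router.step(), router.step()]
print(trace)
```

In this sketch the fast path message leaves in the first cycle while both buffered requests wait, which is the priority relationship the abstract describes; real routers would also model virtual channels, arbitration, and link traversal, all of which are omitted here.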


Pages: 2465-2478

Page count: 14

