Path-Based Partitioning Methods for 3D Networks-on-Chip with Minimal Adaptive Routing

Times cited: 51

Authors:

Ebrahimi, Masoumeh [1]

Daneshtalab, Masoud [1]

Liljeberg, Pasi [1]

Plosila, Juha [1]

Flich, Jose [2]

Tenhunen, Hannu [1]

Affiliations:

[1] Univ Turku, Dept Informat Technol, FIN-20520 Turku, Finland

[2] Univ Politecn Valencia, Escuela Tecn Super Ingn Informat, Dept Informat Sistemas & Comp, E-46071 Valencia, Spain

Source:

IEEE TRANSACTIONS ON COMPUTERS | 2014 / Vol. 63 / No. 3

Keywords:

3D Networks-on-Chip; unicast and multicast communication; partitioning methods; analytical models; adaptive routing algorithm; PERFORMANCE; SYSTEMS; DESIGN

DOI:

10.1109/TC.2012.255

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Combining the benefits of 3D ICs and Network-on-Chip (NoC) schemes provides a significant performance gain in Chip Multiprocessor (CMP) architectures. Because multicast communication is commonly used in cache coherence protocols for CMPs and in various parallel applications, the performance of these systems can be significantly improved if multicast operations are supported at the hardware level. In this paper, we present several partitioning methods for the path-based multicast approach in 3D mesh-based NoCs, each with a different level of efficiency. In addition, we develop novel analytical models for unicast and multicast traffic to explore the efficiency of each approach. To distribute unicast and multicast traffic more efficiently over the network, we propose the Minimal and Adaptive Routing (MAR) algorithm for the presented partitioning methods. The analytical and experimental results show that one method, named Recursive Partitioning (RP), outperforms the other approaches. RP recursively partitions the network until all partitions contain a comparable number of switches; as a result, multicast traffic is distributed evenly among several subsets and network latency is considerably reduced. The simulation results show that the RP method improves performance across all workloads, and that performance can be improved further by using the MAR algorithm. Average and maximum latency reductions of 19 percent and 42 percent, respectively, are obtained on the SPLASH-2 and PARSEC benchmarks running on a 64-core CMP.
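The recursive partitioning idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' RP method: the splitting rule (halve the longest dimension of an axis-aligned block of switches), the threshold `max_switches`, and the names `Region`, `recursive_partition`, and `group_destinations` are hypothetical choices made only for illustration. The sketch splits a 3D mesh until every block holds at most `max_switches` switches. It then groups multicast destinations by block, so that each non-empty group could be covered by one path-based multicast message.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a switch in the 3D mesh


@dataclass(frozen=True)
class Region:
    """Axis-aligned block of switches: lo[d] <= c[d] < hi[d] in each dimension d."""
    lo: Coord
    hi: Coord

    def size(self) -> int:
        # Number of switches in the block.
        return ((self.hi[0] - self.lo[0])
                * (self.hi[1] - self.lo[1])
                * (self.hi[2] - self.lo[2]))

    def contains(self, c: Coord) -> bool:
        return all(self.lo[d] <= c[d] < self.hi[d] for d in range(3))


def recursive_partition(region: Region, max_switches: int) -> List[Region]:
    """Hypothetical rule: halve the longest dimension until every block has
    at most max_switches switches, so the blocks end up comparable in size."""
    if max_switches < 1:
        raise ValueError("max_switches must be at least 1")
    if region.size() <= max_switches:
        return [region]
    extents = [region.hi[d] - region.lo[d] for d in range(3)]
    d = max(range(3), key=lambda i: extents[i])  # longest dimension (first on ties)
    mid = region.lo[d] + extents[d] // 2         # extent >= 2 here, so both halves are non-empty
    left_hi = list(region.hi)
    left_hi[d] = mid
    right_lo = list(region.lo)
    right_lo[d] = mid
    left = Region(region.lo, tuple(left_hi))
    right = Region(tuple(right_lo), region.hi)
    return recursive_partition(left, max_switches) + recursive_partition(right, max_switches)


def group_destinations(dests: Sequence[Coord],
                       regions: Sequence[Region]) -> List[List[Coord]]:
    """Assign each destination to the block containing it; each non-empty
    group would be served by one path-based multicast message."""
    groups: List[List[Coord]] = [[] for _ in regions]
    for c in dests:
        for i, r in enumerate(regions):
            if r.contains(c):
                groups[i].append(c)
                break
        else:
            raise ValueError(f"destination {c} lies outside every region")
    return [g for g in groups if g]


mesh = Region((0, 0, 0), (4, 4, 4))  # 4x4x4 mesh, 64 switches
parts = recursive_partition(mesh, max_switches=16)
print([p.size() for p in parts])  # [16, 16, 16, 16]
print(group_destinations([(0, 0, 0), (3, 3, 3), (1, 1, 2)], parts))
# [[(0, 0, 0), (1, 1, 2)], [(3, 3, 3)]]
```

Halving the longest dimension keeps each block roughly cubic. This is one simple way to reach partitions with a comparable number of switches. The real RP method may choose partitions differently, for example relative to the source switch.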

Pages: 718-733

Page count: 16
