A Reliable Routing Architecture and Algorithm for NoCs

Cited by: 62
Authors
DeOrio, Andrew [1]
Fick, David [1]
Bertacco, Valeria [1]
Sylvester, Dennis [1]
Blaauw, David [1]
Hu, Jin [1]
Chen, Gregory [2]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Intel, High Performance Circuits Res Grp, Hillsboro, OR 97124 USA
Funding
U.S. National Science Foundation
Keywords
Fault tolerance; hard faults; networks-on-chip (NoCs); reconfiguration; reliability; routing algorithms; FAULT; NETWORK; RECOVERY; MESH;
DOI
10.1109/TCAD.2011.2181509
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Aggressive transistor scaling continues to drive increasingly complex digital designs. The large number of transistors available today enables the development of chip multiprocessors that include many cores on one die communicating through an on-chip interconnect. As the number of cores increases, scalable communication platforms, such as networks-on-chip (NoCs), have become more popular. However, as the sole communication medium, these interconnects are a single point of failure so that any permanent fault in the NoC can cause the entire system to fail. Compounding the problem, transistors have become increasingly susceptible to wear-out related failures as their critical dimensions shrink. As a result, the on-chip network has become a critically exposed unit that must be protected. To this end, we present Vicis, a fault-tolerant architecture and companion routing protocol that is robust to a large number of permanent failures, allowing communication to continue in the face of permanent transistor failures. Vicis makes use of a two-level approach. First, it attempts to work around errors within a router by leveraging reconfigurable architectural components. Second, when faults within a router disable a link's connectivity, or even an entire router, Vicis reroutes around the faulty node or link with a novel, distributed routing algorithm for meshes and tori. Tolerating permanent faults in both the router components and the reliability hardware itself, Vicis enables graceful performance degradation of networks-on-chip.
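The second level described in the abstract, rerouting around a faulty node or link, can be illustrated with a minimal sketch. This is not the Vicis algorithm: Vicis is distributed and also covers tori, whereas the sketch below is a centralized breadth-first search on a 2D mesh with no deadlock handling. The function names, the coordinate scheme, and the representation of a link as a frozenset of its two endpoints are all assumptions made for illustration.

```python
from collections import deque


def build_next_hop_table(width, height, dest,
                         faulty_links=frozenset(),
                         faulty_routers=frozenset()):
    """Return {router: next hop toward dest} using only healthy parts.

    Hypothetical helper: routers are (x, y) tuples on a width x height mesh,
    and each bidirectional link is a frozenset of its two endpoints.
    """
    def neighbors(node):
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                yield (nx, ny)

    next_hop = {}
    if dest in faulty_routers:
        return next_hop  # nothing can reach a dead destination

    # Breadth-first search outward from the destination, so the first
    # time a router is reached it is reached along a shortest healthy path.
    visited = {dest}
    queue = deque([dest])
    while queue:
        current = queue.popleft()
        for nb in neighbors(current):
            if nb in visited or nb in faulty_routers:
                continue
            if frozenset((current, nb)) in faulty_links:
                continue
            visited.add(nb)
            next_hop[nb] = current
            queue.append(nb)
    return next_hop


def route(src, dest, next_hop):
    """Follow the table from src to dest; return None if src is cut off."""
    path = [src]
    while path[-1] != dest:
        if path[-1] not in next_hop:
            return None
        path.append(next_hop[path[-1]])
    return path


if __name__ == "__main__":
    # 3x3 mesh with router (1, 0) failed: traffic detours through row y = 1.
    table = build_next_hop_table(3, 3, dest=(2, 0), faulty_routers={(1, 0)})
    print(route((0, 0), (2, 0), table))
```

Routers that the search never reaches have no table entry, so `route` reports them as disconnected instead of looping. That mirrors the graceful degradation goal stated in the abstract only loosely, since a real NoC would compute these tables in hardware and in a distributed fashion.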
Pages: 726-739
Number of pages: 14