A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

Cited by: 14
Authors
Dang, Khanh N. [1]
Meyer, Michael [1]
Okuyama, Yuichi [1]
Ben Abdallah, Abderazek [1]
Affiliations
[1] Univ Aizu, Grad Sch Comp Sci & Engn, Adapt Syst Lab, Aizu Wakamatsu, Fukushima 9658580, Japan
Keywords
3D NoCs; Fault-tolerance; Soft-hard faults; Reliability; Architecture; Design; ROUTING ALGORITHM; NETWORKS
DOI
10.1007/s11227-016-1951-0
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
Pages: 2705-2729
Page count: 25