Efficient Fault-Criticality Analysis for AI Accelerators using a Neural Twin

Cited by: 12
Authors
Chaudhuri, Arjun [1 ]
Chen, Ching-Yuan [1 ]
Talukdar, Jonti [1 ]
Madala, Siddarth [2 ,3 ]
Dubey, Abhishek Kumar [4 ]
Chakrabarty, Krishnendu [1 ]
Affiliations
[1] Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA
[2] Duke Univ, Dept Stat, Durham, NC USA
[3] Duke Univ, Dept Comp Sci, Durham, NC 27706 USA
[4] NCI, Ctr Canc Res, Bethesda, MD 20892 USA
Source
2021 IEEE INTERNATIONAL TEST CONFERENCE (ITC 2021) | 2021
DOI
10.1109/ITC50571.2021.00015
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Owing to the inherent fault tolerance of deep neural network (DNN) models used for classification, many structural faults in the processing elements (PEs) of a systolic array-based AI accelerator are functionally benign. Brute-force fault simulation for determining fault criticality is computationally expensive because of the large number of potential fault sites in the accelerator array and the dependence of PE criticality characterization on the functional input data. Supervised learning techniques can accurately estimate fault criticality, but they require ground truth for model training, and collecting this ground truth involves extensive and computationally expensive fault simulations. We present a framework for analyzing fault criticality with a negligible amount of ground-truth data. We incorporate the gate-level structural and functional information of the PEs in their "neural twins", referred to as "PE-Nets". The PE netlist is translated into a trainable PE-Net, in which the standard-cell instances are substituted by their corresponding "Cell-Nets" and the wires translate to neural connections. Each Cell-Net is a pre-trained DNN that models the Boolean-logic behavior of the corresponding standard cell. In the PE-Net, every neural connection is associated with a bias that represents a perturbation in the signal propagated over that connection. We utilize a recently proposed misclassification-driven training algorithm to sensitize and identify biases that are critical to the functioning of the accelerator for a given application workload. The proposed framework achieves up to 100% accuracy in fault-criticality classification for 16-bit and 32-bit PEs while using the criticality labels of only 2% of the total faults in a PE.
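To make the PE-Net/Cell-Net construction described in the abstract concrete, the following is a minimal, illustrative sketch in PyTorch: a small DNN is pre-trained on the truth table of one standard cell (a 2-input NAND, chosen only for illustration), frozen, and composed into a toy "PE-Net" whose internal wire carries a trainable bias modeling a signal perturbation. All names (CellNet, train_cell_net, TwoCellPENet, wire_bias) and the two-gate netlist are assumptions for exposition; the paper's actual Cell-Net architectures, netlist translation, and misclassification-driven bias-sensitization algorithm are not reproduced here.

# Hedged sketch of the Cell-Net / PE-Net idea; not the authors' implementation.
import torch
import torch.nn as nn

class CellNet(nn.Module):
    """Small DNN that learns the Boolean behavior of one standard cell."""
    def __init__(self, n_inputs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 8), nn.ReLU(),
            nn.Linear(8, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def train_cell_net(truth_table_in, truth_table_out, epochs=2000):
    """Pre-train a Cell-Net on the exhaustive truth table of its cell."""
    cell = CellNet(truth_table_in.shape[1])
    opt = torch.optim.Adam(cell.parameters(), lr=1e-2)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(cell(truth_table_in), truth_table_out)
        loss.backward()
        opt.step()
    for p in cell.parameters():          # freeze after pre-training
        p.requires_grad_(False)
    return cell

class TwoCellPENet(nn.Module):
    """Toy 'PE-Net': two frozen Cell-Nets chained by a wire whose trainable
    bias models a perturbation (a potential fault) on that connection."""
    def __init__(self, nand_cell: CellNet):
        super().__init__()
        self.cell_a = nand_cell
        self.cell_b = nand_cell
        # One trainable bias per internal wire; a bias that training can
        # drive to a value that flips the output would flag a critical site.
        self.wire_bias = nn.Parameter(torch.zeros(1))

    def forward(self, a, b, c):
        y = self.cell_a(torch.cat([a, b], dim=1))       # first NAND
        y = torch.clamp(y + self.wire_bias, 0.0, 1.0)   # perturbed wire
        return self.cell_b(torch.cat([y, c], dim=1))    # second NAND

if __name__ == "__main__":
    # Exhaustive truth table of a 2-input NAND cell.
    x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    t = torch.tensor([[1.], [1.], [1.], [0.]])
    nand = train_cell_net(x, t)
    pe = TwoCellPENet(nand)
    a, b, c = (torch.tensor([[1.]]) for _ in range(3))
    print("fault-free output:", pe(a, b, c).item())

In this sketch only wire_bias has requires_grad=True, mirroring the abstract's setup in which the pre-trained Cell-Nets stay fixed and training acts solely on the per-connection biases for a given workload.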
Pages: 73-82
Page count: 10