On the Instability of Softmax Attention-Based Deep Learning Models in Side-Channel Analysis

Cited by: 2
Authors
Hajra, Suvadeep [1 ]
Alam, Manaar [2 ]
Saha, Sayandeep [3 ]
Picek, Stjepan [4 ]
Mukhopadhyay, Debdeep [1 ,5 ]
Affiliations
[1] Indian Inst Technol Kharagpur, Dept Comp Sci & Engn, Kharagpur 721302, India
[2] New York Univ Abu Dhabi, Ctr Cyber Secur, Abu Dhabi, U Arab Emirates
[3] Univ Catholic Louvain, UCL Crypto Grp, ICTEAM/ELEN, B-1348 Ottignies-Louvain-la-Neuve, Belgium
[4] Radboud Univ Nijmegen, Digital Secur Grp, NL-6525 XZ Nijmegen, Netherlands
[5] New York Univ Abu Dhabi, Sch Comp Engn, Abu Dhabi, U Arab Emirates
Keywords
Convolutional neural networks; Training; Noise measurement; Feature extraction; Signal to noise ratio; Recurrent neural networks; Computational modeling; Side-channel analysis; deep learning; softmax attention; multi-head attention
DOI
10.1109/TIFS.2023.3326667
CLC number
TP301 [Theory and Methods]
Subject classification code
081202
Abstract
In side-channel analysis (SCA), Points-of-Interest (PoIs), i.e., the informative sample points, remain sparsely scattered across the whole side-channel trace. Several works in the SCA literature have demonstrated that attack efficacy can be significantly improved by combining information from these sparsely occurring PoIs. In deep learning (DL), a common mechanism for combining information from sparsely occurring PoIs is softmax attention. This work studies the training instability of softmax attention-based CNN models on long traces. We show that softmax attention-based CNN models suffer from unstable training when applied to longer traces (e.g., traces longer than 10,000 sample points). We also explore the use of batch normalization and multi-head softmax attention to stabilize the CNN models. Our results show that using a large number of batch normalization layers and/or multi-head softmax attention (replacing the vanilla softmax attention) can make the models significantly more stable, resulting in better attack efficacy. Moreover, our models achieve similar or better results (up to an 85% reduction in the minimum number of traces required to reach a guessing entropy of 1) than the state of the art on several synchronized and desynchronized datasets. Finally, by plotting the loss surfaces of the DL models, we demonstrate that using multi-head softmax attention instead of vanilla softmax attention in the CNN models makes the loss surface significantly smoother.
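To make the abstract's contrast concrete, the following is a minimal PyTorch sketch of the two pooling mechanisms it compares: vanilla softmax attention (one attention score per time step over the whole trace) versus multi-head softmax attention (independent scores per head over slices of the feature dimension). The module names, head count, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed architecture, not the paper's exact code):
# attention pooling over CNN features of a long side-channel trace.
import torch
import torch.nn as nn

class SoftmaxAttentionPool(nn.Module):
    """Vanilla softmax attention: one score per time step, weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)    # (batch, time, 1)
        return (w * x).sum(dim=1)                  # (batch, dim)

class MultiHeadAttentionPool(nn.Module):
    """Multi-head variant: each head computes its own softmax over time
    and pools its own slice of the feature dimension."""
    def __init__(self, dim, heads=4):              # head count is a guess
        super().__init__()
        assert dim % heads == 0
        self.heads, self.hdim = heads, dim // heads
        self.score = nn.Linear(dim, heads)

    def forward(self, x):                          # x: (batch, time, dim)
        b, t, _ = x.shape
        w = torch.softmax(self.score(x), dim=1)    # (batch, time, heads)
        xh = x.view(b, t, self.heads, self.hdim)   # split features per head
        out = (w.unsqueeze(-1) * xh).sum(dim=1)    # (batch, heads, hdim)
        return out.flatten(1)                      # (batch, dim)

# Toy usage: features from conv layers applied to a long trace.
feats = torch.randn(8, 2500, 128)                  # (batch, time, dim)
pooled = MultiHeadAttentionPool(128, heads=4)(feats)  # (8, 128)
```

The sketch only illustrates the structural difference between the two mechanisms; per the abstract, it is this multi-head replacement (together with batch normalization) that yields stable training and a smoother loss surface on long traces.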
Pages: 514-528
Number of pages: 15