Efficient Implementation of CNTR/CTRU Key Encapsulation Mechanism Based on Cortex-M4

被引:0
作者
Wei H.-Y. [1 ]
Zheng J.-Y. [2 ]
Zhao Y.-L. [1 ,2 ]
机构
[1] College of Computer Science and Technology, Fudan University, Shanghai
[2] Stale Key Laboratory of Cryplology, Beijing
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2024年 / 47卷 / 03期
关键词
ARM Cortex-M4 implementation; key encapsulation mechanism; number; polynomial arithmetic; post-quantum cryptography; theoretic transform;
D O I
10.11897/SP.J.1016.2024.00589
中图分类号
学科分类号
摘要
The rapid advancement of quantum computing technology poses an imminent and formidable threat to the foundations of our existing public key cryptography systems. As researchers grapple with the urgent need to safeguard our digital infrastructure against the imminent threat of quantum attacks, the concept of Post-Quantum Cryptography(PQC) has emerged as a dynamic and thriving field of study. By exploring innovative cryptographic techniques and mathematical constructs, researchers in the field of PQC strive to ensure the long-term security and resilience of our digital communication systems in the face of the impending quantum revolution. At present, the security issues of the Internet of Things (IoT) have attracted much attention. ARM Cortex-M4, as a low-power embedded processor, is widely used in IoT devices, thus deploying post-quantum cryptography algorithms on it will provide more reliable guarantees for the security of IoT devices. CNTR and CTRU are NTRU lattice-based Key Encapsulation Mechanisms (KEM) proposed by Chinese scholars. These schemes offer comprehensive advantages in terms of security and performance compared to lattice-based KEMs based on the Learning With Errors (LWE) technical route. They have received funding from the Cryptography Standardization Technical Committee (CSTC) in China. The primary focus of this research paper is to efficaciously and succinctly implement the CNTR and CTRU schemes on the ARM Cortex-M4 platform. Leveraging the power of Single Instruction Multiple Data (SIMD) instructions, strategically adjusting the operation structure, and optimizing the instruction arrangement, we have succeeded in significantly augmenting the speed and optimizing the stack usage of these algorithms. The main contributions of this paper include: Firstly, this paper introduces the implementation of the time-consuming module polynomial Central Binomial Distribution (CBD) sampling on ARM Cortex-M4 for the first time. This implementation accelerates the sampling process by an impressive 32. 49%. Additionally, mixed-radix Number Theoretic Transform (NTT) is employed to expedite NTT-unfriendly polynomial multiplication. By fully utilizing the Floating-Point Unit (FPU) registers and employing layer merging in NTT to minimize time-consuming operations like load and store, the speed of NTT and inverse NTT experiences substantial improvements of 84. 24% and 81. 15% respectively. Moreover, this research employs coefficient range analysis during the NTT process to delay reduction, thereby reducing the number of reductions required. Furthermore, it utilizes improved Barrett reduction and Montgomery reduction techniques to minimize the number of assembly instructions involved. Polynomial inversion is achieved through loop unrolling, optimizing the time-consuming process and resulting in a speed optimization rate of 68. 85%. To accelerate the decryption process, which involves NTT-unfriendly prime modulus polynomial ring multiplication, a combination of multi-moduli NTT and the Chinese Remainder Theorem (CRT) is employed, providing a remarkable speed improvement of 96.26%. Additionally, space multiplexing is utilized to optimize stack usage, resulting in stack usage reductions of 29. 86% and 28. 17% for CNTR and CTRU respectively. Experimental results demonstrate that the proposed optimization techniques significantly enhance algorithm efficiency. Compared to the C reference implementation, CNTR and CTRU exhibit overall speed optimization rates of 85. 54% and 85. 56% respectively. Furthermore, in a comparative analysis with the state-of-the-art implementation of other lattice-based key encapsulation mechanisms on the ARM Cortex-M4 platform, the optimized implementation presented in this paper clearly demonstrates comprehensive advantages across multiple dimensions, including speed, efficient utilization of space, and robust security measures. © 2024 Science Press. All rights reserved.
引用
收藏
页码:589 / 607
页数:18
相关论文
共 39 条
[1]  
Shor P W., Algorithms for quantum computation: Discrete logarithms and factoring, Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pp. 124-134, (1994)
[2]  
Post Quantum Cryptography Round 3 Submissions
[3]  
PQC standardization process: Announcing four candidates to be standardized, plus fourth round candidates
[4]  
Regev O., On lattices, learning with errors, random linear codes, and cryptography, Journal of the ACM, 56, 6, pp. 1-40, (2009)
[5]  
Lyubashevsky V, Peikert C, Regev O., On ideal lattices and learning with errors over rings, Journal of the ACM, 60, 6, pp. 1-35, (2013)
[6]  
Langlois A, Stehle D., Worst-case to average-case reductions for module lattices, Designs, Codes and Cryptography, 75, 3, pp. 565-599, (2015)
[7]  
Banerjee A, Peikert C, Rosen A., Pseudorandom functions and lattices, Proceedings of the Advances in Cryptology— EUROCRYPT 2012: 31st Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 719-737, (2012)
[8]  
Alperin-Sheriff J, Apon D., Dimension-preserving reductions from LWE to LWR, Cryptology ePrint Archive, (2016)
[9]  
Iloffstein J, Pipher J, Silverman J, NTRU: A ring-based public key cryptosystem, Proceedings of the Algorithmic Number Theory: Third International Symposiun
[10]  
Chen C, Danba O, Iloffstein J, Et al., NTRU algorithm specifications and supporting documentation, (2020)