LLM4TDG: test-driven generation of large language models based on enhanced constraint reasoning

Cited: 0
Authors
Liu, Jingqiang [1,2]
Liang, Ruigang [1]
Zhu, Xiaoxi [1]
Zhang, Yue [1,2]
Liu, Yuling [1,2]
Liu, Qixu [1]
Affiliations
[1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
[2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Keywords
Large language model; Test-driven development; Software development; Software supply chain security
DOI
10.1186/s42400-024-00335-4
CLC number
TP [Automation technology, computer technology]
Subject classification
0812
Abstract
As modern software development paradigms evolve, component reuse and low-code approaches have become mainstream. However, developers often lack an in-depth understanding of reused code, and because reused components cannot run autonomously, software functionality and security are tested insufficiently. This widens the gap between increasingly complex software architectures and the demand for accurate, efficient automated testing, and in turn increases the frequency of software supply chain security incidents. This paper proposes LLM4TDG, a test-driven generation framework based on large language models (LLMs). By formally defining a constraint dependency graph and converting it into context constraints, the framework strengthens LLMs' ability to understand natural-language descriptions such as test requirements and documentation. Constraint reasoning and backtracking mechanisms are then used to automatically generate test drivers that satisfy the defined constraints. Using the EvalPlus dataset, we evaluate LLM4TDG's test case generation capabilities on four general-purpose LLMs and five code-generation LLMs. The experimental results show that our approach significantly improves LLMs' comprehension of constraints in testing objectives, yielding a 47.62% increase in constraint understanding across 147 testing tasks. LLM4TDG raises the average pass@k metric across all LLMs by 10.41%; CodeQwen-chat improves by up to 18.66% and surpasses the state-of-the-art GPT-4, reaching 92.16% on HumanEval and 87.14% on HumanEval+, which strengthens error correction and functional correctness in test-driven code generation. We further ran our experiments on a dataset of Python third-party libraries containing malicious behavior, in the context of security testing tasks, validating the method's effectiveness and generalization in real-world applications.
Pages: 23
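
The abstract outlines LLM4TDG's pipeline: constraints distilled from natural-language test requirements form a constraint dependency graph, the graph is rendered into the LLM's prompt context, and a constraint reasoning/backtracking loop regenerates the test driver until all constraints hold. The Python sketch below illustrates only that outer loop under stated assumptions; every name in it (Constraint, topo_order, generate_test_driver, the fake_llm stub) is a hypothetical illustration, not the paper's actual API.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Constraint:
    # One constraint distilled from a natural-language test requirement.
    name: str
    description: str                      # human-readable form, fed to the LLM
    check: Callable[[str], bool]          # predicate over a candidate test driver
    depends_on: List[str] = field(default_factory=list)

def topo_order(constraints: List[Constraint]) -> List[Constraint]:
    # Linearize the constraint dependency graph so each constraint
    # appears after the constraints it depends on.
    by_name = {c.name: c for c in constraints}
    seen, ordered = set(), []
    def visit(c: Constraint) -> None:
        if c.name in seen:
            return
        seen.add(c.name)
        for dep in c.depends_on:
            visit(by_name[dep])
        ordered.append(c)
    for c in constraints:
        visit(c)
    return ordered

def generate_test_driver(task: str,
                         constraints: List[Constraint],
                         llm_generate: Callable[[str], str],
                         max_backtracks: int = 3) -> str:
    # Render constraints into the prompt context, ask the model for a
    # driver, and backtrack (re-prompt with the violated constraints as
    # feedback) until every check passes or the budget is exhausted.
    ordered = topo_order(constraints)
    prompt = task + "\nConstraints:\n" + "\n".join(
        f"- {c.description}" for c in ordered)
    for _ in range(max_backtracks + 1):
        candidate = llm_generate(prompt)
        failed = [c for c in ordered if not c.check(candidate)]
        if not failed:
            return candidate
        prompt += "\nViolated constraints: " + ", ".join(c.name for c in failed)
    raise RuntimeError("no candidate satisfied all constraints")

if __name__ == "__main__":
    # Stand-in for a real model call; a deployment would query an LLM here.
    def fake_llm(prompt: str) -> str:
        return "import pytest\n\ndef test_add():\n    assert add(1, 2) == 3\n"
    cs = [
        Constraint("imports_pytest", "the driver must import pytest",
                   lambda code: "import pytest" in code),
        Constraint("calls_target", "the driver must invoke add()",
                   lambda code: "add(" in code,
                   depends_on=["imports_pytest"]),
    ]
    print(generate_test_driver("Write a test driver for add(a, b).", cs, fake_llm))

Feeding the names of violated constraints back into the prompt is the simplest form of backtracking; the paper's mechanism reasons over the dependency graph itself, so this sketch captures only the retry structure, not the reasoning step.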