LLM4TDG: test-driven generation of large language models based on enhanced constraint reasoning

Times cited: 0
Authors
Liu, Jingqiang [1 ,2 ]
Liang, Ruigang [1 ]
Zhu, Xiaoxi [1 ]
Zhang, Yue [1 ,2 ]
Liu, Yuling [1 ,2 ]
Liu, Qixu [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100085, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 100049, Peoples R China
Keywords
Large language model; Test-driven development; Software development; Software supply chain security
DOI
10.1186/s42400-024-00335-4
CLC number
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
With the evolution of modern software development paradigms, component reuse and low-code approaches have become mainstream in software development. However, developers often lack an in-depth understanding of the code they reuse, and because components cannot operate autonomously, software functionality and security are insufficiently tested. This widens the gap between increasingly complex software architectures and the demand for accurate, efficient automated testing, and in turn increases the frequency of software supply chain security incidents. This paper proposes LLM4TDG, a test-driven generation framework based on large language models (LLMs). By formally defining a constraint dependency graph and converting it into context constraints, the framework enhances LLMs' ability to understand natural-language descriptions such as test requirements and documentation. Constraint reasoning and backtracking mechanisms are then used to automatically generate test drivers that satisfy the defined constraints. Using the EvalPlus dataset, we evaluate the test-case-generation capabilities of LLM4TDG on four general-purpose LLMs and five code-generation LLMs. The experimental results indicate that our approach significantly strengthens LLMs' comprehension of constraints in testing objectives, achieving a 47.62% increase in constraint understanding across 147 testing tasks. LLM4TDG improves the average pass@k of all LLMs by 10.41%, and the pass@k of CodeQwen-chat by up to 18.66%, reaching 92.16% on HumanEval and 87.14% on HumanEval+, which surpasses the state-of-the-art GPT-4 and enhances error correction and functional correctness in test-driven code generation. We also conducted experiments on a dataset of Python third-party libraries containing malicious behavior, in the context of security testing tasks, validating the method's effectiveness and generalization in real-world applications.
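For readers unfamiliar with pass@k, the metric reported throughout the abstract, the sketch below shows the standard unbiased estimator from Chen et al. (2021), commonly used with EvalPlus-style benchmarks; the function name and the example numbers are illustrative, not taken from the paper.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k (Chen et al., 2021): probability that at least one
        # of k samples, drawn without replacement from n generations of which
        # c are correct, passes. Equals 1 - C(n-c, k) / C(n, k); the product
        # form below avoids overflow in the binomial coefficients.
        if n - c < k:
            return 1.0  # fewer than k failures exist, so some sample passes
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 200 generations per task, 60 of them correct.
    print(pass_at_k(n=200, c=60, k=1))    # 0.30
    print(pass_at_k(n=200, c=60, k=10))

The abstract also describes generating test drivers via constraint reasoning and backtracking over a constraint dependency graph. The paper's actual procedure is not given in this record; the following minimal sketch only illustrates the general technique, and every name in it (domains, constraints, satisfied, backtrack) is a hypothetical stand-in, not LLM4TDG's API.

    # Hypothetical dependency between two parameters of a function under
    # test: 'length' must equal len('data').
    domains = {
        "length": range(0, 5),
        "data": ["", "a", "ab", "abc", "abcd"],
    }
    constraints = [lambda a: a["length"] == len(a["data"])]

    def satisfied(assignment):
        # Check only constraints whose variables are all bound; a KeyError
        # means the constraint is not yet instantiated, so it is skipped.
        for c in constraints:
            try:
                if not c(assignment):
                    return False
            except KeyError:
                continue
        return True

    def backtrack(assignment, remaining):
        # Depth-first search over node assignments, pruning any branch that
        # already violates an instantiated constraint (the backtracking step).
        if not remaining:
            yield dict(assignment)
            return
        node, rest = remaining[0], remaining[1:]
        for value in domains[node]:
            assignment[node] = value
            if satisfied(assignment):
                yield from backtrack(assignment, rest)
            del assignment[node]

    # Each yielded assignment is a candidate set of test-driver inputs.
    for args in backtrack({}, ["length", "data"]):
        print(args)   # e.g. {'length': 2, 'data': 'ab'}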
Pages: 23