Can Large Language Models (LLMs) Compete with Human Requirements Reviewers? - Replication of an Inspection Experiment on Requirements Documents

Cited: 0
Authors
Seifert, Daniel [1 ]
Joeckel, Lisa [1 ]
Trendowicz, Adam [1 ]
Ciolkowski, Marcus [2 ]
Honroth, Thorsten [1 ]
Jedlitschka, Andreas [1 ]
Affiliations
[1] Fraunhofer Inst Expt SE IESE, Fraunhofer Pl 1, D-67663 Kaiserslautern, Germany
[2] QAware GmbH, Aschauer St 30, D-81549 Munich, Germany
Source
PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT, PROFES 2024 | 2025 / Vol. 15452
Keywords
Artificial Intelligence; Machine Learning; Requirements Engineering; Quality Assurance; Study
DOI
10.1007/978-3-031-78386-9_3
Chinese Library Classification (CLC)
TP31 [Computer Software]
Subject Classification Codes
081202; 0835
Abstract
The use of large language models (LLMs) in software engineering is growing, especially for code, typically to generate code or to detect or fix quality problems. Because requirements are usually written in natural language, it seems promising to exploit the capabilities of LLMs to detect problems in requirements. We replicated an inspection experiment in which computer science students searched for defects in requirements documents using different reading techniques. In our replication, we used the LLM GPT-4-Turbo instead of students to determine how the model compares to human reviewers. For one research question, we additionally considered GPT-3.5-Turbo, Nous-Hermes-2-Mixtral-8x7B-DPO, and Phi-3-medium-128k-instruct. We focused on single-prompt approaches and avoided more complex setups in order to mimic the original study design, in which students received all the material at once. The study comprised two phases. First, we explored the general feasibility of using LLMs for requirements inspection on a practice document and examined different prompts. Second, we applied selected approaches to two requirements documents and compared them to each other and to human reviewers. The approaches vary in the reading technique (ad hoc, perspective-based, checklist-based), the LLM, the instructions, and the material provided. We found that the LLMs (a) report only a limited number of deficits despite having enough tokens available, (b) show little variation across prompts, and (c) rarely match the sample solution.
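To make the single-prompt setup described in the abstract concrete, the following is a minimal sketch in Python, assuming the OpenAI chat completions API. The checklist text, the system instructions, the file name, and the inspect helper are illustrative assumptions, not the authors' actual prompts or experimental material.

    # Minimal sketch of a single-prompt, checklist-based requirements inspection.
    # Assumptions: OpenAI Python SDK (>= 1.0), an API key in OPENAI_API_KEY, and
    # a plain-text requirements document; all prompt wording is illustrative only.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical checklist; the checklist used in the experiment is not reproduced here.
    CHECKLIST = """\
    1. Is every requirement unambiguous?
    2. Is every requirement verifiable?
    3. Are the requirements free of contradictions?
    4. Is any required functionality missing?
    """

    def inspect(document: str, model: str = "gpt-4-turbo") -> str:
        """Send the whole document in a single prompt, mirroring the one-shot
        hand-over of all material to student reviewers in the original study."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "You are a requirements reviewer. Apply the checklist "
                            "and list every defect you find, one per line."},
                {"role": "user",
                 "content": f"Checklist:\n{CHECKLIST}\nRequirements document:\n{document}"},
            ],
            temperature=0,  # favor reproducible output when comparing runs
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(inspect(Path("requirements.txt").read_text()))

Swapping the checklist for a role description (perspective-based) or removing it entirely (ad hoc) would correspond to the other reading-technique variants mentioned in the abstract.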
Pages: 27-42
Number of pages: 16