Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index

Cited: 0
Authors
Christakis, Nicholas [1 ]
Drikakis, Dimitris [1 ]
Affiliations
[1] Univ Nicosia, Inst Adv Modeling & Simulat, CY-2417 Nicosia, Cyprus
Source
APPLIED SCIENCES-BASEL | 2025 / Vol. 15 / Issue 07
Keywords
LLM; forecasting; inference; time series; LSTM; artificial intelligence
DOI
10.3390/app15073784
Chinese Library Classification
O6 [Chemistry]
Discipline Classification Code
0703
Abstract
This study introduces a new methodology for an Inference Index (InI) called the Inference Index In Testing Model Effectiveness methodology (INFINITE), aiming to evaluate the performance of Large Language Models (LLMs) in code generation tasks. The InI index provides a comprehensive assessment focusing on three key components: efficiency, consistency, and accuracy. This approach encapsulates time-based efficiency, response quality, and the stability of model outputs, offering a thorough understanding of LLM performance beyond traditional accuracy metrics. We apply this methodology to compare OpenAI's GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in generating Python code for two tasks: a data-cleaning and statistical computation task and a Long Short-Term Memory (LSTM) model generation task for forecasting meteorological variables such as temperature, relative humidity, and wind speed. Our findings demonstrate that GPT outperforms OAI1 and performs comparably to OAI3 regarding accuracy and workflow efficiency. The study reveals that LLM-assisted code generation can produce results similar to expert-designed models with effective prompting and refinement. GPT's performance advantage highlights the benefits of widespread use and user feedback. These findings contribute to advancing AI-assisted software development, providing a structured approach for evaluating LLMs in coding tasks and setting the groundwork for future studies on broader model comparisons and expanded assessment frameworks.
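The abstract describes the InI index as a composite of three components (efficiency, consistency, accuracy) but does not state the aggregation formula. A minimal sketch of such a composite score, assuming normalized component scores in [0, 1] and equal weights (the function name, weights, and normalization are illustrative assumptions, not the paper's definition):

```python
def inference_index(efficiency: float, consistency: float, accuracy: float,
                    weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of three component scores, each normalized to [0, 1].

    An illustrative composite; the INFINITE methodology's actual
    aggregation may differ.
    """
    scores = (efficiency, consistency, accuracy)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("component scores must be normalized to [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))


# Example: a model that is fast (0.9), stable (0.8), and accurate (0.85)
print(round(inference_index(0.9, 0.8, 0.85), 4))  # 0.85
```

Under this equal-weight assumption, a model weak on any one axis (e.g. slow inference) is penalized proportionally, which matches the abstract's aim of assessing LLMs "beyond traditional accuracy metrics".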
Pages: 24