An Empirical Study of Code Smells in Transformer-based Code Generation Techniques

Cited by: 25
Authors
Siddiq, Mohammed Latif [1 ]
Majumder, Shafayat H. [2 ]
Mim, Maisha R. [2 ]
Jajodia, Sourov [2 ]
Santos, Joanna C. S. [1 ]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Bangladesh Univ Engn & Technol, Dept Comp Sci, Dhaka, Bangladesh
Source
2022 IEEE 22ND INTERNATIONAL WORKING CONFERENCE ON SOURCE CODE ANALYSIS AND MANIPULATION (SCAM 2022) | 2022
Keywords
code generation; code smell; security smell; transformer; pre-trained model; GitHub Copilot; SOFTWARE
DOI
10.1109/SCAM55253.2022.00014
Chinese Library Classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Prior works have developed transformer-based language learning models to automatically generate source code for a task without compilation errors. The datasets used to train these techniques include samples from open-source projects, which may not be free of security flaws, code smells, and violations of standard coding practices. Therefore, we investigate to what extent code smells are present in the datasets of code generation techniques and verify whether they leak into the output of these techniques. To conduct this study, we used Pylint and Bandit to detect code smells and security smells in three widely used training sets (CodeXGlue, APPS, and Code Clippy). We observed that Pylint caught 264 code smell types, whereas Bandit located 44 security smell types, in these three datasets used for training code generation techniques. By analyzing the output of ten different configurations of the open-source, fine-tuned, transformer-based GPT-Neo model with 125M parameters, we observed that this model leaked the smells and non-standard practices into the generated source code. When analyzing the suggestions of GitHub Copilot, a closed-source code generation tool, we observed that they contained 18 code smell types, including substandard coding patterns, and 2 security smell types.
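The abstract describes screening training data and generated code with Pylint and Bandit. Below is a minimal, illustrative Python sketch of how such a per-sample check could be wired up; it is a sketch under stated assumptions, not the authors' actual pipeline. It assumes both linters are installed and invoked through their command-line interfaces, and the SNIPPET string is a hypothetical stand-in for one generated sample.

import json
import subprocess
import tempfile

# Hypothetical generated sample; calling subprocess with shell=True is a known Bandit finding (B602).
SNIPPET = '''
import subprocess

def run(cmd):
    return subprocess.call(cmd, shell=True)
'''

def pylint_smells(path):
    # Pylint exits non-zero whenever it reports messages, so the return code is ignored here.
    proc = subprocess.run(["pylint", "--output-format=json", path],
                          capture_output=True, text=True)
    return sorted({msg["symbol"] for msg in json.loads(proc.stdout or "[]")})

def bandit_smells(path):
    # Bandit's JSON report lists each finding under "results" with a "test_id" such as B602.
    proc = subprocess.run(["bandit", "-f", "json", path],
                          capture_output=True, text=True)
    report = json.loads(proc.stdout or "{}")
    return sorted({result["test_id"] for result in report.get("results", [])})

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(SNIPPET)
        sample_path = tmp.name
    print("Pylint code smells:    ", pylint_smells(sample_path))
    print("Bandit security smells:", bandit_smells(sample_path))

Applying this to dataset samples or model outputs would simply mean pointing the two helpers at each file and aggregating the returned smell identifiers, which mirrors the per-sample counting reported in the abstract.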
Pages: 71 - 82
Number of pages: 12