Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs

Cited by: 43
Authors
Patra, Jibesh [1]
Pradel, Michael [1]
Affiliations
[1] Univ Stuttgart, Stuttgart, Germany
Source
PROCEEDINGS OF THE 29TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE '21) | 2021
Funding
European Research Council;
Keywords
bugs; bug injection; machine learning; dataset; token embeddings; CODE;
DOI
10.1145/3468264.3468623
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
When working on techniques to address the widespread problem of software bugs, one often faces the need for a large number of realistic bugs in real-world programs. Such bugs can either help evaluate an approach, e.g., in the form of a bug benchmark or a suite of program mutations, or even help build the technique, e.g., in learning-based bug detection. Because gathering a large number of real bugs is difficult, a common approach is to rely on automatically seeded bugs. Prior work seeds bugs based on syntactic transformation patterns, which often results in unrealistic bugs and typically cannot introduce new, application-specific code tokens. This paper presents SemSeed, a technique for automatically seeding bugs in a semantics-aware way. The key idea is to imitate what a given real-world bug would look like in other programs by semantically adapting the bug pattern to the local context. To reason about the semantics of pieces of code, our approach builds on learned token embeddings that encode the semantic similarities of identifiers and literals. Our evaluation with real-world JavaScript software shows that the approach effectively reproduces real bugs and clearly outperforms a semantics-unaware approach. The seeded bugs are useful as training data for learning-based bug detection, where they significantly improve the bug detection ability. Moreover, we show that SemSeed-created bugs complement existing mutation testing operators, and that our approach is efficient enough to seed hundreds of thousands of bugs within an hour.
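The key idea in the abstract, adapting a real bug's token change to the identifiers available at a new location via embedding similarity, can be illustrated with a toy sketch. This is not SemSeed's actual algorithm or code: the embedding values are hand-written rather than learned, and the names `EMBEDDINGS`, `cosine`, `most_similar`, and `seed_bug` are hypothetical, introduced only for illustration.

```python
import math

# ASSUMPTION: tiny hand-written vectors standing in for learned token embeddings.
EMBEDDINGS = {
    "width":  [0.90, 0.10, 0.00],
    "height": [0.80, 0.20, 0.10],
    "w":      [0.85, 0.15, 0.00],
    "h":      [0.78, 0.22, 0.12],
    "name":   [0.00, 0.90, 0.30],
    "i":      [0.10, 0.10, 0.90],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(token, candidates, embeddings=EMBEDDINGS):
    """Return the candidate closest to `token` in embedding space, or None."""
    known = [c for c in candidates if c in embeddings]
    if token not in embeddings or not known:
        return None
    return max(known, key=lambda c: cosine(embeddings[token], embeddings[c]))


def seed_bug(target_tokens, position, buggy_token, local_identifiers):
    """Hypothetical helper: replay a 'wrong identifier' bug at `position`.

    `buggy_token` is the identifier the original real-world bug introduced.
    It is mapped to the most similar identifier in the target's local context.
    Returns the mutated token list, or None if no real change results.
    """
    replacement = most_similar(buggy_token, local_identifiers)
    if replacement is None or replacement == target_tokens[position]:
        return None
    mutated = list(target_tokens)
    mutated[position] = replacement
    return mutated


# The original bug wrongly used `width`; the target code only knows `w` and `h`.
target = ["area", "=", "w", "*", "h"]
seeded = seed_bug(target, 4, "width", ["area", "w", "h", "i"])
print(" ".join(seeded))
```

In this sketch the identifier `width` from the original bug does not occur in the target code, so it is mapped to its nearest local neighbor before being substituted. A purely syntactic pattern could only reuse the literal token from the original bug.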
Pages: 906-918
Number of pages: 13
Related papers
61 in total
  • [1] Allamanis, 2018, The adverse effects of code duplication in machine learning models of code
  • [2] Allamanis Miltiadis, 2016, ARXIV PREPRINT ARXIV
  • [3] Alon U, 2018, ACM SIGPLAN NOTICES, V53, P404, DOI [10.1145/3296979.3192412, 10.1145/3192366.3192412]
  • [4] [Anonymous], 2008, P NDSS
  • [5] [Anonymous], 2014, P 2014 INT S SOFTWAR
  • [6] Arora S., 2016, P INT C LEARN REPR
  • [7] Getafix: Learning to Fix Bugs Automatically
    Bader, Johannes
    Scott, Andrew
    Pradel, Michael
    Chandra, Satish
    [J]. PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2019, 3 (OOPSLA):
  • [8] Coverage-Based Greybox Fuzzing as Markov Chain
    Bohme, Marcel
    Van-Thuan Pham
    Roychoudhury, Abhik
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2019, 45 (05) : 489 - 506
  • [9] Where Is the Bug and How Is It Fixed? An Experiment with Practitioners
    Bohme, Marcel
    Soremekun, Ezekiel O.
    Chattopadhyay, Sudipta
    Ugherughe, Emamurho
    Zeller, Andreas
    [J]. ESEC/FSE 2017: PROCEEDINGS OF THE 2017 11TH JOINT MEETING ON FOUNDATIONS OF SOFTWARE ENGINEERING, 2017, : 117 - 128
  • [10] Bojanowski P., 2017, T ASSOC COMPUT LING, V5, P135, DOI 10.1162/tacl_a_00051