FlakeFlagger: Predicting Flakiness Without Rerunning Tests

Cited by: 55
Authors
Alshammari, Abdulrahman [1 ]
Morris, Christopher [2 ]
Hilton, Michael [2 ]
Bell, Jonathan [3 ]
Affiliations
[1] George Mason Univ, Fairfax, VA 22030 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Northeastern Univ, Boston, MA 02115 USA
Source
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE 2021), 2021
DOI
10.1109/ICSE43902.2021.00140
Chinese Library Classification
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
When developers make changes to their code, they typically run regression tests to detect if their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests, and hence, it is important to know which tests are flaky. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.
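The prediction step described in the abstract, representing each test as a vector of behavioral features and training a classifier to flag likely-flaky tests scored by F1, can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names, the synthetic data, and the choice of a random-forest model below are illustrative assumptions only.

```python
# Minimal sketch of feature-based flakiness prediction (not FlakeFlagger's
# actual feature set or model; all data here is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-test behavioral features, e.g.:
# [runtime_seconds, lines_covered, uses_network, uses_filesystem, num_assertions]
X = rng.random((500, 5))
# Hypothetical labels: 1 = known flaky (observed both passing and failing
# across reruns), 0 = stable.
y = (rng.random(500) < 0.15).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# F1 balances precision (few false alarms) and recall (few missed flaky
# tests), the per-project trade-off the paper evaluates.
print("F1:", f1_score(y_test, clf.predict(X_test)))
```

In the real pipeline, such features would be collected from instrumented test-suite runs, and the tests predicted flaky would then be handed to a more expensive rerun-based detector for confirmation, which is the workflow the abstract describes.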
Pages: 1572-1584
Page count: 13