FlakeFlagger: Predicting Flakiness Without Rerunning Tests

Cited by: 55
Authors
Alshammari, Abdulrahman [1 ]
Morris, Christopher [2 ]
Hilton, Michael [2 ]
Bell, Jonathan [3 ]
Affiliations
[1] George Mason Univ, Fairfax, VA 22030 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Northeastern Univ, Boston, MA 02115 USA
Source
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE 2021), 2021
DOI
10.1109/ICSE43902.2021.00140
Chinese Library Classification
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
When developers make changes to their code, they typically run regression tests to detect if their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests, and hence, it is important to know which tests are flaky. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.
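The prediction step described in the abstract, representing each test as a vector of behavioral features and training a classifier to flag likely-flaky tests scored by F1, can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names, the synthetic data, and the choice of a random-forest model below are illustrative assumptions only.

```python
# Minimal sketch of feature-based flakiness prediction (not FlakeFlagger's
# actual feature set or model; all data here is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-test behavioral features, e.g.:
# [runtime_seconds, lines_covered, uses_network, uses_filesystem, num_assertions]
X = rng.random((500, 5))
# Hypothetical labels: 1 = known flaky (observed both passing and failing
# across reruns), 0 = stable.
y = (rng.random(500) < 0.15).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# F1 balances precision (few false alarms) and recall (few missed flaky
# tests), the per-project trade-off the paper evaluates.
print("F1:", f1_score(y_test, clf.predict(X_test)))
```

In the real pipeline, such features would be collected from instrumented test-suite runs, and the tests predicted flaky would then be handed to a more expensive rerun-based detector for confirmation, which is the workflow the abstract describes.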
Pages: 1572-1584
Page count: 13