Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests

Cited by: 22
Authors
Fatima, Sakina [1 ]
Ghaleb, Taher A. [1 ]
Briand, Lionel [1 ,2 ]
Affiliations
[1] Univ Ottawa, Sch EECS, Ottawa, ON K1N 6N5, Canada
[2] Univ Luxembourg, SnT Ctr Secur Reliabil & Trust, L-4365 Esch Sur Alzette, Luxembourg
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Flaky tests; software testing; black-box testing; natural language processing; CodeBERT;
DOI
10.1109/TSE.2022.3201209
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times, which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases, thus requiring no (a) access to production code (black-box), (b) rerunning of test cases, or (c) pre-defined features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and 73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively.
Similarly, Flakify achieved F1-scores of 98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and 18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing, by the same percentages, the cost otherwise wasted on unnecessarily debugging test cases and production code (corresponding to reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction accuracy when used to predict test cases on new projects, suggesting better generalizability than FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
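The reported figures fit together through simple precision/recall arithmetic. The sketch below, assuming hypothetical FlakeFlagger baselines of roughly 60% precision and 72% recall (values chosen only to be consistent with the stated +10 pp and +18 pp gains; the abstract does not give the baselines themselves), shows how the 25% and 64% reduction rates and the 79% F1-score follow from the abstract's numbers.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall, both in percent.
    return 2 * precision * recall / (precision + recall)

def reduction_rate(baseline, improved):
    # Fraction of the baseline's "wasted" share (100 - baseline)
    # that the percentage-point improvement eliminates.
    return (improved - baseline) / (100 - baseline)

# Hypothetical baselines chosen to match the reported +10 pp / +18 pp gains
prec_base, prec_new = 60, 70   # precision, in percent
rec_base, rec_new = 72, 90     # recall, in percent

print(round(reduction_rate(prec_base, prec_new) * 100))  # 25
print(round(reduction_rate(rec_base, rec_new) * 100))    # 64
print(round(f1(prec_new, rec_new)))                      # 79
```

Only the percentage-point gains, reduction rates, and F1-scores come from the abstract; the absolute baseline values are illustrative assumptions.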
Pages: 1912-1927
Page count: 16