Who Judges the Judge: An Empirical Study on Online Judge Tests

Cited by: 3
Authors
Liu, Kaibo [1]
Han, Yudong [1]
Zhang, Jie M. [2]
Chen, Zhenpeng [3]
Sarro, Federica [3]
Harman, Mark [3]
Huang, Gang [4]
Ma, Yun [1]
Affiliations
[1] Peking University, Beijing, China
[2] King's College London, London, England
[3] University College London (UCL), London, England
[4] Peking University, National Key Laboratory of Data Space Technology and Systems, Beijing, China
Source
PROCEEDINGS OF THE 32ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2023 | 2023
Keywords
Online judge platform; software testing; test assessment
DOI
10.1145/3597926.3598060
CLC Number
TP31 [Computer Software]
Discipline Code
081202; 0835
Abstract
Online Judge platforms play a pivotal role in education, competitive programming, recruitment, career training, and large language model training. They rely on predefined test suites to judge the correctness of submitted solutions. It is therefore important that the solution judgement is reliable and free from potentially misleading false positives (i.e., incorrect solutions that are judged as correct). In this paper, we conduct an empirical study of 939 coding problems with 541,552 solutions, all of which are judged to be correct according to the test suites used by the platform, finding that 43.4% of the problems include false positive solutions (3,440 bugs are revealed in total). We also find that test suites are, nevertheless, of high quality according to widely-studied test effectiveness measurements: 88.2% of false positives have perfect (100%) line coverage, 78.9% have perfect branch coverage, and 32.5% have a perfect mutation score. Our findings indicate that more work is required to weed out false positive solutions and to further improve test suite effectiveness. We have released the detected false positive solutions and the generated test inputs to facilitate future research.
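To make the notion of a false positive concrete, the following minimal sketch (illustrative only; not taken from the paper or its dataset) shows how an incorrect solution can pass a hypothetical problem's entire predefined test suite with perfect line coverage and still be wrong on an input the suite never exercises:

# Hypothetical "maximum of a list" problem: the solution below is wrong
# for all-negative inputs, yet the judge's predefined tests never expose it.
def solution(nums):
    best = 0  # Bug: assumes the maximum is non-negative.
    for n in nums:
        if n > best:
            best = n
    return best

# Hypothetical predefined test suite used by the judge.
TEST_SUITE = [([1, 3, 2], 3), ([5], 5), ([0, 7, 7], 7)]

# The judge accepts the solution: every test passes, and every line of
# `solution` is executed (100% line coverage).
assert all(solution(inp) == want for inp, want in TEST_SUITE)

# Yet the solution is a false positive: an input outside the suite
# reveals the bug.
assert solution([-2, -1]) == 0  # correct answer is -1

This mirrors the study's central observation: conventional adequacy measures such as line coverage can be perfect while the test suite still admits incorrect solutions.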
Pages: 334-346
Number of Pages: 13