Who Judges the Judge: An Empirical Study on Online Judge Tests

Cited by: 3
Authors
Liu, Kaibo [1]
Han, Yudong [1]
Zhang, Jie M. [2]
Chen, Zhenpeng [3]
Sarro, Federica [3]
Harman, Mark [3]
Huang, Gang [4]
Ma, Yun [1]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Kings Coll London, London, England
[3] UCL, London, England
[4] Peking Univ, Natl Key Lab Data Space Technol & Syst, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 32ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2023 | 2023
Keywords
Online judge platform; software testing; test assessment
DOI
10.1145/3597926.3598060
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Online Judge platforms play a pivotal role in education, competitive programming, recruitment, career training, and large language model training. They rely on predefined test suites to judge the correctness of submitted solutions. It is therefore important that the solution judgement is reliable and free from potentially misleading false positives (i.e., incorrect solutions that are judged as correct). In this paper, we conduct an empirical study of 939 coding problems with 541,552 solutions, all of which are judged to be correct according to the test suites used by the platform, finding that 43.4% of the problems include false positive solutions (3,440 bugs are revealed in total). We also find that test suites are, nevertheless, of high quality according to widely-studied test effectiveness measurements: 88.2% of false positives have perfect (100%) line coverage, 78.9% have perfect branch coverage, and 32.5% have a perfect mutation score. Our findings indicate that more work is required to weed out false positive solutions and to further improve test suite effectiveness. We have released the detected false positive solutions and the generated test inputs to facilitate future research.
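
As an illustration of the abstract's central point, the following minimal sketch shows how an incorrect solution can pass every test in a suite while reaching full line and branch coverage, and how randomly generated inputs checked against a trusted reference can expose it. The toy problem (maximum of a non-empty integer list), the names `reference_max`, `submitted_max`, `PLATFORM_TESTS`, and `find_counterexample`, and the random differential-testing approach are all assumptions made for this example; they are not taken from the paper and do not represent its method.

```python
import random


def reference_max(xs):
    """Trusted reference solution (hypothetical): maximum of a non-empty list."""
    return max(xs)


def submitted_max(xs):
    """Hypothetical submitted solution with a latent bug."""
    best = 0  # bug: wrong starting value when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


# Hypothetical platform test suite. It executes every line of submitted_max
# and takes the `if` both ways, yet it never uses an all-negative input.
PLATFORM_TESTS = [([1, 2, 3], 3), ([5, 1], 5), ([2, 9, 4], 9)]


def passes_platform_tests(solution):
    """Return True if the solution matches the expected output on every test."""
    return all(solution(list(inp)) == expected for inp, expected in PLATFORM_TESTS)


def find_counterexample(solution, reference, trials=1000, seed=0):
    """Search random small inputs for one where solution and reference disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        if solution(xs) != reference(xs):
            return xs
    return None


if __name__ == "__main__":
    print("accepted by platform tests:", passes_platform_tests(submitted_max))
    print("counterexample:", find_counterexample(submitted_max, reference_max))
```

In this sketch the submitted solution is a false positive in the abstract's sense: the suite accepts it, and coverage alone gives no hint of the defect, which only appears on inputs the suite never exercises.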
Pages: 334-346
Page count: 13