Syntax and Stack Overflow: A methodology for extracting a corpus of syntax errors and fixes

Cited by: 6
Authors
Wong, Alexander William [1]
Salimi, Amir [1]
Chowdhury, Shaiful [1]
Hindle, Abram [1]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
Source
2019 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME 2019) | 2019
Keywords
stack overflow; natural; syntax errors; python; mining software repositories
DOI
10.1109/ICSME.2019.00048
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
One problem when studying how to find and fix syntax errors is how to get natural and representative examples of syntax errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow, which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, so they are ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human-made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent dataset. We further analyzed our code by executing the corrections in a Python interpreter. We applied our methodology to produce a public dataset of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally, we share our dataset openly so that future researchers can re-use and extend our syntax errors and fixes.
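The parse-based check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `syntax_error_of` and `is_error_fix_pair` and the example snippets are hypothetical, and the sketch assumes that a candidate error/fix pair is an earlier snippet version that fails to parse followed by a later version that parses. It relies only on Python's standard `ast` module and leaves out the interpreter execution step.

```python
import ast
from typing import Optional


def syntax_error_of(snippet: str) -> Optional[str]:
    """Return a short description of the parse error, or None if the snippet parses."""
    try:
        ast.parse(snippet)
    except SyntaxError as err:
        # IndentationError and TabError are subclasses of SyntaxError.
        return f"{type(err).__name__} at line {err.lineno}: {err.msg}"
    except ValueError as err:
        # Some Python versions raise ValueError for source containing null bytes.
        return f"ValueError: {err}"
    return None


def is_error_fix_pair(before: str, after: str) -> bool:
    """Assumed pairing rule: 'before' fails to parse and 'after' parses."""
    return syntax_error_of(before) is not None and syntax_error_of(after) is None


# Hypothetical snippet versions: the first is missing the colon after the signature.
broken = "def greet(name)\n    print('hello', name)\n"
fixed = "def greet(name):\n    print('hello', name)\n"

if __name__ == "__main__":
    print(syntax_error_of(broken))
    print(is_error_fix_pair(broken, fixed))
```

The exact error message depends on the Python version used for parsing, so a corpus built this way would need to record the interpreter version alongside each error.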
Pages: 318-322
Page count: 5