FalconCode: A Multiyear Dataset of Python Code Samples from an Introductory Computer Science Course

Cited by: 7
Authors
de Freitas, Adrian [1]
Coffman, Joel [1]
de Freitas, Michelle [2]
Wilson, Justin [1]
Weingart, Troy [1]
Affiliations
[1] US Air Force Acad, Colorado Springs, CO 80840 USA
[2] Acad Dist 20, Colorado Springs, CO USA
Source
PROCEEDINGS OF THE 54TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, VOL 1, SIGCSE 2023 | 2023
Keywords
computer science education; student code repository; dataset
DOI
10.1145/3545945.3569822
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The lack of large and diverse datasets of student code samples limits some forms of computer science education research. To address this problem, we created FalconCode, a novel collection of over 1.5 million Python programs from over two thousand undergraduate students at the United States Air Force Academy. FalconCode captures over five semesters' worth of code samples from our introduction to computing course, which is taken by every student regardless of their academic major. The dataset contains student code submissions for over 800 programming assignments, as well as additional metadata such as the prompt for each assignment, the test case(s) used to evaluate student submissions, and the specific skills needed to solve each problem. In this paper, we describe the methodology used to create FalconCode and the steps taken to anonymize the data. We then describe FalconCode's data schema and show how it can support a wide range of research, including work that uses machine learning (ML) and artificial intelligence (AI). FalconCode is provided free of charge and is available upon request for computer science education research.
Pages: 938-944
Page count: 7