KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

Cited by: 31
Authors
Quaranta, Luigi [1]
Calefato, Fabio [1]
Lanubile, Filippo [1]
Affiliations
[1] Univ Bari, Bari, Italy
Source
2021 IEEE/ACM 18TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2021) | 2021
Keywords
open dataset; repository; Kaggle; computational notebook; Jupyter
DOI
10.1109/MSR52588.2021.00072
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Computational notebooks have become the tool of choice for many data scientists and practitioners for performing analyses and disseminating results. Despite their increasing popularity, the research community cannot yet count on a large, curated dataset of computational notebooks. In this paper, we fill this gap by introducing KGTorrent, a dataset of Python Jupyter notebooks with rich metadata retrieved from Kaggle, a platform hosting data science competitions for learners and practitioners at any level of expertise. We describe how we built KGTorrent and provide instructions on how to use it and how to refresh the collection to keep it up to date. Our vision is that the research community will use KGTorrent to study how data scientists, especially practitioners, use Jupyter Notebook in the wild and to identify potential shortcomings that can inform the design of its future extensions.
Pages: 550-554
Number of pages: 5