Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code

Cited by: 9
Authors
Yang, Chenyang [1]
Zhou, Shurui [2]
Guo, Jin L. C. [3]
Kastner, Christian [4]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Univ Toronto, Toronto, ON, Canada
[3] McGill Univ, Montreal, PQ, Canada
[4] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE 2021) | 2021
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC); U.S. National Science Foundation (NSF)
Keywords
computational notebook; data wrangling; code comprehension; code summarization;
DOI
10.1109/ASE51524.2021.9678520
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Data scientists reportedly spend a significant amount of their time in their daily routines on data wrangling, i.e. cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, it is easy to introduce subtle bugs when reusing and adopting existing code, which results in reduced model quality. To support data scientists with data wrangling, we present a technique to generate documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide on-demand documentation for many cells in popular notebooks and find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
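
To make the idea of selecting representative examples from execution information more concrete, here is a minimal Python sketch. It is not the authors' implementation: the functions `clean_age` and `representative_examples`, the branch labels, and the sample column are all hypothetical. The toy wrangling step reports which branch each input took, and the selector keeps one input per distinct branch, so that rarely exercised paths (where subtle bugs tend to hide) each get an example.

```python
def clean_age(raw):
    """Toy wrangling step (hypothetical): returns (cleaned value, branch label)."""
    if raw is None or raw == "":
        return None, "missing"
    text = str(raw).strip()
    if text.isdigit():
        return int(text), "plain integer"
    if text.endswith("+") and text[:-1].isdigit():
        return int(text[:-1]), "open-ended"
    return None, "unparseable"


def representative_examples(values, transform):
    """Keep the first input observed for each distinct execution path."""
    examples = {}
    for value in values:
        result, branch = transform(value)
        # setdefault stores the pair only if this branch has not been seen yet
        examples.setdefault(branch, (value, result))
    return examples


# Hypothetical column with a few rare cases mixed into common ones
column = ["34", "27", "", "65+", "n/a", "41", None, "70+"]

for branch, (before, after) in representative_examples(column, clean_age).items():
    print(f"{branch}: {before!r} -> {after!r}")
```

Here the branch label is returned by hand; the abstract describes collecting such execution information through dynamic program analysis instead, which this sketch does not attempt.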
Pages: 304-316
Page count: 13