DataRinse: Semantic Transforms for Data preparation based on Code Mining

Cited by: 1
Authors
Abdelaziz, Ibrahim [1]
Dolby, Julian [1]
Khurana, Udayan [1]
Samulowitz, Horst [1]
Srinivas, Kavitha [1]
Affiliations
[1] IBM Res, Yorktown Hts, NY 10598 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 12
DOI
10.14778/3611540.3611628
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Data preparation is a crucial first step in any data analysis problem. This task is largely manual, performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms through large-scale static analysis of code repositories. Our motivation is that in any large enterprise, multiple personas such as data engineers and data scientists work on similar datasets. However, how to share or reuse their code is not obvious, and doing so is difficult in practice. In this paper, we demonstrate how DataRinse handles data preparation: the system recommends code that helps prepare a column for data analysis in general. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field, so that related transforms appear coherently to a user. It is a human-in-the-loop system in which users select relevant code snippets produced by DataRinse to apply to their dataset.
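The idea of grouping expressions applied to the same field can be illustrated with a small sketch. This is not DataRinse's implementation: the abstract describes large-scale static analysis of code repositories, whereas the example below only walks the syntax tree of a single snippet with Python's standard `ast` module (Python 3.9 or later). The function name `group_transforms_by_column`, the sample snippet, and the pandas-style `df["column"]` access pattern are all assumptions made for illustration.

```python
import ast
from collections import defaultdict


def group_transforms_by_column(source: str) -> dict[str, list[str]]:
    """Map each string-literal column name to the assignments that read it.

    Hypothetical helper: a toy stand-in for field-level grouping, not the
    analysis used by DataRinse.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for stmt in ast.walk(ast.parse(source)):
        if not isinstance(stmt, ast.Assign):
            continue
        # Collect columns read on the right-hand side, e.g. df["age"].
        columns = set()
        for node in ast.walk(stmt.value):
            if (
                isinstance(node, ast.Subscript)
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)
            ):
                columns.add(node.slice.value)
        # File the whole statement under every column it reads.
        for column in sorted(columns):
            groups[column].append(ast.unparse(stmt))
    return dict(groups)


# Assumed sample input: three pandas-style column transforms.
SNIPPET = '''
df["age"] = df["age"].fillna(0)
df["age_bucket"] = df["age"] // 10
df["name"] = df["name"].str.strip()
'''

groups = group_transforms_by_column(SNIPPET)
for column, statements in groups.items():
    print(column, statements)
```

Here the two statements that read the `age` column end up in one group and the `name` statement in another, which is the kind of per-field view a user could then pick snippets from.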
Pages: 4090 - 4093
Page count: 4