The Art and Practice of Data Science Pipelines

Cited by: 21
Authors
Biswas, Sumon [1]
Wardat, Mohammad [1]
Rajan, Hridesh [1]
Affiliations
[1] Iowa State Univ, Ames, IA 50011 USA
Source
2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022) | 2022
Keywords
data science pipelines; data science processes; descriptive; predictive; CHALLENGES;
DOI
10.1145/3510003.3510057
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
An increasing number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition to cleaning/curation to modeling and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and their use in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
Pages: 2091-2103
Page count: 13