Auto-Prep: Efficient and Automated Data Preprocessing Pipeline

Cited by: 14

Authors:

Bilal, Mehwish [1]

Ali, Ghulam [1]

Iqbal, Muhammad Waseem [2]

Anwar, Muhammad [3]

Malik, Muhammad Sheraz Arshad [4]

Kadir, Rabiah Abdul [5]

Affiliations:

[1] Univ Okara, Dept Comp Sci, Okara 56300, Pakistan

[2] Super Univ Lahore, Dept Software Engn, Lahore 54000, Pakistan

[3] Univ Educ, Dept Informat Sci, Div Sci & Technol, Lahore 54000, Pakistan

[4] Govt Coll Univ Faisalabad, Dept Informat Technol, Faisalabad 38000, Pakistan

[5] Univ Kebangsaan Malaysia, Inst IR4.0, Bangi 43600, Selangor, Malaysia

Source:

IEEE Access | 2022 / Vol. 10

Keywords:

Encoding; Data preprocessing; Machine learning; Feature extraction; Data models; Dimensionality reduction; Support vector machines; Pipelines; Automated machine learning; Feature engineering; Dimensionality; Imputation; Selection

DOI:

10.1109/ACCESS.2022.3198662

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Data preprocessing is crucial in the Machine Learning pipeline because the quality of the data, and of the information extracted from it at this stage, directly affects a model's ability to learn. However, each transformation task has many alternative techniques, which can overwhelm an inexperienced user. A simple Python-based Auto-preprocessing architecture for Automated Machine Learning is developed to offer automated, interactive, and data-driven support that helps users perform data preprocessing tasks efficiently. The suggested method provides valuable insights into a dataset and handles standard data preprocessing tasks. First, it detects data problems and presents them to the end user through visualizations. Then, after evaluating state-of-the-art candidate techniques, it recommends the most effective data cleaning and preparation method to the user. For evaluation, the proposed architecture is applied to ten diverse datasets to preprocess the data automatically before passing it to an ML algorithm. The results are then compared with those produced by the same ML algorithm trained on manually preprocessed data. The results show that this approach not only simplifies the whole process but also improves the performance of the model significantly.
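
The detect-then-recommend loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `detect_missing` and `recommend_imputer`, the two candidate imputers, the hold-out scoring rule, and the sample rows are all assumptions made for illustration. The sketch counts missing values per column, then scores each candidate by hiding some observed values and measuring how far the imputed value falls from them.

```python
import random
import statistics

# Hypothetical candidate imputers; the paper's actual candidate set is not listed here.
CANDIDATES = {"mean": statistics.mean, "median": statistics.median}


def detect_missing(rows):
    """Count missing (None) entries per column in a list-of-dicts table."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] = counts.get(col, 0) + 1
    return counts


def recommend_imputer(values, holdout_frac=0.3, seed=0):
    """Score each candidate on held-out observed values; lower error is better."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    rng = random.Random(seed)
    rng.shuffle(observed)
    # Hold out at least one value, but always keep at least one to fit on.
    k = min(len(observed) - 1, max(1, int(len(observed) * holdout_frac)))
    held_out, kept = observed[:k], observed[k:]
    scores = {}
    for name, estimator in CANDIDATES.items():
        fill = estimator(kept)
        # Mean absolute error between the fill value and the hidden values.
        scores[name] = statistics.mean([abs(v - fill) for v in held_out])
    best = min(scores, key=scores.get)
    return best, scores


# Made-up sample data, for illustration only.
rows = [
    {"age": 23, "income": 41000},
    {"age": None, "income": 39000},
    {"age": 31, "income": None},
    {"age": 27, "income": 45000},
    {"age": 64, "income": None},
    {"age": 29, "income": 250000},
]

problems = detect_missing(rows)
recommendations = {
    col: recommend_imputer([row[col] for row in rows])[0] for col in problems
}
```

Here `problems` maps each affected column to its missing-value count, and `recommendations` maps each of those columns to the name of the lowest-error candidate. A fuller pipeline would presumably score candidates by downstream model performance, as the abstract's evaluation suggests, rather than by this simple hold-out error.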

Pages: 107764-107784

Page count: 21
