Auto-Prep: Efficient and Automated Data Preprocessing Pipeline

Cited by: 14
Authors
Bilal, Mehwish [1]
Ali, Ghulam [1]
Iqbal, Muhammad Waseem [2]
Anwar, Muhammad [3]
Malik, Muhammad Sheraz Arshad [4]
Kadir, Rabiah Abdul [5]
Affiliations
[1] Univ Okara, Dept Comp Sci, Okara 56300, Pakistan
[2] Super Univ Lahore, Dept Software Engn, Lahore 54000, Pakistan
[3] Univ Educ, Dept Informat Sci, Div Sci & Technol, Lahore 54000, Pakistan
[4] Govt Coll Univ Faisalabad, Dept Informat Technol, Faisalabad 38000, Pakistan
[5] Univ Kebangsaan Malaysia, Inst IR4.0, Bangi 43600, Selangor, Malaysia
Keywords
Encoding; Data preprocessing; Machine learning; Feature extraction; Data models; Dimensionality reduction; Support vector machines; Pipelines; Automated machine learning; data preprocessing; feature engineering; DIMENSIONALITY; IMPUTATION; SELECTION;
DOI
10.1109/ACCESS.2022.3198662
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data preprocessing is crucial in the machine learning (ML) pipeline because the quality of the data, and of the underlying information extracted at this stage, directly affects a model's learning ability. However, many alternative techniques exist for each transformation task, which can overwhelm an inexperienced user. A simple Python-based auto-preprocessing architecture for Automated Machine Learning is developed to offer automated, interactive, and data-driven support that helps users perform data preprocessing tasks efficiently. The proposed method provides valuable insights into a dataset and handles standard data preprocessing tasks. First, it detects data problems and presents them to the end user through visualizations. Then, after evaluating state-of-the-art candidate techniques, it recommends the most effective data cleaning and preparation method to the user. For evaluation, the proposed architecture is applied to ten diverse datasets to preprocess the data automatically before passing it to an ML algorithm. The results are then compared with those produced by the same ML algorithm on manually preprocessed data. The results show that this approach not only simplifies the whole process but also significantly improves model performance.
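The detect-then-recommend loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, which this record does not show. The function names (`detect_missing`, `recommend_imputer`, `impute`) are hypothetical, only missing-value imputation is covered, and the candidates are scored by how well they reconstruct deliberately hidden values rather than by downstream model performance, which is a simpler stand-in for the evaluation the abstract describes.

```python
import random
import statistics


def detect_missing(column):
    """Return the fraction of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


# Hypothetical candidate strategies: each maps observed values to one fill value.
CANDIDATES = {
    "mean": statistics.mean,
    "median": statistics.median,
}


def recommend_imputer(column, holdout_fraction=0.2, seed=0):
    """Pick the candidate whose fill value best reconstructs hidden values.

    Assumes the column has at least two observed (non-None) numeric values.
    """
    observed = [v for v in column if v is not None]
    rng = random.Random(seed)
    rng.shuffle(observed)
    # Hide a share of the observed values to act as ground truth.
    n_hidden = max(1, int(len(observed) * holdout_fraction))
    hidden, visible = observed[:n_hidden], observed[n_hidden:]
    scores = {}
    for name, strategy in CANDIDATES.items():
        fill = strategy(visible)
        # Mean absolute error between the fill value and the hidden values.
        scores[name] = statistics.mean([abs(v - fill) for v in hidden])
    best = min(scores, key=scores.get)
    return best, scores


def impute(column, strategy_name):
    """Replace every missing entry with the chosen strategy's fill value."""
    observed = [v for v in column if v is not None]
    fill = CANDIDATES[strategy_name](observed)
    return [fill if v is None else v for v in column]


if __name__ == "__main__":
    column = [1.0, 2.0, None, 3.0, 100.0, None, 2.5]
    print("missing fraction:", detect_missing(column))
    best, scores = recommend_imputer(column)
    print("recommended:", best, scores)
    print("imputed:", impute(column, best))
```

A fuller version in the spirit of the abstract would score each candidate by the cross-validated performance of the downstream ML model and would cover other tasks such as encoding and scaling.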
Pages: 107764-107784
Page count: 21