Progressive Ensemble Learning for in-Sample Data Cleaning

Cited by: 0
Authors
Wang, Jung-Hua [1 ,2 ]
Lee, Shih-Kai [1 ]
Wang, Ting-Yuan [3 ]
Chen, Ming-Jer [4 ]
Hsu, Shu-Wei [1 ]
Affiliations
[1] Natl Taiwan Ocean Univ, Dept Elect Engn, Keelung 20224, Taiwan
[2] Natl Taiwan Ocean Univ, AI Res Ctr, Keelung 20224, Taiwan
[3] Ind Technol Res Inst ITRI, Hsinchu 310401, Taiwan
[4] Natl Yang Ming Chiao Tung Univ, Dept Obstet Gynecol & Womens Hlth, Taipei 112, Taiwan
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Training; Data models; Cleaning; Noise measurement; Image classification; Complexity theory; Training data; Ensemble learning; Data integrity; Transfer learning; Convolutional neural networks; Noisy data; ensemble learning; data cleanliness; image classification; true labels;
DOI
10.1109/ACCESS.2024.3468035
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
We present an ensemble learning-based data cleaning approach (termed ELDC) capable of identifying and pruning anomalous data. In ELDC, an ensemble of base models is trained directly on the noisy in-sample data and dynamically provides clean data during iterative training. Each base model uses a random subset of the target dataset, which may initially contain up to 40% label errors. After each training iteration, anomalous data are distinguished from clean data by a majority voting scheme, and three types of anomaly (mislabeled, confusing, and outlier samples) are identified using a statistical pattern jointly determined by the prediction outputs of the base models. By iterating this train-vote-remove cycle, noisy in-sample data are progressively removed until a prespecified condition is reached. Comprehensive experiments, including out-of-sample tests, are conducted to verify the effectiveness of ELDC in simultaneously suppressing the bias and variance of the prediction output. The ELDC framework is highly flexible, as it is not bound to a specific model and allows different transfer-learning configurations. AlexNet, ResNet50, and GoogleNet are used as base models and trained on various benchmark datasets; the results show that ELDC outperforms state-of-the-art cleaning methods.
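The train-vote-remove cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a nearest-centroid classifier stands in for the CNN base models, the stopping condition (no samples flagged, or an iteration cap) is an assumption, and only label disagreement is flagged, so the paper's distinction between mislabeled, confusing, and outlier samples is not modeled. All function and parameter names (`fit_centroids`, `train_vote_remove`, `subset_frac`, and so on) are hypothetical.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Stand-in "base model" (assumption): one centroid per class.
    cents = np.full((n_classes, X.shape[1]), np.inf)
    for c in range(n_classes):
        mask = y == c
        if mask.any():  # a class absent from the subset keeps an infinite centroid
            cents[c] = X[mask].mean(axis=0)
    return cents

def predict(cents, X):
    # Assign each sample to the nearest class centroid.
    dist = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def train_vote_remove(X, y, n_classes, n_models=5, subset_frac=0.7,
                      max_iters=5, seed=0):
    """Return indices of samples kept after iterative cleaning."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))
    for _ in range(max_iters):
        Xk, yk = X[keep], y[keep]
        votes = np.zeros((len(keep), n_classes), dtype=int)
        for _ in range(n_models):
            # Train: each base model sees a random subset of the noisy data.
            size = int(subset_frac * len(keep))
            idx = rng.choice(len(keep), size=size, replace=False)
            cents = fit_centroids(Xk[idx], yk[idx], n_classes)
            # Vote: every base model predicts a label for every kept sample.
            votes[np.arange(len(keep)), predict(cents, Xk)] += 1
        majority = votes.argmax(axis=1)
        has_majority = votes.max(axis=1) > n_models // 2
        # Remove: samples whose given label contradicts a strict majority.
        flagged = has_majority & (majority != yk)
        if not flagged.any():  # assumed stopping condition
            break
        keep = keep[~flagged]
    return keep

# Toy data: two well-separated Gaussian blobs with 20% of labels flipped.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-5.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
y_true = np.repeat([0, 1], 100)
y_noisy = y_true.copy()
flip = rng.choice(200, size=40, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

kept = train_vote_remove(X, y_noisy, n_classes=2)
print(len(kept), "samples kept;",
      int((y_noisy[kept] != y_true[kept]).sum()), "label errors remain")
```

On data this cleanly separated, a single pass is enough to flag the flipped labels; with real images and CNN base models, the number of iterations and the voting threshold would need tuning.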
Pages: 140643 - 140659
Number of pages: 17