Progressive Ensemble Learning for in-Sample Data Cleaning

Cited by: 0
Authors
Wang, Jung-Hua [1 ,2 ]
Lee, Shih-Kai [1 ]
Wang, Ting-Yuan [3 ]
Chen, Ming-Jer [4 ]
Hsu, Shu-Wei [1 ]
Affiliations
[1] Natl Taiwan Ocean Univ, Dept Elect Engn, Keelung 20224, Taiwan
[2] Natl Taiwan Ocean Univ, AI Res Ctr, Keelung 20224, Taiwan
[3] Ind Technol Res Inst ITRI, Hsinchu 310401, Taiwan
[4] Natl Yang Ming Chiao Tung Univ, Dept Obstet Gynecol & Womens Hlth, Taipei 112, Taiwan
Keywords
Training; Data models; Cleaning; Noise measurement; Image classification; Complexity theory; Training data; Ensemble learning; Data integrity; Transfer learning; Convolutional neural networks; Noisy data; ensemble learning; data cleanliness; image classification; true labels;
DOI
10.1109/ACCESS.2024.3468035
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We present an ensemble learning-based data cleaning approach (termed ELDC) capable of identifying and pruning anomalous data. In ELDC, an ensemble of base models is trained directly on the noisy in-sample data and dynamically provides clean data during iterative training. Each base model uses a random subset of the target dataset, which may initially contain up to 40% label errors. After each training iteration, anomalous data are distinguished from clean data by a majority voting scheme, and three types of anomaly (mislabeled, confusing, and outlier samples) are identified using a statistical pattern jointly determined by the prediction outputs of the base models. By iterating this train-vote-remove cycle, noisy in-sample data are progressively removed until a prespecified condition is reached. Comprehensive experiments, including out-of-sample data tests, are conducted to verify the effectiveness of ELDC in simultaneously suppressing the bias and variance of the prediction output. The ELDC framework is highly flexible, as it is not bound to a specific model and allows different transfer-learning configurations. AlexNet, ResNet50, and GoogLeNet are used as base models and trained on various benchmark datasets; the results show that ELDC outperforms state-of-the-art cleaning methods.
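The train-vote-remove cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the base model here is a toy nearest-centroid classifier on 1-D features standing in for a CNN, the removal rule is a plain majority-vote disagreement check rather than the paper's statistical pattern for the three anomaly types, and all names and parameters (`train_vote_remove`, `n_models`, `subset_frac`, `max_rounds`) are hypothetical.

```python
import random
from collections import Counter


def train_centroids(samples):
    """Toy base model (assumption): per-class mean of 1-D features."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def train_vote_remove(data, n_models=5, subset_frac=0.7, max_rounds=10, seed=0):
    """Iteratively drop samples whose given label disagrees with the
    majority vote of an ensemble trained on random subsets."""
    rng = random.Random(seed)
    kept = list(data)
    for _ in range(max_rounds):
        # Train: each base model sees its own random subset of the kept data.
        k = max(1, int(subset_frac * len(kept)))
        models = [train_centroids(rng.sample(kept, k)) for _ in range(n_models)]
        # Vote: compare each sample's label with the ensemble majority.
        survivors = []
        for x, y in kept:
            votes = Counter(predict(m, x) for m in models)
            majority_label, _ = votes.most_common(1)[0]
            if majority_label == y:
                survivors.append((x, y))
        # Stop (assumed condition): a round that removes nothing.
        if len(survivors) == len(kept):
            break
        # Remove: continue with the samples that survived the vote.
        kept = survivors
    return kept


if __name__ == "__main__":
    # Two well-separated classes plus four deliberately flipped labels.
    clean = [(i / 10, 0) for i in range(20)] + [(10 + i / 10, 1) for i in range(20)]
    noisy = [(0.05, 1), (0.95, 1), (10.05, 0), (11.05, 0)]
    cleaned = train_vote_remove(clean + noisy)
    print(len(cleaned), "samples kept")
```

The stopping rule used here (a round with no removals, or `max_rounds`) is only one possible reading of the "prespecified condition"; the abstract does not state which condition the authors use.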
Pages: 140643-140659
Number of pages: 17