Progressive Ensemble Learning for in-Sample Data Cleaning

Cited by: 0

Authors:

Wang, Jung-Hua [1,2]

Lee, Shih-Kai [1]

Wang, Ting-Yuan [3]

Chen, Ming-Jer [4]

Hsu, Shu-Wei [1]

Affiliations:

[1] Natl Taiwan Ocean Univ, Dept Elect Engn, Keelung 20224, Taiwan

[2] Natl Taiwan Ocean Univ, AI Res Ctr, Keelung 20224, Taiwan

[3] Ind Technol Res Inst ITRI, Hsinchu 310401, Taiwan

[4] Natl Yang Ming Chiao Tung Univ, Dept Obstet Gynecol & Womens Hlth, Taipei 112, Taiwan

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Training; Data models; Cleaning; Noise measurement; Image classification; Complexity theory; Training data; Ensemble learning; Data integrity; Transfer learning; Convolutional neural networks; Noisy data; ensemble learning; data cleanliness; image classification; true labels;

DOI:

10.1109/ACCESS.2024.3468035

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

We present an ensemble learning-based data cleaning approach (termed ELDC) capable of identifying and pruning anomalous data. ELDC is characterized in that an ensemble of base models can be trained directly on the noisy in-sample data and can dynamically provide clean data during iterative training. Each base model uses a random subset of the target dataset, which may initially contain up to 40% label errors. Following each training iteration, anomalous data are distinguished from clean data by a majority voting scheme, and three different types of anomaly (mislabeled, confusing, and outlier samples) can be identified using a statistical pattern jointly determined by the prediction outputs of the base models. By iterating this train-vote-remove cycle, noisy in-sample data are progressively removed until a prespecified condition is reached. Comprehensive experiments, including out-of-sample data tests, are conducted to verify the effectiveness of ELDC in simultaneously suppressing the bias and variance of the prediction output. The ELDC framework is highly flexible, as it is not bound to a specific model and allows different transfer-learning configurations. AlexNet, ResNet50, and GoogleNet are used as base models and trained on various benchmark datasets; the results show that ELDC outperforms state-of-the-art cleaning methods.
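The train-vote-remove cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy nearest-centroid classifier in place of the CNN base models, a strict-majority disagreement rule for flagging samples, and a stopping condition of "no sample flagged or a round limit reached". All function and parameter names (`fit_centroids`, `predict`, `train_vote_remove`, `subset_frac`, `max_rounds`) are hypothetical, and the sketch does not distinguish the three anomaly types.

```python
import random


def fit_centroids(points, labels):
    """Toy base model (assumption): per-class mean of the training points."""
    sums, counts = {}, {}
    for (x, y), lab in zip(points, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is nearest to the point."""
    x, y = point
    return min(centroids,
               key=lambda lab: (x - centroids[lab][0]) ** 2
               + (y - centroids[lab][1]) ** 2)


def train_vote_remove(points, labels, n_models=5, subset_frac=0.8,
                      max_rounds=10, seed=0):
    """Iterate train -> vote -> remove; return (kept, removed) index lists."""
    rng = random.Random(seed)
    kept = list(range(len(points)))
    removed = []
    for _ in range(max_rounds):
        # Train: each base model sees a random subset of the kept (noisy) data.
        models = []
        for _ in range(n_models):
            k = max(1, round(subset_frac * len(kept)))
            subset = rng.sample(kept, k)
            models.append(fit_centroids([points[i] for i in subset],
                                        [labels[i] for i in subset]))
        # Vote: flag a sample when most base models contradict its given label.
        flagged = [i for i in kept
                   if sum(predict(m, points[i]) != labels[i] for m in models)
                   > n_models / 2]
        if not flagged:  # assumed stopping condition: nothing left to prune
            break
        # Remove: drop flagged samples before the next round.
        flagged_set = set(flagged)
        kept = [i for i in kept if i not in flagged_set]
        removed.extend(flagged)
    return kept, removed


# Toy data: two tight clusters, 20 points each, with 20% of labels flipped.
offsets = [(dx * 0.25, dy * 0.25) for dx in range(-2, 3) for dy in range(-2, 2)]
points = offsets + [(10 + ox, 10 + oy) for ox, oy in offsets]
true_labels = [0] * 20 + [1] * 20
flipped = [0, 5, 10, 15, 20, 25, 30, 35]
noisy_labels = list(true_labels)
for i in flipped:
    noisy_labels[i] = 1 - noisy_labels[i]

kept, removed = train_vote_remove(points, noisy_labels)
```

On this toy data, `removed` holds the indices whose given label the ensemble majority rejected, and `kept` holds the samples retained for further training. With real image data, each base model would be a separately trained network, and the voting statistics would need to be richer to separate mislabeled, confusing, and outlier samples.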


Pages: 140643-140659

Number of pages: 17

50 items in total

[1] Progressive Ensemble Kernel-Based Broad Learning System for Noisy Data Classification
Yu, Zhiwen
Lan, Kankan
Liu, Zhulin
Han, Guoqiang
IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (09) : 9656 - 9669
[2] Noise Avoidance SMOTE in Ensemble Learning for Imbalanced Data
Kim, Kyoungok
IEEE ACCESS, 2021, 9 : 143250 - 143265
[3] Training Data Subset Search With Ensemble Active Learning
Chitta, Kashyap
Alvarez, Jose M.
Haussmann, Elmar
Farabet, Clement
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (09) : 14741 - 14752
[4] IncepX-Ensemble: Performance Enhancement Based on Data Augmentation and Hybrid Learning for Recycling Transparent PET Bottles
Chatterjee, Subhajit
Hazra, Debapriya
Byun, Yung-Cheol
IEEE ACCESS, 2022, 10 : 52280 - 52293
[5] Progressive subspace ensemble learning
Yu, Zhiwen
Wang, Daxing
You, Jane
Wong, Hau-San
Wu, Si
Zhang, Jun
Han, Guoqiang
PATTERN RECOGNITION, 2016, 60 : 692 - 705
[6] Progressive Transfer Learning
Yu, Zhengxu
Shen, Dong
Jin, Zhongming
Huang, Jianqiang
Cai, Deng
Hua, Xian-Sheng
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1340 - 1348
[7] A Transfer Ensemble Learning Method for Evaluating Power Transformer Health Conditions With Limited Measurement Data
Lin, Jun
Ma, Jin
Zhu, Jian Guo
Cui, Yu
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
[8] A Magnetotelluric Data Denoising Method Based on Lightweight Ensemble Learning
Ji, Mingjie
Chen, Huang
Zhang, Chao
Yu, Nian
Kong, Wenxin
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 13
[9] Ensemble Learning Using Individual Neonatal Data for Seizure Detection
Borovac, Ana
Gudmundsson, Steinn
Thorvardsson, Gardar
Moghadam, Saeed M.
Nevalainen, Paivi
Stevenson, Nathan
Vanhatalo, Sampsa
Runarsson, Thomas P.
IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE, 2022, 10
[10] RSPCA: Random Sample Partition and Clustering Approximation for ensemble learning of big data
Mahmud, Mohammad Sultan
Zheng, Hua
Garcia-Gil, Diego
Garcia, Salvador
Huang, Joshua Zhexue
PATTERN RECOGNITION, 2025, 161
