Label noise correction for crowdsourcing using dynamic resampling

Cited by: 1
Authors
Zhang, Jing [1, 2]
Jiang, Xiaoqian [1, 2]
Tian, Nianshang [3]
Wu, Ming [4]
Affiliations
[1] Southeast Univ, Sch Cyber Sci & Engn, 2 SEU Rd, Nanjing 211189, Peoples R China
[2] Southeast Univ, Engn Res Ctr Blockchain Applicat Supervis & Manage, Minist Educ, 2 SEU Rd, Nanjing 211189, Peoples R China
[3] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, 200 Xiaolingwei St, Nanjing 210094, Peoples R China
[4] Hohai Univ, Sch Artificial Intelligence & Automat, Nanjing 211100, Peoples R China
Keywords
Crowdsourcing learning; Label noise correction; Label integration; Dynamic resampling; Ensemble learning; MODEL QUALITY
DOI
10.1016/j.engappai.2024.108439
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Crowdsourcing provides a cost-effective way to acquire labeled training samples for machine learning by employing workers on the Internet. A common approach to improving label quality is to apply a truth inference method that infers an integrated label for each sample from the multiple noisy labels provided by different crowd workers. Although the integrated labels are of significantly higher quality than the original noisy ones, truth inference cannot completely eliminate the noise that remains in them. To further improve label quality, this paper proposes a novel label noise correction method for crowdsourcing based on dynamic resampling (DRNC). DRNC first divides the dataset with inferred labels into a clean set and a noisy set through a filter. The clean set and the noisy set are then resampled according to a certain proportion to train multiple heterogeneous classifiers, which form an ensemble classifier. The ensemble classifier in turn divides the dataset into a new sub-noisy set and a new sub-clean set. The whole process is repeated for multiple rounds, generating multiple sub-clean sets. Finally, these sub-clean sets are used to train classifiers, which jointly correct the wrong labels in the dataset by voting. Experimental results on 25 simulated and 4 real-world datasets consistently show that, compared with four state-of-the-art crowdsourcing noise correction methods, the proposed DRNC improves both the quality of labels and the quality of learned models by 1 to 10 percentage points on average.
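The pipeline outlined in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the abstract does not specify the filter, the classifiers, the resampling proportions, or the voting rule, so everything below is an assumption made for illustration. The names `majority_vote`, `centroid_classifier`, `knn_classifier`, and `drnc_sketch`, the parameters `rounds`, `clean_frac`, and `noisy_frac`, and the toy data are all hypothetical. Majority voting stands in for truth inference, a nearest-centroid model stands in for the initial filter, and the final step simply relabels every sample by vote.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Stand-in for truth inference: integrate crowd labels by majority."""
    return Counter(labels).most_common(1)[0][0]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def centroid_classifier(X, y):
    """Nearest-centroid classifier; returns a predict(x) function."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    cents = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    return lambda x: min(cents, key=lambda c: sq_dist(x, cents[c]))

def knn_classifier(X, y, k):
    """k-nearest-neighbour classifier; returns a predict(x) function."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: sq_dist(x, X[i]))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    return predict

def drnc_sketch(X, y, rounds=3, clean_frac=0.8, noisy_frac=0.2, seed=0):
    rng = random.Random(seed)
    idx = range(len(X))
    # Assumed filter: a sample is "clean" if a nearest-centroid model
    # trained on all integrated labels agrees with its label.
    base = centroid_classifier(X, y)
    clean = [i for i in idx if base(X[i]) == y[i]]
    noisy = [i for i in idx if base(X[i]) != y[i]]
    makers = (centroid_classifier,
              lambda A, b: knn_classifier(A, b, 1),
              lambda A, b: knn_classifier(A, b, 3))
    voters = []
    for _ in range(rounds):
        if not clean:
            break
        members = []
        for make in makers:  # heterogeneous members, each on its own resample
            pick = rng.sample(clean, max(1, int(clean_frac * len(clean))))
            pick += rng.sample(noisy, int(noisy_frac * len(noisy)))
            members.append(make([X[i] for i in pick], [y[i] for i in pick]))
        def ensemble(x, members=members):
            return Counter(m(x) for m in members).most_common(1)[0][0]
        # Re-split the whole dataset with the ensemble.
        clean = [i for i in idx if ensemble(X[i]) == y[i]]
        noisy = [i for i in idx if ensemble(X[i]) != y[i]]
        if clean:  # one classifier per sub-clean set
            voters.append(centroid_classifier([X[i] for i in clean],
                                              [y[i] for i in clean]))
    if not voters:
        return list(y)
    # Assumed correction rule: relabel every sample by voting.
    return [Counter(v(X[i]) for v in voters).most_common(1)[0][0] for i in idx]

# Toy data: two well-separated clusters; sample 4 gets a wrong integrated label.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
crowd = [[0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0],
         [1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1]]
y = [majority_vote(votes) for votes in crowd]
corrected = drnc_sketch(X, y)
```

On this toy data the integrated label of the fifth sample is wrong, and the sketch flips it back to the label of its cluster. The behaviour on realistic data depends entirely on the assumed filter and classifiers, so the sketch says nothing about the performance reported in the abstract.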
Pages: 11
Related papers
51 in total
[1]   Identifying mislabeled training data [J].
Brodley, CE ;
Friedl, MA .
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1999, 11 :131-167
[2]  
Chen MC, 2023, AAAI CONF ARTIF INTE, P14765
[3]   A framework for label noise filters [J].
Chen, Qingqiang ;
Jiang, Gaoxia ;
Cao, Fuyuan ;
Men, Changqian ;
Wang, Wenjian .
PATTERN RECOGNITION, 2024, 147
[4]   Label augmented and weighted majority voting for crowdsourcing [J].
Chen, Ziqi ;
Jiang, Liangxiao ;
Li, Chaoqun .
INFORMATION SCIENCES, 2022, 606 :397-409
[5]   Label distribution-based noise correction for multiclass crowdsourcing [J].
Chen, Ziqi ;
Jiang, Liangxiao ;
Li, Chaoqun .
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (09) :5752-5767
[6]   Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions [J].
Daniel, Florian ;
Kucherbaev, Pavel ;
Cappiello, Cinzia ;
Benatallah, Boualem ;
Allahbakhsh, Mohammad .
ACM COMPUTING SURVEYS, 2018, 51 (01)
[7]  
Dawid, 1979, APPL STAT, V28, P20, DOI 10.2307/2346806
[8]  
Demartini G ..., 2012, P 21 INT C WORLD WID, P469
[9]   A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms [J].
Derrac, Joaquin ;
Garcia, Salvador ;
Molina, Daniel ;
Herrera, Francisco .
SWARM AND EVOLUTIONARY COMPUTATION, 2011, 1 (01) :3-18
[10]   Improving data and model quality in crowdsourcing using co-training-based noise correction [J].
Dong, Yu ;
Jiang, Liangxiao ;
Li, Chaoqun .
INFORMATION SCIENCES, 2022, 583 :174-188