A game-based framework for crowdsourced data labeling

Cited by: 5
Authors
Yang, Jingru [1]
Fan, Ju [1]
Wei, Zhewei [1]
Li, Guoliang [2]
Liu, Tongyu [1]
Du, Xiaoyong [1]
Affiliations
[1] Renmin Univ China, Beijing 100872, Peoples R China
[2] Tsinghua Univ, Beijing 100084, Peoples R China
Keywords
Crowdsourcing; Data labeling; Labeling rules
DOI
10.1007/s00778-020-00613-w
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data labeling, which assigns data to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that substantially reduce the labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) to play the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) to play the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy. We define the loss of a rule by considering the two factors. The second is rule accuracy estimation. We utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks to fulfill the game-based framework for minimizing the loss. We introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.
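The abstract does not give the loss or the estimator in closed form, so the following is only a minimal sketch of how the two ideas could fit together. It assumes a Beta-Bernoulli model in which rule validation votes act as prior pseudo-counts and tuple checking results act as observations, and a hypothetical loss that weights uncovered tuples against expected labeling errors. The names `RuleEvidence`, `posterior_accuracy`, `rule_loss`, and the weight `gamma` are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class RuleEvidence:
    """Crowd evidence gathered for one candidate rule (hypothetical structure)."""
    coverage: int         # number of tuples the rule would label
    validation_yes: int   # rule validation answers: "rule is valid"
    validation_no: int    # rule validation answers: "rule is not valid"
    tuples_correct: int   # tuple checks confirming the rule's label
    tuples_wrong: int     # tuple checks contradicting the rule's label


def posterior_accuracy(ev: RuleEvidence) -> float:
    """Posterior mean accuracy under an assumed Beta-Bernoulli model.

    Starts from a uniform Beta(1, 1) prior, adds validation votes as
    pseudo-counts, then updates with tuple checking outcomes.
    """
    alpha = 1.0 + ev.validation_yes + ev.tuples_correct
    beta = 1.0 + ev.validation_no + ev.tuples_wrong
    return alpha / (alpha + beta)


def rule_loss(ev: RuleEvidence, total_tuples: int, gamma: float = 0.5) -> float:
    """Assumed loss: a weighted sum of uncovered tuples and expected errors.

    gamma trades off coverage against accuracy; both terms are fractions
    of the whole dataset, so the loss lies in [0, 1].
    """
    covered = ev.coverage / total_tuples
    expected_errors = (1.0 - posterior_accuracy(ev)) * covered
    return gamma * (1.0 - covered) + (1.0 - gamma) * expected_errors


good = RuleEvidence(coverage=100, validation_yes=2, validation_no=1,
                    tuples_correct=7, tuples_wrong=0)
refuted = RuleEvidence(coverage=100, validation_yes=2, validation_no=1,
                       tuples_correct=2, tuples_wrong=5)
print(posterior_accuracy(good), rule_loss(good, total_tuples=200))
print(posterior_accuracy(refuted), rule_loss(refuted, total_tuples=200))
```

In this sketch, tuple checks that contradict a rule lower its posterior accuracy and raise its loss, which mirrors the refuter's role of supplying evidence against low-accuracy rules; the actual task selection in CrowdGame follows the paper's minimax strategy, which is not reproduced here.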
Pages: 1311 - 1336
Number of pages: 26
Related papers
50 in total
  • [41] Blockchain-based Reputation Management Framework for Crowdsourced Last-mile Delivery
    Kadadha, Maha
    Mizouni, Rabeb
    Singh, Shakti
    Otrok, Hadi
    Mourad, Azzam
    2023 INTERNATIONAL WIRELESS COMMUNICATIONS AND MOBILE COMPUTING, IWCMC, 2023, : 1244 - 1249
  • [42] A Pilot Study: A Statistical Analysis for the Crowdsourced Design Evaluation Results based on the cDesign Framework
    Wu, Hao
    Corney, Jonathan
    Gan, Jing
    PROCEEDINGS OF THE 2019 IEEE 23RD INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2019, : 295 - 300
  • [43] ARAPID: Towards Integrating Crowdsourced Playtesting into the Game Development Environment
    Paranthaman, Pratheep Kumar
    Cooper, Seth
    CHI PLAY'19: PROCEEDINGS OF THE ANNUAL SYMPOSIUM ON COMPUTER-HUMAN INTERACTION IN PLAY, 2019, : 121 - 133
  • [44] A Game Theory Approach for Estimating Reliability of Crowdsourced Relevance Assessments
    Moshfeghi, Yashar
    Huertas-Rosero, Alvaro Francisco
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2022, 40 (03)
  • [45] A Roughset Based Data Labeling Method for Clustering Categorical Data
    Reddy, H. Venkateswara
    Raju, S. Viswanadha
    2014 3RD INTERNATIONAL CONFERENCE ON ECO-FRIENDLY COMPUTING AND COMMUNICATION SYSTEMS (ICECCS 2014), 2014, : 51 - 55
  • [46] Maximizing benefits from crowdsourced data
    Barbier, Geoffrey
    Zafarani, Reza
    Gao, Huiji
    Fung, Gabriel
    Liu, Huan
    COMPUTATIONAL AND MATHEMATICAL ORGANIZATION THEORY, 2012, 18 (03) : 257 - 279
  • [47] Crowdsourced Data Management: Overview and Challenges
    Li, Guoliang
    Zheng, Yudian
    Fan, Ju
    Wang, Jiannan
    Cheng, Reynold
    SIGMOD'17: PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2017, : 1711 - 1716
  • [48] Maximizing benefits from crowdsourced data
    Geoffrey Barbier
    Reza Zafarani
    Huiji Gao
    Gabriel Fung
    Huan Liu
    Computational and Mathematical Organization Theory, 2012, 18 : 257 - 279
  • [49] A Trustworthiness Model for Crowdsourced and Crowdsensed Data
    Prandi, Catia
    Ferretti, Stefano
    Mirri, Silvia
    Salomoni, Paola
    2015 IEEE TRUSTCOM/BIGDATASE/ISPA, VOL 1, 2015, : 1261 - 1266
  • [50] Responsible processing of crowdsourced tourism data
    Leal, Fatima
    Malheiro, Benedita
    Veloso, Bruno
    Carlos Burguillo, Juan
    JOURNAL OF SUSTAINABLE TOURISM, 2021, 29 (05) : 774 - 794