A game-based framework for crowdsourced data labeling

Cited by: 5
Authors
Yang, Jingru [1]
Fan, Ju [1]
Wei, Zhewei [1]
Li, Guoliang [2]
Liu, Tongyu [1]
Du, Xiaoyong [1]
Affiliations
[1] Renmin Univ China, Beijing 100872, Peoples R China
[2] Tsinghua Univ, Beijing 100084, Peoples R China
Keywords
Crowdsourcing; Data labeling; Labeling rules
DOI
10.1007/s00778-020-00613-w
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data labeling, which assigns each data item to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that greatly reduce labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) and plays the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) and plays the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is balancing the trade-off between coverage and accuracy; we define the loss of a rule in terms of these two factors. The second is rule accuracy estimation; we use Bayesian estimation to combine the results of both rule validation and tuple checking tasks. The third is selecting crowdsourcing tasks that fulfill the game-based framework while minimizing the loss; we introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.
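The abstract does not give the formulas, so the following is only an illustrative sketch of how rule validation answers and tuple checking answers could be combined in a Bayesian estimate of rule accuracy, and how coverage and accuracy could enter a loss. The Beta-Bernoulli model, the names `RuleEvidence`, `estimate_accuracy`, and `rule_loss`, and the weights `prior_strength` and `gamma` are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RuleEvidence:
    """Crowd answers collected for one candidate rule (hypothetical structure)."""
    validation_yes: int   # workers who judged the rule valid
    validation_no: int    # workers who judged the rule invalid
    tuples_correct: int   # checked tuples whose rule-assigned label was correct
    tuples_wrong: int     # checked tuples whose rule-assigned label was wrong


def estimate_accuracy(ev: RuleEvidence, prior_strength: float = 1.0) -> float:
    """Posterior mean of rule accuracy under an assumed Beta-Bernoulli model.

    Starts from a uniform Beta(1, 1) prior, treats each rule validation answer
    as `prior_strength` pseudo-observations, and treats each checked tuple as
    one direct observation of the rule being right or wrong.
    """
    alpha = 1.0 + prior_strength * ev.validation_yes + ev.tuples_correct
    beta = 1.0 + prior_strength * ev.validation_no + ev.tuples_wrong
    return alpha / (alpha + beta)


def rule_loss(coverage: float, accuracy: float, gamma: float = 1.0) -> float:
    """Assumed loss: fraction left uncovered plus weighted fraction mislabeled."""
    return (1.0 - coverage) + gamma * coverage * (1.0 - accuracy)


if __name__ == "__main__":
    evidence = RuleEvidence(validation_yes=2, validation_no=0,
                            tuples_correct=7, tuples_wrong=1)
    acc = estimate_accuracy(evidence)
    print(f"estimated accuracy: {acc:.3f}")
    print(f"loss at 50% coverage: {rule_loss(0.5, acc):.3f}")
```

In this sketch, a rule that covers many tuples but is refuted by checked tuples sees its estimated accuracy fall and its loss rise, which mirrors the generator-versus-refuter tension described in the abstract.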
Pages: 1311-1336
Page count: 26
Related papers
50 in total
  • [1] A game-based framework for crowdsourced data labeling
    Jingru Yang
    Ju Fan
    Zhewei Wei
    Guoliang Li
    Tongyu Liu
    Xiaoyong Du
    The VLDB Journal, 2020, 29 : 1311 - 1336
  • [2] Optimizing multimedia and gameplay data labeling: A web-based tool for Game-Based Assessment
    Gomez, Manuel J.
    Ruiperez-Valiente, Jose A.
    Clemente, Felix J. Garcia
    SOFTWAREX, 2024, 27
  • [3] Validating Generic Metrics of Fairness in Game-Based Resource Allocation Scenarios with Crowdsourced Annotations
    Grappiolo, Corrado
    Martinez, Hector P.
    Yannakakis, Georgios N.
    TRANSACTIONS ON COMPUTATIONAL COLLECTIVE INTELLIGENCE XIII, 2014, 8342 : 176 - 200
  • [4] Validating generic metrics of fairness in game-based resource allocation scenarios with crowdsourced annotations
    Grappiolo, Corrado
    Martínez, Héctor P.
    Yannakakis, Georgios N.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8342 : 176 - 200
  • [5] A System Design Perspective for Business Growth in a Crowdsourced Data Labeling Practice
    Hajipour, Vahid
    Jalali, Sajjad
    Santos-Arteaga, Francisco Javier
    Vazifeh Noshafagh, Samira
    Di Caprio, Debora
    ALGORITHMS, 2024, 17 (08)
  • [6] Game Aspect: An Approach to Separation of Concerns in Crowdsourced Data Management
    Fukusumi, Shun
    Morishima, Atsuyuki
    Kitagawa, Hiroyuki
    ADVANCED INFORMATION SYSTEMS ENGINEERING, CAISE 2015, 2015, 9097 : 3 - 19
  • [7] A Smart Physical World Based on Service Technologies, Big Data, and Game-Based Crowd Sourcing
    Yen, I-Ling
    Zhou, Guang
    Zhu, Wei
    Bastani, Farokh
    Hwang, San-Yih
    2015 IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES (ICWS), 2015, : 765 - 772
  • [8] CyLog/Game aspect: An approach to separation of concerns in crowdsourced data management
    Morishima, Atsuyuki
    Fukusumi, Shun
    Kitagawa, Hiroyuki
    INFORMATION SYSTEMS, 2016, 62 : 170 - 184
  • [9] EFFICIENT WORKER ASSIGNMENT IN CROWDSOURCED DATA LABELING USING GRAPH SIGNAL PROCESSING
    Maroto, Javier
    Ortega, Antonio
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 2271 - 2275
  • [10] Active Learning Based on Crowdsourced Data
    Boinski, Tomasz Maria
    Szymanski, Julian
    Krauzewicz, Agata
    APPLIED SCIENCES-BASEL, 2022, 12 (01)