A game-based framework for crowdsourced data labeling

Cited by: 5

Authors:

Yang, Jingru [1]

Fan, Ju [1]

Wei, Zhewei [1]

Li, Guoliang [2]

Liu, Tongyu [1]

Du, Xiaoyong [1]

Affiliations:

[1] Renmin Univ China, Beijing 100872, Peoples R China

[2] Tsinghua Univ, Beijing 100084, Peoples R China

Source:

VLDB JOURNAL | 2020 / Volume 29 / Issue 06

Keywords:

Crowdsourcing; Data labeling; Labeling rules;

DOI:

10.1007/s00778-020-00613-w

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Data labeling, which assigns data with multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem that aims to generate high-quality rules to largely reduce the labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach CrowdGame to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: One group answers rule validation tasks (whether a rule is valid) to play a role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) to play a role of rule refuter. We let the two groups play a two-player game: Rule generator identifies high-quality rules with large coverage, while rule refuter tries to refute its opponent rule generator by checking some tuples that provide enough evidence to reject rules with low accuracy. This paper studies the challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy. We define the loss of a rule by considering the two factors. The second is rule accuracy estimation. We utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks to fulfill the game-based framework for minimizing the loss. We introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.

Citation

Pages: 1311-1336

Page count: 26

