GOGGLES: Automatic Image Labeling with Affinity Coding

Cited by: 22
Authors
Das, Nilaksh [1]
Chaba, Sanya [1]
Wu, Renzhi [1]
Gandhi, Sakshi [1]
Chau, Duen Horng [1]
Chu, Xu [1]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
Source
SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2020
Keywords
affinity coding; probabilistic labels; data programming; weak supervision; computer vision; image labeling
DOI
10.1145/3318464.3380592
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Generating large labeled training datasets is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions, which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound.
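The core premise stated in the abstract can be illustrated with a minimal sketch. It is not the GOGGLES implementation: cosine similarity stands in for the paper's image affinity functions, the two-dimensional feature vectors are synthetic, and the helper names `cosine_affinity` and `mean_affinities` are hypothetical. The sketch only checks that same-class pairs score higher on average than different-class pairs.

```python
import numpy as np


def cosine_affinity(features):
    """Pairwise cosine similarity between the rows of `features`."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T


def mean_affinities(affinity, labels):
    """Mean affinity over same-class pairs and over different-class pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-pairs
    within = affinity[same & off_diag].mean()
    between = affinity[~same].mean()
    return within, between


# Synthetic data (an assumption): two classes clustered around different directions.
rng = np.random.default_rng(0)
class_a = np.array([1.0, 0.0]) + 0.1 * rng.standard_normal((5, 2))
class_b = np.array([0.0, 1.0]) + 0.1 * rng.standard_normal((5, 2))
features = np.vstack([class_a, class_b])
labels = [0] * 5 + [1] * 5

within, between = mean_affinities(cosine_affinity(features), labels)
print(f"within-class mean affinity:  {within:.3f}")
print(f"between-class mean affinity: {between:.3f}")
```

When the premise holds for an affinity function, as it does for this toy data, the resulting affinity matrix carries class signal even though no labels were used to compute it. According to the abstract, the system then infers classes from such scores with a hierarchical generative model and a small development set; that step is not shown here.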
Pages: 1717-1732
Page count: 16