Learning from Untrusted Data

被引:162
作者
Charikar, Moses [1 ]
Steinhardt, Jacob [1 ]
Valiant, Gregory [1 ]
机构
[1] Stanford Univ, Stanford, CA 94305 USA
来源
STOC'17: PROCEEDINGS OF THE 49TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING | 2017年
基金
美国国家科学基金会;
关键词
outlier removal; robust learning; high-dimensional statistics; HALFSPACES;
D O I
10.1145/3055399.3055491
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle to the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points for which an unknown subset of an points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - alpha) n points, is it possible to return a list of poly(1/alpha) answers? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an alpha-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.
引用
收藏
页码:47 / 60
页数:14
相关论文
共 46 条
[1]  
Achlioptas D., 2005, COLT
[2]  
Agarwal N., 2015, MULTISECTION STOCHAS
[3]  
[Anonymous], 2015, COMMUNITY DETECTION
[4]  
[Anonymous], 2006, BOOK REV IEEE T NEUR
[5]  
[Anonymous], COLT
[6]  
[Anonymous], 2015, ADV NEURAL INF PROCE
[7]  
[Anonymous], DETECTION STOCHASTIC
[8]  
[Anonymous], 2009, Wiley Series in Probability and Statistics, DOI DOI 10.1002/9780470434697.CH7
[9]  
Awasthi Pranjal, 2012, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Proceedings 15th International Workshop, APPROX 2012 and 16th International Workshop, RANDOM 2012, P37, DOI 10.1007/978-3-642-32512-0_4
[10]   The Power of Localization for Efficiently Learning Linear Separators with Noise [J].
Awasthi, Pranjal ;
Balcan, Maria Florina ;
Long, Philip M. .
STOC'14: PROCEEDINGS OF THE 46TH ANNUAL 2014 ACM SYMPOSIUM ON THEORY OF COMPUTING, 2014, :449-458