Preserving Statistical Validity in Adaptive Data Analysis

被引:157
作者
Dwork, Cynthia [1 ]
Feldman, Vitaly [2 ]
Hardt, Moritz [2 ]
Pitassi, Toniann [3 ]
Reingold, Omer [4 ]
Roth, Aaron [5 ]
机构
[1] Microsoft Res, Redmond, WA 98052 USA
[2] IBM Res Almaden, San Jose, CA USA
[3] Univ Toronto, Toronto, ON, Canada
[4] Samsung Res Amer, Mountain View, CA USA
[5] Univ Penn, Philadelphia, PA 19104 USA
来源
STOC'15: PROCEEDINGS OF THE 2015 ACM SYMPOSIUM ON THEORY OF COMPUTING | 2015年
关键词
STABILITY;
D O I
10.1145/2746539.2746580
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
引用
收藏
页码:117 / 126
页数:10
相关论文
共 39 条
[1]   Generalized α-investing: definitions, optimality results and application to public databases [J].
Aharoni, Ehud ;
Rosset, Saharon .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2014, 76 (04) :771-794
[2]   The Quality Preserving Database: A Computational Framework for Encouraging Collaboration, Enhancing Power and Controlling False Discovery [J].
Aharoni, Ehud ;
Neuvirth, Hani ;
Rosset, Saharon .
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2011, 8 (05) :1431-1437
[3]  
[Anonymous], CORR
[4]  
[Anonymous], 2006, NIPS
[5]  
[Anonymous], 2013, THESIS COLUMBIA U NE
[6]  
[Anonymous], 2003, P 22 ACM SIGMOD SIGA
[7]  
Bassily Raef., 2014, CoRR
[8]   Raise standards for preclinical cancer research [J].
Begley, C. Glenn ;
Ellis, Lee M. .
NATURE, 2012, 483 (7391) :531-533
[9]   CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING [J].
BENJAMINI, Y ;
HOCHBERG, Y .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) :289-300
[10]  
Blum Avrim, 2005, P 24 ACM SIGMOD SIGA, P128, DOI [DOI 10.1145/1065167.1065184, 10.1145/1065167.1065184]