The reusable holdout: Preserving validity in adaptive data analysis

Cited by: 189
Authors
Dwork, Cynthia [1]
Feldman, Vitaly [2]
Hardt, Moritz [3]
Pitassi, Toniann [4]
Reingold, Omer [5]
Roth, Aaron [6]
Affiliations
[1] Microsoft Res, Mountain View, CA 94043 USA
[2] IBM Almaden Res Ctr, San Jose, CA 95120 USA
[3] Google Res, Mountain View, CA 94043 USA
[4] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 3G4, Canada
[5] Samsung Res Amer, Mountain View, CA 94043 USA
[6] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
Funding
US National Science Foundation; Natural Sciences and Engineering Research Council of Canada
Keywords
FALSE DISCOVERY; STABILITY
DOI
10.1126/science.aaa9375
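Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09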
Abstract
Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses.
Pages: 636-638
Number of pages: 3