Big Data and Large Sample Size: A Cautionary Note on the Potential for Bias

Cited by: 279
Authors
Kaplan, Robert M. [1, 2]
Chambers, David A. [3]
Glasgow, Russell E. [4]
Affiliations
[1] NIH, Off Behav & Social Sci Res, Bethesda, MD 20892 USA
[2] NIH, Dept Rehabil Med, Bethesda, MD 20892 USA
[3] NIMH, Div Serv & Intervent Res, Bethesda, MD 20892 USA
[4] Univ Colorado, Colorado Hlth Outcomes Program, Anschutz, CO USA
Source
CTS-CLINICAL AND TRANSLATIONAL SCIENCE | 2014 / Volume 7 / Issue 04
Keywords
big data; research methods; bias; sampling; CARDIOVASCULAR-DISEASE; NURSES HEALTH; THERAPY;
DOI
10.1111/cts.12178
Chinese Library Classification number
R-3 [Medical research methods]; R3 [Basic medicine];
Discipline classification code
1001;
Abstract
A number of commentaries have suggested that large studies are more reliable than smaller studies and there is a growing interest in the analysis of "big data" that integrates information from many thousands of persons and/or different data sources. We consider a variety of biases that are likely in the era of big data, including sampling error, measurement error, multiple comparisons errors, aggregation error, and errors associated with the systematic exclusion of information. Using examples from epidemiology, health services research, studies on determinants of health, and clinical trials, we conclude that it is necessary to exercise greater caution to be sure that big sample size does not lead to big inferential errors. Despite the advantages of big studies, large sample size can magnify the bias associated with error resulting from sampling or study design.
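The abstract's closing claim, that a large sample can make a biased estimate look more trustworthy, can be illustrated with a small simulation. The sketch below is not taken from the article: the function name `biased_sample_mean`, the normal population, and the selection shift of 0.2 are all hypothetical choices made for illustration. It draws samples whose expected value is offset from the true mean by a fixed sampling bias and reports a conventional 95% confidence interval. As n grows the interval narrows, but it narrows around the biased value rather than the true one.

```python
import math
import random


def biased_sample_mean(n, true_mean=0.0, sd=1.0, selection_shift=0.2, seed=0):
    """Return (sample mean, 95% CI half-width) for a systematically biased sample.

    Assumption for illustration only: the sampling process reaches units whose
    expected value is true_mean + selection_shift instead of true_mean.
    """
    rng = random.Random(seed)
    values = [rng.gauss(true_mean + selection_shift, sd) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = 1.96 * math.sqrt(variance / n)  # normal-approximation interval
    return mean, half_width


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        mean, hw = biased_sample_mean(n)
        print(f"n={n:>7}: estimate={mean:.3f}, 95% CI=({mean - hw:.3f}, {mean + hw:.3f})")
```

Under these assumptions the bias stays at roughly 0.2 for every n, while the interval width shrinks in proportion to 1/sqrt(n), so the largest sample gives the most confident statement about the wrong value.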
Pages: 342-346
Page count: 5