Linear regression and the normality assumption

Cited by: 336

Authors:

Schmidt, Amand F. [1,2,3]

Finan, Chris [1]

Affiliations:

[1] UCL, Inst Cardiovasc Sci, Fac Populat Hlth, London WC1E 6BT, England

[2] Univ Groningen, Groningen Res Inst Pharm, Groningen, Netherlands

[3] Univ Med Ctr Utrecht, Div Heart & Lungs, Dept Cardiol, Utrecht, Netherlands

Source:

JOURNAL OF CLINICAL EPIDEMIOLOGY | 2018 / Vol. 98

Keywords:

Epidemiological methods; Bias; Linear regression; Modeling assumptions; Statistical inference; Big data

DOI:

10.1016/j.jclinepi.2017.12.006

Chinese Library Classification (CLC) number:

R19 [Health care organization and services (health services management)]

Abstract:

Objectives: Researchers often perform arbitrary outcome transformations to fulfill the normality assumption of a linear regression model. This commentary explains and illustrates that in large data settings, such transformations are often unnecessary and, worse, may bias model estimates. Study Design and Setting: Linear regression assumptions are illustrated using simulated data and an empirical example on the relation between time since type 2 diabetes diagnosis and glycated hemoglobin levels. Simulation results were evaluated on coverage, i.e., the number of times the 95% confidence interval included the true slope coefficient. Results: Although outcome transformations bias point estimates, violations of the normality assumption in linear regression analyses do not. The normality assumption is necessary to unbiasedly estimate standard errors, and hence confidence intervals and P-values. However, in large sample sizes (e.g., where the number of observations per variable is >10), violations of this normality assumption often do not noticeably impact results. In contrast, assumptions on the parametric model, absence of extreme observations, homoscedasticity, and independence of the errors remain influential even in large sample size settings. Conclusion: Given that modern healthcare research typically includes thousands of subjects, focusing on the normality assumption is often unnecessary, does not guarantee valid results, and, worse, may bias estimates due to the practice of outcome transformations. © 2017 Elsevier Inc. All rights reserved.
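The coverage criterion described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the function name `simulate_coverage`, the skewed (centered exponential) error distribution, the sample size, the number of repetitions, and the true slope of 0.5 are all assumptions chosen for illustration. It fits ordinary least squares to data with clearly non-normal errors and records how often a normal-approximation 95% confidence interval contains the true slope.

```python
import numpy as np


def simulate_coverage(n_obs=500, n_sims=1000, true_slope=0.5, seed=0):
    """Return (mean estimated slope, coverage of the 95% CI) under skewed errors.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_sims)
    covered = 0

    for i in range(n_sims):
        x = rng.normal(size=n_obs)
        # Skewed, mean-zero errors: exponential(1) shifted by its mean.
        errors = rng.exponential(scale=1.0, size=n_obs) - 1.0
        y = 1.0 + true_slope * x + errors

        # Ordinary least squares with an intercept column.
        X = np.column_stack([np.ones(n_obs), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)

        # Classical (homoscedastic) standard error of the slope.
        resid = y - X @ beta
        sigma2 = resid @ resid / (n_obs - X.shape[1])
        cov_beta = sigma2 * np.linalg.inv(X.T @ X)
        se_slope = np.sqrt(cov_beta[1, 1])

        # Normal-approximation 95% interval (1.96 standard errors).
        lower = beta[1] - 1.96 * se_slope
        upper = beta[1] + 1.96 * se_slope
        covered += int(lower <= true_slope <= upper)
        slopes[i] = beta[1]

    return slopes.mean(), covered / n_sims


if __name__ == "__main__":
    mean_slope, coverage = simulate_coverage()
    print(f"mean estimated slope: {mean_slope:.3f}")
    print(f"95% CI coverage:      {coverage:.3f}")
```

Under these assumptions, the average estimated slope should sit close to the true value and the coverage close to the nominal 95%, consistent with the abstract's claim that non-normal errors matter little in large samples. Reducing `n_obs` to a handful of observations is one way to explore where the approximation starts to degrade.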

Pages: 146-151

Page count: 6
