Semi-Supervised Linear Regression

Times Cited: 28
Authors
Azriel, David [1 ]
Brown, Lawrence D. [2 ]
Sklar, Michael [3 ]
Berk, Richard [2 ]
Buja, Andreas [2 ]
Zhao, Linda [2 ]
Affiliations
[1] Technion Israel Inst Technol, Haifa, Israel
[2] Univ Penn, Wharton Sch, Dept Stat, Philadelphia, PA 19104 USA
[3] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
Keywords
Linear regression; Misspecified models; Semi-supervised learning; Inference; Efficient
DOI
10.1080/01621459.2021.1915320
Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline Codes
020208; 070103; 0714
Abstract
We study a regression problem where for part of the data we observe both the label variable (Y) and the predictors (X), while for the remaining part only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. When the conditional expectation E[Y | X] is not exactly linear, one can consider the best linear approximation to the conditional expectation, which can be estimated consistently by the least-squares estimator (LSE). The latter depends only on the labeled data. We suggest improved alternative estimates to the LSE that also use the unlabeled data. Our estimation method can be easily implemented and has simply described asymptotic properties. The new estimates asymptotically dominate the usual standard procedures under a certain nonlinearity condition on E[Y | X]; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real-data example of inferring the homeless population is used to illustrate the new methodology.
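The abstract says the method is easy to implement but gives no formula. Below is a minimal sketch of one natural semi-supervised adjustment, offered as an illustration of the general idea rather than the authors' exact estimator: the best linear approximation solves beta = E[XX']^{-1} E[XY], and since the second-moment matrix E[XX'] depends only on the predictors it can be estimated from the pooled labeled and unlabeled X's, while E[XY] is estimated from the labeled pairs only. The function name semi_supervised_ls and the simulated data are hypothetical.

```python
import numpy as np

def semi_supervised_ls(X_lab, y_lab, X_unlab):
    """Sketch of a semi-supervised least-squares estimate (with intercept).

    Assumption: E[XX'] is estimated from all predictors (labeled + unlabeled),
    E[XY] from the labeled pairs only; this is one plausible reading of the
    approach, not necessarily the estimator proposed in the paper.
    """
    # Labeled design matrix with an intercept column.
    Xl = np.column_stack([np.ones(len(X_lab)), X_lab])
    # Pooled (labeled + unlabeled) design matrix with an intercept column.
    X_all = np.vstack([X_lab, X_unlab])
    Xa = np.column_stack([np.ones(len(X_all)), X_all])
    # Second-moment matrix from all predictors, cross-moment from labeled data.
    Sxx_all = Xa.T @ Xa / len(Xa)
    Sxy_lab = Xl.T @ y_lab / len(Xl)
    return np.linalg.solve(Sxx_all, Sxy_lab)

# Hypothetical usage: a mildly nonlinear E[Y | X] and many unlabeled X's.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(100, 2))
y_lab = 1.0 + X_lab @ np.array([2.0, -1.0]) + 0.5 * X_lab[:, 0] ** 2 + rng.normal(size=100)
X_unlab = rng.normal(size=(2000, 2))

beta_ols = np.linalg.lstsq(np.column_stack([np.ones(100), X_lab]), y_lab, rcond=None)[0]
beta_ssl = semi_supervised_ls(X_lab, y_lab, X_unlab)
print("labeled-only LSE:", beta_ols)
print("semi-supervised:", beta_ssl)
```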
Pages: 2238-2251
Number of pages: 14