Stratified learning: A general-purpose statistical method for improved learning under covariate shift

Cited by: 3
Authors
Autenrieth, Maximilian [1]
van Dyk, David A. [1]
Trotta, Roberto [2, 3, 4]
Stenning, David C. [5]
Affiliations
[1] Imperial Coll London, Dept Math, London, England
[2] SISSA, Dept Phys, Trieste, Italy
[3] Imperial Coll London, Dept Phys, London, England
[4] Ctr Nazl High Performance Comp Big Data & Quantum, Casalecchio Di Reno, Italy
[5] Simon Fraser Univ, Dept Stat & Actuarial Sci, Burnaby, BC, Canada
Funding
Engineering and Physical Sciences Research Council (UK); Natural Sciences and Engineering Research Council of Canada
Keywords
astrostatistics; bias reduction; domain adaptation; machine learning; propensity scores; SUPERNOVA PHOTOMETRIC CLASSIFICATION; PROPENSITY SCORE; BIAS; SUBCLASSIFICATION; ADAPTATION; INFERENCE
DOI
10.1002/sam.11643
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We refer to the overall method as Stratified Learning, or StratLearn. We demonstrate the effectiveness of this general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best-reported AUC (0.958) on the updated "Supernovae photometric classification challenge," and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
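The stratification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the propensity score is the estimated probability that an observation belongs to the training (source) set given its covariates, estimates it with a plain logistic regression, splits the pooled data into `n_strata` quantile-based strata, and fits an ordinary least-squares learner within each stratum. The function names, the choice of five strata, the `min_src` threshold, and the fallback to a pooled fit for sparse strata are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, s, n_iter=500, lr=0.1):
    """Gradient-descent logistic regression; returns weights (intercept first)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - s) / len(s)
    return w

def propensity_scores(X_src, X_tgt):
    """Estimate P(source | x) on the pooled source and target covariates."""
    X = np.vstack([X_src, X_tgt])
    s = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt))])
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize features
    w = fit_logistic(Z, s)
    e = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(len(Z)), Z]) @ w)))
    return e[:len(X_src)], e[len(X_src):]

def design(X):
    """Add an intercept column for the least-squares learner."""
    return np.column_stack([np.ones(len(X)), X])

def stratified_predict(X_src, y_src, X_tgt, n_strata=5, min_src=5):
    """Fit one learner per propensity-score stratum and predict target outcomes."""
    e_src, e_tgt = propensity_scores(X_src, X_tgt)
    # Interior quantile edges of the pooled scores define the strata.
    edges = np.quantile(np.concatenate([e_src, e_tgt]),
                        np.linspace(0, 1, n_strata + 1)[1:-1])
    k_src = np.searchsorted(edges, e_src)
    k_tgt = np.searchsorted(edges, e_tgt)
    # Pooled fit, used only when a stratum has too few source points
    # (an assumption of this sketch).
    beta_all, *_ = np.linalg.lstsq(design(X_src), y_src, rcond=None)
    y_hat = np.empty(len(X_tgt))
    for k in range(n_strata):
        in_src, in_tgt = k_src == k, k_tgt == k
        if not in_tgt.any():
            continue
        if in_src.sum() >= min_src:
            beta, *_ = np.linalg.lstsq(design(X_src[in_src]), y_src[in_src],
                                       rcond=None)
        else:
            beta = beta_all
        y_hat[in_tgt] = design(X_tgt[in_tgt]) @ beta
    return y_hat, k_tgt
```

Calling `stratified_predict(X_src, y_src, X_tgt)` returns the target predictions together with each target point's stratum index. Any supervised learner could replace the least-squares fit inside the loop; the point of the sketch is only that each learner sees source and target data with similar estimated propensity scores.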
Pages: 16