Input-dependent estimation of generalization error under covariate shift

Cited: 67
Authors
Sugiyama, Masashi [1 ,2 ]
Mueller, Klaus-Robert [2 ,3 ]
Affiliations
[1] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1-W8-74 O-okayama, Tokyo 1528552, Japan
[2] Fraunhofer FIRST, IDA, D-12489 Berlin, Germany
[3] Univ Potsdam, Dept Comp Sci, D-14482 Potsdam, Germany
Keywords
Linear regression; generalization error; model selection; covariate shift; sample selection bias; interpolation; extrapolation; active learning; classification with imbalanced data
DOI
10.1524/stnd.2005.23.4.249
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline Codes
020208; 070103; 0714
Abstract
A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is not fulfilled, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. The violation of this assumption, known as covariate shift, causes a heavy bias in standard generalization error estimation schemes such as cross-validation or Akaike's information criterion, and thus they result in poor model selection. In this paper, we propose an alternative estimator of the generalization error for the squared loss function when training and test distributions are different. The proposed generalization error estimator is shown to be exactly unbiased for finite samples if the learning target function is realizable, and asymptotically unbiased in general. We also show that, in addition to the unbiasedness, the proposed generalization error estimator can accurately estimate the difference of the generalization error among different models, which is a desirable property in model selection. Numerical studies show that the proposed method compares favorably with existing model selection methods in regression for extrapolation and in classification with imbalanced data.
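The bias the abstract describes can be sketched numerically. The snippet below is not the paper's input-dependent estimator; it is a minimal illustration using the standard importance-weighting idea (reweighting training residuals by p_test(x)/p_train(x)) in the extrapolation setting the abstract mentions. The target function, densities, and sample sizes are made up for the sketch: training inputs drawn from N(1, 0.5²), test inputs from N(2, 0.25²), target sinc(x), linear model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sinc(x)  # sin(pi x) / (pi x)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), evaluated pointwise.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Training data: inputs from N(1, 0.5^2), noisy targets.
n_train = 200
x_tr = rng.normal(1.0, 0.5, n_train)
y_tr = f(x_tr) + rng.normal(0.0, 0.1, n_train)

# Ordinary least-squares fit of a straight line.
coef = np.polyfit(x_tr, y_tr, deg=1)
resid2 = (np.polyval(coef, x_tr) - y_tr) ** 2

# Naive estimate: unweighted mean squared training residual.
# Under covariate shift this underestimates the test error.
est_unweighted = resid2.mean()

# Importance-weighted estimate: reweight each residual by
# p_test(x) / p_train(x) to correct for the shifted inputs.
w = gauss_pdf(x_tr, 2.0, 0.25) / gauss_pdf(x_tr, 1.0, 0.5)
est_weighted = (w * resid2).mean()

# Monte-Carlo ground truth on a large sample from the test
# distribution (plus the noise variance, for comparability).
x_te = rng.normal(2.0, 0.25, 100_000)
true_err = ((np.polyval(coef, x_te) - f(x_te)) ** 2).mean() + 0.1 ** 2

print(est_unweighted, est_weighted, true_err)
```

Because the linear fit extrapolates poorly into the test region around x = 2, the unweighted training error is far below the true test error, while the reweighted estimate moves toward it; this is the gap that makes unweighted cross-validation and AIC misleading under covariate shift.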
Pages: 249-279
Number of pages: 31