Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias

Cited by: 5

Authors:

Dost, Katharina [1]

Taskova, Katerina [1]

Riddle, Patricia [1]

Wicker, Jorg [1]

Affiliations:

[1] Univ Auckland, Sch Comp Sci, Auckland, New Zealand

Source:

20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020) | 2020

Keywords:

DOI:

10.1109/ICDM50108.2020.00115

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Machine Learning typically assumes that training and test sets are independently drawn from the same distribution, but this assumption is often violated in practice, which creates a bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias? In contrast to prior work, this paper introduces a new method, IMITATE, to identify and mitigate Selection Bias in the case that we may not know if (and where) a bias is present, and hence no ground-truth information is available. IMITATE investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample. We demonstrate the effectiveness of the proposed method on both synthetic and real-world datasets. We also point out limitations and future research directions.
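The density-filling idea in the abstract can be illustrated with a rough one-dimensional sketch. This is not the IMITATE algorithm itself, whose details are not given in this record. The function name `gaussian_fill`, the histogram with 30 bins, the moment-based Gaussian fit, and the scaling of the Gaussian to the tallest bin are all assumptions made purely for illustration.

```python
import numpy as np

def gaussian_fill(sample, bins=30, seed=0):
    """Propose synthetic 1-D points where a histogram falls below a fitted Gaussian.

    Hypothetical helper for illustration only; not the published method.
    """
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(sample, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Assumption: fit the Gaussian by sample mean and standard deviation.
    mu, sigma = sample.mean(), sample.std()
    shape = np.exp(-0.5 * ((centers - mu) / sigma) ** 2)

    # Assumption: scale the Gaussian so its peak matches the tallest bin.
    expected = shape / shape.max() * counts.max()

    # Number of points each bin is missing relative to the Gaussian.
    deficit = np.clip(np.rint(expected - counts), 0, None).astype(int)

    # Draw the missing points uniformly inside each under-filled bin.
    new_points = np.concatenate(
        [rng.uniform(lo, hi, size=k)
         for lo, hi, k in zip(edges[:-1], edges[1:], deficit)]
    )
    return new_points, centers, deficit

# Synthetic example: a standard normal sample with 80% of the points
# in the interval (0.5, 1.5) removed to mimic a selection bias.
rng = np.random.default_rng(1)
full = rng.normal(0.0, 1.0, size=20000)
in_gap = (full > 0.5) & (full < 1.5)
keep = ~in_gap | (rng.random(full.size) < 0.2)
biased = full[keep]

added, centers, deficit = gaussian_fill(biased)
print("bin with the most added points is centred at", centers[deficit.argmax()])
```

In this sketch the bin that receives the most generated points lies inside the depleted interval, which mirrors the abstract's reading that concentrated artificial points may indicate an underrepresented area. Because the mean and standard deviation are estimated from the biased sample, the fitted Gaussian is itself shifted, so some points are also added elsewhere.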


Pages: 996-1001

Page count: 6
