Fairness in Data Wrangling

Cited by: 6
Authors
Mazilu, Lacramioara [1 ]
Paton, Norman W. [1 ]
Konstantinou, Nikolaos [1 ]
Fernandes, Alvaro A. A. [1 ]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
Source
2020 IEEE 21ST INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2020) | 2020
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
data wrangling; fairness; bias; sample size disparity; proxy attribute; training dataset; CLASSIFICATION;
DOI
10.1109/IRI49571.2020.00056
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
At the core of many data analysis processes lies the challenge of properly gathering and transforming data. This problem, known as data wrangling, becomes even more challenging when the data sources to be transformed are heterogeneous and autonomous, i.e., have different origins, and when the output is meant to be used as a training dataset, which makes it paramount for that dataset to be fair. Given the rising use of artificial intelligence (AI) systems across a variety of domains, fairness issues must be taken into account while building such systems. In this paper, we aim to bridge the gap between gathering the data and making the resulting dataset fair by proposing a method that performs data wrangling with fairness in mind. To this end, our method comprises a data wrangling pipeline whose behaviour can be adjusted through a set of parameters. Based on fairness metrics computed over the output datasets, the system plans a set of data wrangling interventions with the aim of lowering the bias in the output dataset, using Tabu Search to explore the space of candidate interventions. We consider two potential sources of dataset bias: unequal representation of sensitive groups, and hidden biases introduced through proxies for sensitive attributes. The approach is evaluated empirically.
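The abstract's central mechanism, a Tabu Search over candidate wrangling interventions scored by fairness metrics, can be made concrete with a short sketch. The Python below is an editorial illustration rather than the authors' implementation: it assumes a single in-memory table, uses group undersampling as the only intervention, and scores bias as sample-size disparity (the gap between sensitive-group shares); all names (bias_score, undersample, tabu_search) are hypothetical, and the paper's pipeline additionally addresses proxy attributes and other wrangling steps.

    # Editorial sketch (not the paper's code): Tabu Search over candidate
    # wrangling interventions, guided by a sample-size-disparity score.
    import random
    from collections import Counter

    def bias_score(rows, sensitive):
        # Sample-size disparity: gap between the largest and smallest
        # sensitive-group share (0.0 means the groups are balanced).
        counts = Counter(row[sensitive] for row in rows)
        shares = [c / len(rows) for c in counts.values()]
        return max(shares) - min(shares)

    def undersample(rows, sensitive, group, fraction):
        # One candidate intervention: drop a fraction of one group's rows.
        others = [r for r in rows if r[sensitive] != group]
        members = [r for r in rows if r[sensitive] == group]
        kept = random.sample(members, int(len(members) * (1 - fraction)))
        return others + kept

    def tabu_search(rows, sensitive, groups, steps=20, tabu_len=3):
        # At each step, apply the best non-tabu intervention. Recently
        # applied moves are forbidden for tabu_len steps, which is what
        # distinguishes Tabu Search from plain greedy search: it can
        # step through locally worse states to escape local optima.
        moves = [(g, f) for g in groups for f in (0.1, 0.2, 0.3)]
        tabu = []
        best, best_score = rows, bias_score(rows, sensitive)
        current = rows
        for _ in range(steps):
            candidates = [m for m in moves if m not in tabu]
            results = [(m, undersample(current, sensitive, *m))
                       for m in candidates]
            move, current = min(results,
                                key=lambda mr: bias_score(mr[1], sensitive))
            tabu.append(move)
            if len(tabu) > tabu_len:
                tabu.pop(0)
            score = bias_score(current, sensitive)
            if score < best_score:
                best, best_score = current, score
        return best, best_score

    # Toy run: a dataset with an 80/20 group imbalance is rebalanced.
    data = [{"sex": "F"}] * 200 + [{"sex": "M"}] * 800
    balanced, gap = tabu_search(data, "sex", groups=["F", "M"])
    print(len(balanced), round(gap, 3))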
Pages: 341-348
Page count: 8