Fairness in Data Wrangling

Cited by: 6
Authors
Mazilu, Lacramioara [1 ]
Paton, Norman W. [1 ]
Konstantinou, Nikolaos [1 ]
Fernandes, Alvaro A. A. [1 ]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
Source
2020 IEEE 21ST INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2020) | 2020
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
data wrangling; fairness; bias; sample size disparity; proxy attribute; training dataset; CLASSIFICATION;
DOI
10.1109/IRI49571.2020.00056
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
At the core of many data analysis processes lies the challenge of properly gathering and transforming data. This problem, known as data wrangling, becomes even more challenging when the data sources to be transformed are heterogeneous and autonomous, i.e., have different origins, and when the output is meant to be used as a training dataset, which makes it paramount for that dataset to be fair. Given the rising use of artificial intelligence (AI) systems across a variety of domains, fairness issues must be taken into account while building such systems. In this paper, we aim to bridge the gap between gathering the data and making the resulting dataset fair by proposing a method that performs data wrangling with fairness in mind. To this end, our method comprises a data wrangling pipeline whose behaviour can be adjusted through a set of parameters. Based on fairness metrics computed over the output datasets, the system plans a set of data wrangling interventions with the aim of lowering the bias in the output dataset, using Tabu Search to explore the space of candidate interventions. We consider two potential sources of dataset bias: unequal representation of sensitive groups, and hidden biases introduced through proxies for sensitive attributes. The approach is evaluated empirically.
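The abstract's central mechanism, a Tabu Search over candidate wrangling interventions scored by fairness metrics, can be made concrete with a short sketch. The Python below is an editorial illustration rather than the authors' implementation: it assumes a single in-memory table, uses group undersampling as the only intervention, and scores bias as sample-size disparity (the gap between sensitive-group shares); all names (bias_score, undersample, tabu_search) are hypothetical, and the paper's pipeline additionally addresses proxy attributes and other wrangling steps.

    # Editorial sketch (not the paper's code): Tabu Search over candidate
    # wrangling interventions, guided by a sample-size-disparity score.
    import random
    from collections import Counter

    def bias_score(rows, sensitive):
        # Sample-size disparity: gap between the largest and smallest
        # sensitive-group share (0.0 means the groups are balanced).
        counts = Counter(row[sensitive] for row in rows)
        shares = [c / len(rows) for c in counts.values()]
        return max(shares) - min(shares)

    def undersample(rows, sensitive, group, fraction):
        # One candidate intervention: drop a fraction of one group's rows.
        others = [r for r in rows if r[sensitive] != group]
        members = [r for r in rows if r[sensitive] == group]
        kept = random.sample(members, int(len(members) * (1 - fraction)))
        return others + kept

    def tabu_search(rows, sensitive, groups, steps=20, tabu_len=3):
        # At each step, apply the best non-tabu intervention. Recently
        # applied moves are forbidden for tabu_len steps, which is what
        # distinguishes Tabu Search from plain greedy search: it can
        # step through locally worse states to escape local optima.
        moves = [(g, f) for g in groups for f in (0.1, 0.2, 0.3)]
        tabu = []
        best, best_score = rows, bias_score(rows, sensitive)
        current = rows
        for _ in range(steps):
            candidates = [m for m in moves if m not in tabu]
            results = [(m, undersample(current, sensitive, *m))
                       for m in candidates]
            move, current = min(results,
                                key=lambda mr: bias_score(mr[1], sensitive))
            tabu.append(move)
            if len(tabu) > tabu_len:
                tabu.pop(0)
            score = bias_score(current, sensitive)
            if score < best_score:
                best, best_score = current, score
        return best, best_score

    # Toy run: a dataset with an 80/20 group imbalance is rebalanced.
    data = [{"sex": "F"}] * 200 + [{"sex": "M"}] * 800
    balanced, gap = tabu_search(data, "sex", groups=["F", "M"])
    print(len(balanced), round(gap, 3))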
Pages: 341-348
Page count: 8