A hybrid anonymization pipeline to improve the privacy-utility balance in sensitive datasets for ML purposes

Cited: 3
Authors
Verdonck, Jenno [1 ]
De Boeck, Kevin [1 ]
Willocx, Michiel [1 ]
Lapon, Jorn [1 ]
Naessens, Vincent [1 ]
Affiliations
[1] Imec DistriNet, Ghent, Belgium
Source
18TH INTERNATIONAL CONFERENCE ON AVAILABILITY, RELIABILITY & SECURITY, ARES 2023 | 2023
Keywords
Anonymity; Privacy; Utility; ML; K-anonymity
DOI
10.1145/3600160.3600168
Chinese Library Classification
TP [automation technology; computer technology]
Subject classification code
0812
Abstract
The modern world is data-driven. Businesses increasingly take strategic decisions based on customer data, and companies are founded with the sole focus of performing machine-learning-driven data analytics for third parties. External data sources containing sensitive records are often required to build high-quality machine learning models and, hence, to make accurate and meaningful predictions. However, exchanging sensitive datasets is no sinecure: personal data must be managed according to privacy regulations, and the loss of strategic data can harm a company's competitiveness. In both cases, dataset anonymization can overcome these obstacles. This work proposes a hybrid anonymization pipeline that combines masking with (intelligent) sampling to improve the privacy-utility balance of anonymized datasets. The approach is validated through in-depth experiments on a representative machine learning scenario. A quantitative privacy assessment of the proposed pipeline relies on two well-known privacy metrics, namely re-identification risk and certainty. Furthermore, this work shows that the utility level of the anonymized dataset remains acceptable, and that the overall privacy-utility balance increases when masking is complemented with intelligent sampling. The study further counters the common misconception that dataset anonymization is detrimental to the quality of machine learning models: the empirical study shows that anonymous datasets generated by the hybrid anonymization pipeline can compete with the original (identifiable) ones when used as input for training a machine learning model.
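The abstract does not include implementation details, but the two building blocks it names — masking followed by sampling, evaluated via re-identification risk — can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's code: the toy records, the generalization rules in `mask`, the choice of quasi-identifiers, and the use of average prosecutor risk (1 divided by the size of a record's equivalence class) as the re-identification metric.

```python
# Hypothetical sketch of a mask-then-sample anonymization step.
# All names, records, and generalization rules are illustrative assumptions.
from collections import Counter
import random


def mask(record):
    """Generalize quasi-identifiers: bucket age into decades, truncate the ZIP code."""
    age, zip_code, diagnosis = record
    return (age // 10 * 10, zip_code[:3] + "**", diagnosis)


def reidentification_risk(records, quasi_ids=lambda r: r[:2]):
    """Average prosecutor risk: 1 / size of each record's equivalence class."""
    classes = Counter(quasi_ids(r) for r in records)
    return sum(1.0 / classes[quasi_ids(r)] for r in records) / len(records)


data = [(34, "90210", "flu"), (37, "90211", "flu"),
        (52, "10001", "asthma"), (58, "10002", "asthma")]

# Step 1: masking merges records into equivalence classes.
masked = [mask(r) for r in data]

# Step 2: sampling adds uncertainty about whether a target is in the dataset at all.
random.seed(0)
sampled = random.sample(masked, k=3)

print(reidentification_risk(data))    # 1.0: every raw record is unique
print(reidentification_risk(masked))  # 0.5: records fall into classes of size 2
```

The sketch shows the direction of the trade-off the paper studies: masking lowers re-identification risk by making quasi-identifier combinations non-unique, while sampling (here plain random sampling; the paper's *intelligent* sampling would pick records to preserve utility) further reduces an attacker's certainty that a target is present.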
Pages: 24