Data Anonymization for Privacy Aware Machine Learning

Cited by: 5
Authors
Jaidan, David Nizar [1 ]
Carrere, Maxime [2 ]
Chemli, Zakaria [3 ]
Poisvert, Remi [4 ]
Affiliations
[1] Innovat L B Scalian France, Labege, France
[2] Ctr Excellence Datascale Scalian France, Le Haillan, France
[3] Innovat L B Scalian France, Paris, France
[4] Innovat L B Scalian France, Rennes, France
Source
MACHINE LEARNING, OPTIMIZATION, AND DATA SCIENCE | 2019, Vol. 11943
Keywords
Privacy; Anonymization; Machine learning; Text encoding; Natural language processing; Time series; Anomaly detection;
DOI
10.1007/978-3-030-37599-7_60
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The increase in data leaks, attacks, and ransomware incidents in recent years has raised concerns about data security and privacy, which in turn has hindered the sharing and publication of data. Addressing these limitations calls for innovative data-protection techniques, especially when data feed machine-learning models. In this context, differential privacy is one of the most effective approaches to preserving privacy; however, its applications have largely been limited to numerical and structured data. In this study, we therefore investigate the behavior of differential privacy applied to textual data and time series. We evaluated the proposed approach by comparing two differentially private algorithms based on Principal Component Analysis (PCA). Their effectiveness was demonstrated by applying three machine learning models to both anonymized and original data, and their performance was thoroughly evaluated in terms of confidentiality, utility, scalability, and computational efficiency. The PPCA method provides high anonymization quality at the expense of long running times, while the DPCA method preserves more utility and computes faster. We show that a neural-network text-representation approach can be combined with differential privacy methods, and we highlight that it is well within reach to anonymize real-world measurement data from satellite sensors for an anomaly-detection task. We believe this study will motivate wider adoption of differential privacy techniques, leading to more data sharing while preserving privacy.
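The abstract compares PCA-based differentially private algorithms without giving their details. As a rough illustration of the general idea (not the paper's DPCA or PPCA implementations), the sketch below applies the standard Gaussian mechanism to the sample covariance matrix before eigendecomposition, in the style of "Analyze Gauss" DP-PCA; the function name, parameters, and the row-norm-bounded-by-1 assumption are illustrative choices, not taken from the paper.

```python
import numpy as np

def dp_pca(X, epsilon, delta, n_components):
    """Illustrative (epsilon, delta)-differentially private PCA sketch:
    perturb the sample covariance with symmetric Gaussian noise, then
    eigendecompose. Assumes each row of X has L2 norm <= 1, so replacing
    one row changes the scaled covariance by at most 1/n in norm."""
    n, d = X.shape
    cov = (X.T @ X) / n
    # Gaussian-mechanism noise scale for sensitivity 1/n.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / (n * epsilon)
    noise = np.random.normal(0.0, sigma, size=(d, d))
    noise = (noise + noise.T) / 2.0           # symmetrize so eigenvalues stay real
    eigvals, eigvecs = np.linalg.eigh(cov + noise)
    order = np.argsort(eigvals)[::-1]         # descending eigenvalue order
    components = eigvecs[:, order[:n_components]]
    return X @ components                     # project data onto private components

# Usage: anonymize a toy dataset before training a downstream model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # clip rows to norm <= 1
Z = dp_pca(X, epsilon=1.0, delta=1e-5, n_components=3)
print(Z.shape)
```

A model trained on `Z` never sees the raw features directly, which is the trade-off the abstract measures: stronger noise (smaller epsilon) improves confidentiality but degrades downstream utility.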
Pages: 725-737 (13 pages)