Privacy and data mining: evaluating the impact of data anonymization on classification algorithms

Cited by: 8
Authors
Silva, Hebert O. [1]
Basso, Tania [1]
Moraes, Regina [1]
Affiliations
[1] Univ Estadual Campinas, Sch Technol, BR-30332025 Sao Paulo, Brazil
Source
2017 13TH EUROPEAN DEPENDABLE COMPUTING CONFERENCE (EDCC 2017) | 2017
Keywords
Privacy; Anonymization; Data Mining; Classification; Data utility
DOI
10.1109/EDCC.2017.17
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Data anonymization is a technique used to increase the assurance that private data is not accessible to third parties. In data mining, anonymization can affect the results, because anonymized data may hinder the analysis performed by the algorithms commonly used in this context. The goal of this Practical Experience Report is to evaluate the impact of data anonymization on the accuracy and performance of data mining classifiers. This is done by comparing their execution on original and anonymized data. A sample of real data generated by a Brazilian city transportation system, associated with fictitious users, was anonymized at different stages, and classification algorithms such as ZeroR, KNN (k-Nearest Neighbor), and Naive Bayes were applied. Contrary to expectations, when the anonymization techniques were introduced in some classes, accuracy increased and performance improved, with reduced execution time. These results demonstrate that data anonymization techniques, when properly applied, can contribute to the effectiveness of data mining classifiers.
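The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the choice of generalization (banding ages) as the anonymization step, the leave-one-out evaluation, and all function names. It is not the authors' dataset, tooling, or experimental setup, and it measures accuracy only, not execution time.

```python
from collections import Counter

# Hypothetical toy records: (age, trips_per_week) -> fare class.
# These values are invented for illustration; they are not data from the paper.
RECORDS = [
    ((17, 10), "student"), ((19, 12), "student"), ((21, 9), "student"),
    ((34, 10), "regular"), ((41, 11), "regular"), ((38, 8), "regular"),
    ((45, 12), "regular"), ((67, 3), "senior"), ((72, 2), "senior"),
    ((70, 4), "senior"),
]


def generalize(record, width=10):
    """Assumed anonymization step: replace the exact age with the lower
    bound of its age band (e.g. 17 -> 10 for a band width of 10)."""
    (age, trips), label = record
    return ((age // width) * width, trips), label


def zero_r_accuracy(records):
    """ZeroR always predicts the majority class, so its accuracy is the
    share of records that carry that class."""
    counts = Counter(label for _, label in records)
    return counts.most_common(1)[0][1] / len(records)


def knn_predict(train, x, k=3):
    """Majority vote among the k nearest training points, using squared
    Euclidean distance over the feature tuple."""
    nearest = sorted(
        train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def knn_loo_accuracy(records, k=3):
    """Leave-one-out accuracy: classify each record using all the others."""
    hits = 0
    for i, (x, label) in enumerate(records):
        train = records[:i] + records[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(records)


def compare(records, width=10):
    """Return (accuracy on original, accuracy on anonymized) per classifier."""
    anonymized = [generalize(r, width) for r in records]
    return {
        "zero_r": (zero_r_accuracy(records), zero_r_accuracy(anonymized)),
        "knn": (knn_loo_accuracy(records), knn_loo_accuracy(anonymized)),
    }


if __name__ == "__main__":
    for name, (original, anonymized) in compare(RECORDS).items():
        print(f"{name}: original={original:.2f} anonymized={anonymized:.2f}")
```

ZeroR ignores the features, so generalizing them cannot change its accuracy; it serves only as a baseline. Any difference between the two KNN figures depends entirely on the toy data and the band width chosen here.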
Pages: 111-116
Page count: 6