Where You Are Is Who You Are: User Identification by Matching Statistics

Cited by: 81
Authors
Naini, Farid M. [1]
Unnikrishnan, Jayakrishnan [1]
Thiran, Patrick [1]
Vetterli, Martin [1]
Affiliation
[1] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland
Keywords
Data privacy; de-anonymization; identification of persons; privacy
DOI
10.1109/TIFS.2015.2498131
Chinese Library Classification number
TP301 [Theory and Methods]
Discipline classification code
081202
Abstract
Most users of online services have unique behavioral or usage patterns. These patterns can be exploited to identify and track users from the observed behavior alone. We study the task of identifying users from statistics of their behavioral patterns. In particular, we focus on the setting in which we are given histograms of users' data collected during two different experiments. We assume that the users' identities are anonymized or hidden in the first data set and known in the second. We study the task of identifying the users by matching the histograms of their data in the first data set with the histograms from the second data set. The optimal algorithm for this user identification task was introduced in recent work. In this paper, we evaluate the effectiveness of this method on three different types of data sets with up to 50,000 users, and in multiple scenarios. Using data sets such as call data records, web browsing histories, and GPS trajectories, we demonstrate that a large fraction of users can be easily identified given only histograms of their data; hence, these histograms can act as users' fingerprints. We also verify that simultaneous identification of users achieves better performance than one-by-one user identification. Furthermore, we show that the optimal identification method indeed gives higher identification accuracy than heuristics-based approaches in practical scenarios. The accuracy obtained under this optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized data set, and the resolution of the data set. We also analyze the effectiveness of k-anonymization in resisting user identification attacks on these data sets.
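The difference between one-by-one and simultaneous identification described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: this record does not state the optimal matching cost, so the Jensen-Shannon divergence used here is an assumption, as are all function names and the toy histograms. The exhaustive search over permutations is only feasible for a handful of users; at realistic scale a polynomial-time assignment solver would be needed.

```python
import math
from itertools import permutations


def normalize(counts):
    """Turn raw histogram counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence D(p || q), skipping zero-mass bins of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed cost, not taken from the paper)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cost_matrix(anon, known):
    """cost[i][j] = divergence between anonymized user i and known user j."""
    return [[js_divergence(normalize(a), normalize(k)) for k in known]
            for a in anon]


def match_one_by_one(cost):
    """Each anonymized user independently picks the closest known user."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


def match_simultaneously(cost):
    """One-to-one assignment minimizing total cost (brute force, tiny n only)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


# Hypothetical toy data: 3 users, histograms over 3 bins (e.g. locations).
known = [[8, 1, 1], [5, 4, 1], [1, 1, 8]]   # identities known
anon = [[6, 3, 1], [4, 5, 1], [1, 2, 7]]    # same users, identities hidden

cost = cost_matrix(anon, known)
print("one-by-one:  ", match_one_by_one(cost))      # [1, 1, 2]
print("simultaneous:", match_simultaneously(cost))  # [0, 1, 2]
```

In this toy case the one-by-one rule assigns the first two anonymized users to the same known user, while the simultaneous rule is forced to produce a one-to-one assignment and recovers the intended identities.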
Pages: 358-372
Number of pages: 15