Anonymizing NYC Taxi Data: Does It Matter?

Cited by: 51
Authors
Douriez, Marie [1]
Doraiswamy, Harish [2]
Freire, Juliana [2]
Silva, Claudio T. [2]
Affiliations
[1] Ecole Polytech, F-91128 Palaiseau, France
[2] NYU, New York, NY 10003 USA
Source
PROCEEDINGS OF 3RD IEEE/ACM INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2016) | 2016
Keywords
privacy attacks; trajectory privacy; taxi data; spatio-temporal data
DOI
10.1109/DSAA.2016.21
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that is useful for improving cities through traffic management and city planning. Yet they also contain information about individuals, which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A flawed hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether "perfect" anonymity is possible and whether such an identity disclosure can be avoided, given the availability of diverse external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join-based attack that matches the taxi data with external medallion data that an adversary can easily gather. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that operate in NYC even when a perfect pseudonymization of medallion numbers is used. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions TLC faces in publishing the taxi data, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.
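The spatio-temporal join attack described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layouts (`Trip`, `Sighting`), the distance and time thresholds, the majority-vote rule, and the medallion and pseudonym values are all assumptions made for illustration, and the brute-force nested loop stands in for whatever spatial index a real attack would use.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Trip:
    pseudonym: str  # anonymized taxi ID as published (hypothetical field)
    t: float        # pickup or drop-off time, in seconds
    x: float        # location in metres, assumed local planar projection
    y: float


@dataclass(frozen=True)
class Sighting:
    medallion: str  # real medallion number seen by the adversary
    t: float
    x: float
    y: float


def reidentify(trips, sightings, max_dist=50.0, max_dt=120.0):
    """Map each medallion to the pseudonym that matches most of its sightings."""
    votes = defaultdict(Counter)
    for s in sightings:
        # Join predicate: the trip endpoint is close in both time and space.
        matched = {
            tr.pseudonym
            for tr in trips
            if abs(tr.t - s.t) <= max_dt
            and hypot(tr.x - s.x, tr.y - s.y) <= max_dist
        }
        votes[s.medallion].update(matched)  # one vote per pseudonym per sighting

    result = {}
    for medallion, counter in votes.items():
        ranked = counter.most_common(2)
        # Accept a match only when one pseudonym strictly leads (assumed rule).
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            result[medallion] = ranked[0][0]
    return result


# Toy data: two taxis are near the first sighting, only "a1" near the second.
trips = [
    Trip("a1", 0, 0, 0),
    Trip("b2", 10, 20, 0),
    Trip("a1", 1000, 500, 500),
    Trip("b2", 1000, 5000, 5000),
]
sightings = [Sighting("7X42", 30, 10, 0), Sighting("7X42", 1010, 510, 495)]
print(reidentify(trips, sightings))  # {'7X42': 'a1'}
```

A single sighting leaves the two pseudonyms tied, so nothing is reported; the second sighting breaks the tie. This mirrors the intuition that each additional external observation narrows the candidate set, even when the pseudonyms themselves leak nothing.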
Pages: 140-148
Number of pages: 9