Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

Cited by: 44
Authors
Amos, Ryan [1]
Acar, Gunes [2]
Lucherini, Elena [1]
Kshirsagar, Mihir [1]
Narayanan, Arvind [1]
Mayer, Jonathan [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Katholieke Univ Leuven, Imec COSIC, Leuven, Belgium
Source
PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2021 (WWW 2021) | 2021
Keywords
privacy policy; web tracking; data protection; open dataset; SELF-REGULATION; ONLINE; IMPACT
DOI
10.1145/3442381.3450048
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the occurrence of tracking-related terminology in our dataset to prior web privacy measurements, we find that privacy policies have consistently failed to disclose the presence of common tracking technologies and third parties. We also find that over the last twenty years privacy policies have become even more difficult to read, doubling in length and increasing a full grade in the median reading level. Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations. Finally, we contribute to the literature on privacy regulation by demonstrating the historic impact of the GDPR on privacy policies.
Pages: 2165-2176
Page count: 12