PRISPARK: Differential Privacy Enforcement for Big Data Computing in Apache Spark

Cited by: 2
Authors
Li, Shuailou [1 ,2 ]
Wen, Yu [1 ]
Xue, Tao [3 ]
Wang, Zhaoyang [1 ,2 ]
Wu, Yanna [1 ]
Meng, Dan [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Xidian Univ, Hangzhou Inst Technol, Hangzhou, Peoples R China
Source
2023 42ND INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, SRDS 2023 | 2023
Keywords
differential privacy; Apache Spark; big data; NOISE; MODEL
DOI
10.1109/SRDS60354.2023.00019
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Differential privacy has emerged as a gold-standard privacy definition owing to its strong mathematical guarantee. While various data protection mechanisms provide differential privacy for SQL queries on RDBMSs, enforcing differential privacy on big data platforms remains underexplored. This work presents PRISPARK, which enforces differential privacy for Spark, the distributed engine for large-scale data computing in big data ecosystems, where sensitive data is often processed. PRISPARK aims to support various kinds of data processing (i.e., relational and unstructured queries) on Spark. In particular, to calculate a tighter sensitivity bound and improve the utility of results, we design an overall statistics estimation algorithm that estimates the upper bound of statistics under a filter condition, and we propose a novel fine-grained, operation-oriented rule set for calculating the sensitivity of various relational and unstructured queries. Moreover, we propose a general differential privacy mechanism, PRISPARK, a suite comprising PRISPARKSQL and PRISPARKDAG. We enforce PRISPARKSQL at the Catalyst optimization layer for relational queries in Spark SQL, and PRISPARKDAG at the RDD execution layer for unstructured queries in Spark core. Finally, we experimentally evaluate PRISPARK on the TPC-H, TPC-DS, and PigMix benchmarks and on the real-world LANL dataset. The experimental results suggest that PRISPARK supports various applications and queries while improving the utility of all query results by orders of magnitude with negligible performance overhead.
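To illustrate why a tighter sensitivity bound improves utility, the following is a minimal sketch of the standard Laplace mechanism, in which the noise scale equals sensitivity divided by epsilon. It is not PRISPARK's algorithm: the function names (`laplace_scale`, `laplace_noise`, `private_sum`), the clipping of values to `[0, upper_bound]`, and the add-or-remove-one-record neighboring model are all assumptions made for illustration.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    if sensitivity < 0 or epsilon <= 0:
        raise ValueError("sensitivity must be >= 0 and epsilon must be > 0")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, upper_bound: float, epsilon: float,
                rng: random.Random) -> float:
    """Epsilon-DP sum of values clipped to [0, upper_bound].

    Assumption: neighboring datasets differ by adding or removing one
    record, so the clipped sum has sensitivity `upper_bound`.
    """
    clipped = [min(max(v, 0.0), upper_bound) for v in values]
    scale = laplace_scale(upper_bound, epsilon)
    return sum(clipped) + laplace_noise(scale, rng)
```

With the same epsilon, lowering the assumed upper bound from 100 to 10 lowers the noise scale tenfold (`laplace_scale(100, 0.5)` versus `laplace_scale(10, 0.5)`), which is the kind of utility gain a tighter bound on filtered statistics is meant to deliver.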
Pages: 93-106
Page count: 14