A Differentially Private Scheme for Top-k Frequent Itemsets Mining Over Data Streams

被引:0
作者
Liang W.-J. [1 ,2 ,3 ]
Chen H. [1 ,2 ]
Zhao S.-Y. [1 ,2 ]
Li C.-P. [1 ,2 ]
机构
[1] Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing
[2] School of Information, Renmin University of China, Beijing
[3] College of Computer and Information Engineering, Henan University, Kaifeng
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2021年 / 44卷 / 04期
基金
中国国家自然科学基金;
关键词
Differential privacy; Frequent itemsets mining; Pattern estimation; Transaction splitting; Weighted reservoir sampling;
D O I
10.11897/SP.J.1016.2021.00741
中图分类号
学科分类号
摘要
Frequent Itemset Mining(FIM) is a popular technique for analyzing transactional data. Currently, continually mining frequent patterns over data streams have important application value. However, if these data are sensitive, directly releasing frequent patterns and their true supports may disclose the privacy of individuals. The protection of user privacy while obtaining statistical information is important. Differential privacy, as a rigid and provable privacy model, has recently gained significant attention in frequent itemset mining. Several differentially private schemes for FIM have been proposed. However, most of them are designed for static datasets and focus on privacy preservation for one-shot release of frequent patterns. In this paper, we focus on differentially private FIM over data streams, and aim to design a private continual release scheme which can not only achieve high data utility and a high degree of privacy, but also offer high time efficiency. Compared to the static release, the privacy preservation processing in FIM over data streams should meet the real-time requirement while achieving high data utility. Therefore, two challenges will be faced when designing the private scheme for FIM over data streams: the first is the cumulative consumption of privacy budget in the continual release process, which may result in low data utility. The second is the enlargement of candidate itemsets, which may cause a high error of the released result and low release efficiency. Some strategies need to be designed to solve the two challenges to promote the data utility and release efficiency. To solve the first challenge, a strategy of event level privacy for FIM over data streams is designed. Benefitting from this strategy, the utilization of the privacy budget for the continual release can be maximized, and thus the data utility can be improved. To solve the second challenge, an efficient transaction splitting method based on count estimation under differential privacy is first proposed in the preprocessing stage. Since splitting may cause a certain amount of information loss, the information loss is then analyzed and compensated in the publishing stage. Benefitting from transaction splitting, both the candidate itemsets and the information loss can be reduced, and thus the error of the released result can also be significantly reduced. Then in the continually publishing stage, when mining based on Cantree, the generated candidate patterns are compressed based on the support threshold to further reduce the error of the released result. Only the patterns whose support is smaller than the threshold can be retained in the sampling set. Based on the reduced sampling set, an incremental update release scheme is designed by combining weighted reservoir sampling with exponential mechanism (EM). The advantage of this combination is the sampling set can be traversed only once when finding the top-k frequent patterns, which can promote the release efficiency while not affecting the privacy level and data utility. By formal privacy analysis, we theoretically show that our scheme is ε-differentially private. Finally, extensive experiments on real-world datasets illustrate that our scheme outperforms the state-of-the-art comparison methods. © 2021, Science Press. All right reserved.
引用
收藏
页码:741 / 760
页数:19
相关论文
共 43 条
[1]  
Garofalakis M N, Gehrke J, Rastogi R., Querying and mining data streams: You only get one look a tutorial, Proceedings of the International Conference on Management of Data(SIGMOD), (2002)
[2]  
Tanbeer S K, Ahmed C F, Jeong B, Lee Y., Sliding window-based frequent pattern mining over data streams, Information Sciences, 179, 22, pp. 3843-3865, (2009)
[3]  
Dwork C, Roth A., The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science, 9, 3, pp. 211-407, (2014)
[4]  
Dwork C, McSherry F, Nissim K, Smith A D., Calibrating noise to sensitivity in private data analysis, Proceedings of the International Conference on Theory of Cryptography (TCC), pp. 265-284, (2006)
[5]  
McSherry F, Talwar K., Mechanism design via differential privacy, Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 94-103, (2007)
[6]  
Nanavati N R, Jinwala D C., A novel privacy-preserving scheme for collaborative frequent itemset mining across vertically partitioned data, Security and Communication Networks, 8, 18, pp. 4407-4420, (2015)
[7]  
Wong W K, Cheung D W, Et al., Security in outsourcing of association rule mining, Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 111-122, (2007)
[8]  
Wong W K, Cheung D W, Et al., An audit environment for outsourcing of frequent itemset mining, Proceedings of the VLDB Endowment (PVLDB), 2, 1, pp. 1162-1172, (2009)
[9]  
Atzori M, Bonchi F, Et al., Anonymity preserving pattern discovery, Proceedings of the VLDB Endowment, 17, 4, pp. 703-727, (2008)
[10]  
Loukides G, Gkoulalas-Divanis A, Shao Jian-Hua, Efficient and flexible anonymization of transaction data, Knowledge and Information Systems, 36, 1, pp. 153-210, (2013)