Representing a Model for the Anonymization of Big Data Stream Using In-Memory Processing

Cited by: 0
Authors
Shamsinejad E. [1]
Banirostam T. [1]
Pedram M.M. [2]
Rahmani A.M. [3]
Affiliations
[1] Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran
[2] Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran
[3] Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran
Keywords
Anonymity; Big data; Confidentiality; Data disclosure; Privacy
DOI
10.1007/s40745-024-00556-x
Abstract
In light of the escalating privacy risks in the big data era, this paper introduces an innovative model for the anonymization of big data streams, leveraging in-memory processing within the Spark framework. The approach is founded on the principle of K-anonymity and propels the field forward by critically evaluating various anonymization methods and algorithms, benchmarking their performance with respect to time and space complexities. A distinctive formula for optimized cluster determination in the K-means algorithm is presented, along with a novel tuple expiration time strategy for the efficient purging of clusters. The integration of these components into Spark’s RDD and MLlib modules results in a significant decrease in execution time and data loss rates, even with increasing data volumes. The paper’s notable contributions are its methodological advancements that offer a robust, scalable solution for data anonymization, safeguarding user privacy without sacrificing data utility or processing efficiency. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
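The K-anonymity principle named in the abstract can be illustrated with a minimal sketch. This is not the paper's Spark-based model or its clustering formula; the record fields (`age`, `zip`, `diagnosis`), the helper names, and the fixed-width age ranges are all hypothetical and chosen only to show the idea that every combination of quasi-identifier values must be shared by at least k records.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


def generalize_age(record, width=10):
    """Replace an exact age with a fixed-width range (hypothetical generalization step)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Illustrative records only; not data from the paper.
records = [
    {"age": 23, "zip": "131", "diagnosis": "flu"},
    {"age": 27, "zip": "131", "diagnosis": "asthma"},
    {"age": 34, "zip": "145", "diagnosis": "flu"},
    {"age": 38, "zip": "145", "diagnosis": "diabetes"},
]

# Exact ages make every record unique, so the raw data is not 2-anonymous.
raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)

# After coarsening ages into ranges, each (age, zip) group holds two records.
generalized = [generalize_age(r) for r in records]
gen_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

Coarsening the quasi-identifiers is what trades data utility for privacy: wider ranges make groups larger but lose more information, which is the loss the abstract says the proposed model aims to keep low.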
Pages: 223-252
Number of pages: 29