Hash-Indexing Block-Based Deduplication Algorithm for Reducing Storage in the Cloud

Cited by: 0
Authors
Viji D. [1]
Revathy S. [1]
Affiliation
[1] Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai
Source
Computer Systems Science and Engineering | 2023 / Vol. 46 / No. 1
Keywords
Cloud computing; cloud storage; deduplication; hash indexing; record linkage; relational content analysis; document clustering
DOI
10.32604/csse.2023.030259
Abstract
Cloud storage is essential for storing and retrieving user data from distributed data centres. The storage service is offered as a paid service, priced by the amount of data stored. Because data centres hold massive amounts of data in which similar information and file structures are kept in multiple copies, duplication increases the storage space consumed. Existing deduplication systems do not reduce data efficiently because they identify similar data inaccurately, which drives up storage consumption and cost. To resolve this problem, this paper proposes an efficient storage-reduction method called Hash-Indexing Block-based Deduplication (HIBD), based on Segmented Bind Linkage (SBL) methods, for reducing storage in a cloud environment. Initially, files are preprocessed using the sparse augmentation technique. The preprocessed files are then segmented into blocks to build a hash index. The blocks of each file are compared with those of other files through Semantic Content Source Deduplication (SCSD), which identifies similar content shared between files. Based on the count of shared content, the Distance Vector Weightage Correlation (DVWC) estimates a document similarity weight, and related files are grouped into a cluster. Finally, the segmented bind linkage compares the documents within a cluster to find duplicate content, using the similarity weight based on the coefficient match case. This approach helps identify data redundancy efficiently and reduces the service cost in distributed cloud storage. © 2023 CRL Publishing. All rights reserved.
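The block-segmentation and hash-index steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' HIBD implementation: the fixed block size, the use of SHA-256, the `BlockStore` class and all of its method names are assumptions made for illustration, and the Jaccard overlap of block digests is only a simple stand-in for a similarity weight, not the DVWC measure named in the paper.

```python
import hashlib


def split_blocks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Toy hash-indexed block store: each unique block is kept once.

    Hypothetical helper for illustration only; not the paper's HIBD code.
    """

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.index: dict[str, bytes] = {}        # digest -> block contents
        self.recipes: dict[str, list[str]] = {}  # file name -> ordered digests

    def add_file(self, name: str, data: bytes) -> int:
        """Store a file and return how many of its blocks were new."""
        new_blocks = 0
        recipe = []
        for block in split_blocks(data, self.block_size):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.index:  # unseen content: keep one copy
                self.index[digest] = block
                new_blocks += 1
            recipe.append(digest)
        self.recipes[name] = recipe
        return new_blocks

    def restore(self, name: str) -> bytes:
        """Rebuild a file from its ordered list of block digests."""
        return b"".join(self.index[d] for d in self.recipes[name])

    def stored_bytes(self) -> int:
        """Total size of the unique blocks actually kept."""
        return sum(len(block) for block in self.index.values())

    def shared_block_ratio(self, a: str, b: str) -> float:
        """Jaccard overlap of two files' block digests (assumed stand-in)."""
        sa, sb = set(self.recipes[a]), set(self.recipes[b])
        union = sa | sb
        return len(sa & sb) / len(union) if union else 1.0


if __name__ == "__main__":
    store = BlockStore(block_size=4)
    store.add_file("a.txt", b"AAAABBBBCCCC")
    store.add_file("b.txt", b"AAAABBBBDDDD")
    print(store.stored_bytes())                        # unique bytes kept
    print(store.shared_block_ratio("a.txt", "b.txt"))  # block-level overlap
```

In this sketch the two 12-byte files share two of their three blocks, so only four unique blocks are stored instead of six. Fixed-size blocking is the simplest choice; whether the paper uses fixed-size or content-defined blocks is not stated in the abstract.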
Pages: 27-42
Page count: 15