Improving duplicate elimination in storage systems

Cited by: 72
Authors
Purdue University [1]
Not specified [2]
Not specified [3]
Not specified [4]
Institutions
[1] Department of Computer Science, Purdue University, West Lafayette
[2] NEC Laboratories America, Princeton, NJ 08540
Source
ACM Transactions on Storage | 2006, Vol. 2, No. 4, pp. 424-448
Keywords
Content-based addressing; Duplicate elimination; Rabin's fingerprints; Storage management
DOI
10.1145/1210596.1210599
Abstract
Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.

Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques. © 2006 ACM.
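The abstract and keywords refer to chunk-based duplicate elimination with content-based addressing and Rabin-style fingerprints. The sketch below illustrates that general technique only, not the paper's fingerdiff scheme: a rolling hash places chunk boundaries based on content, and a chunk is written only if its content hash has not been seen before. The window size, boundary mask, polynomial rolling hash, and the store/index helpers are all illustrative assumptions.

```python
import hashlib
import random

# Minimal sketch of content-defined chunking plus hash-based duplicate
# detection, the general idea behind the abstract's "object partitioning".
# NOT the paper's fingerdiff algorithm; window size, boundary mask, and the
# simple polynomial rolling hash are assumptions made for illustration.

WINDOW = 48            # bytes in the rolling-hash window (assumed)
MASK = (1 << 12) - 1   # expected average chunk size ~4 KiB (assumed)
PRIME = 257
MOD = 1 << 61
POWER = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunk(data: bytes):
    """Split data into variable-size chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # slide the window: remove the contribution of the oldest byte
            h = (h - data[i - WINDOW] * POWER) % MOD
        h = (h * PRIME + b) % MOD
        # declare a boundary when the low bits of the hash are zero,
        # subject to a minimum chunk size of one window
        if (h & MASK) == 0 and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store(data: bytes, index: dict) -> int:
    """Store only chunks whose content hash is new; return bytes written."""
    new_bytes = 0
    for c in chunk(data):
        digest = hashlib.sha1(c).hexdigest()  # content-based address
        if digest not in index:
            index[digest] = c
            new_bytes += len(c)
    return new_bytes


if __name__ == "__main__":
    random.seed(0)
    shared = bytes(random.getrandbits(8) for _ in range(200_000))
    v1 = b"version 1 header" + shared
    v2 = b"version 2 header, slightly longer" + shared  # small edit up front

    index = {}
    print("v1 new bytes:", store(v1, index))  # everything is new
    print("v2 new bytes:", store(v2, index))  # far fewer: most chunks reused
```

Under these assumptions, an edit near the front of an object perturbs only the chunk containing the edit; because later boundaries are computed from content alone, they re-synchronize with the previous version, so the remaining chunks deduplicate against already-stored data.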
Pages: 424-448
Number of pages: 24