Leach: an automatic learning cache for inline primary deduplication system

Cited by: 11
Authors
Lin, Bin [1]
Li, Shanshan [1]
Liao, Xiangke [1]
Zhang, Jing [1]
Liu, Xiaodong [1]
Affiliations
[1] Natl Univ Def Technol, Sch Comp, Changsha 410073, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
deduplication; duplicate detection; splay tree; cache
DOI
10.1007/s11704-014-3377-2
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Deduplication technology is increasingly used to reduce storage costs. Although it has been applied successfully to backup and archival systems, existing techniques can hardly be deployed in primary storage systems because of the latency of detecting duplicated data: every unit has to be checked against a substantially large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost in deduplication systems. Leach is motivated by a characteristic of real-world I/O workloads: the access patterns of duplicated data are highly skewed. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of servicing the majority of duplicate-detection requests. Leveraging the working set property, Leach provides optimizations to reduce the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms the conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without a great impact on cache hits.
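The idea of letting a splay tree "learn" a skewed access pattern can be illustrated with a minimal sketch. This is not Leach's implementation: the abstract describes an on-disk index with a separate in-memory cache, whereas the sketch below keeps everything in memory. The class name `FingerprintIndex`, the `write` method, and the use of SHA-1 as the fingerprint function are all assumptions made for illustration.

```python
import hashlib


class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def _rotate_right(x):
    y = x.left
    x.left = y.right
    y.right = x
    return y


def _rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    return y


def splay(root, key):
    """Move key, or the last node on its search path, to the root.

    Recursive for brevity; a production version would be iterative.
    """
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:        # zig-zig: left-left
            root.left.left = splay(root.left.left, key)
            root = _rotate_right(root)
        elif key > root.left.key:      # zig-zag: left-right
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = _rotate_left(root.left)
        return root if root.left is None else _rotate_right(root)
    if root.right is None:
        return root
    if key > root.right.key:           # zig-zig: right-right
        root.right.right = splay(root.right.right, key)
        root = _rotate_left(root)
    elif key < root.right.key:         # zig-zag: right-left
        root.right.left = splay(root.right.left, key)
        if root.right.left is not None:
            root.right = _rotate_right(root.right)
    return root if root.right is None else _rotate_left(root)


class FingerprintIndex:
    """Hypothetical in-memory stand-in for a fingerprint index."""

    def __init__(self):
        self.root = None

    def write(self, block: bytes) -> bool:
        """Return True if the block is a duplicate, else record it."""
        fp = hashlib.sha1(block).digest()  # fingerprint choice is an assumption
        self.root = splay(self.root, fp)
        if self.root is not None and self.root.key == fp:
            return True                    # duplicate: fingerprint is now at the root
        node = Node(fp)
        if self.root is not None:
            # After splaying, the root is the predecessor or successor of fp,
            # so the tree can be split around the new node.
            if fp < self.root.key:
                node.right = self.root
                node.left = self.root.left
                self.root.left = None
            else:
                node.left = self.root
                node.right = self.root.right
                self.root.right = None
        self.root = node
        return False
```

Each lookup splays the accessed fingerprint to the root, so frequently duplicated blocks tend to stay near the top of the tree. Under this sketch's assumptions, the upper levels of the tree play the role of the hot working set that a cache would want to hold.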
Pages: 175-183
Page count: 9
Related papers
14 in total
[1] [Anonymous], P 10 US C FIL STOR T
[2] [Anonymous], TECHNICAL REPORT
[3] [Anonymous], 2009, ACM INT C P SERIES
[4] [Anonymous], 2009, 7 USENIX C FIL STOR
[5] [Anonymous], P 5 ANN INT SYST STO
[6] [Anonymous], P 6 US C FIL STOR TE
[7] Bhagwat D, 2009, P 2009 IEEE INT S MO, P1
[8] Geer, David. Reducing the Storage Burden via Data Deduplication [J]. COMPUTER, 2008, 41 (12): 15-17
[9] Koller, Ricardo; Rangaswami, Raju. I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance [J]. ACM TRANSACTIONS ON STORAGE, 2010, 6 (03)
[10] Meyer, Dutch T.; Bolosky, William J. A Study of Practical Deduplication [J]. ACM TRANSACTIONS ON STORAGE, 2012, 7 (04)