Generalized Deduplication: Lossless Compression by Clustering Similar Data

被引:7
作者
Talasila, Prasad [1 ]
Lucani, Daniel E. [1 ]
机构
[1] Aarhus Univ, Dept Engn, DIGIT, Aarhus, Denmark
来源
PROCEEDING OF THE 2019 IEEE 8TH INTERNATIONAL CONFERENCE ON CLOUD NETWORKING (CLOUDNET) | 2019年
关键词
generalized deduplication; golomb rice; geometric distribution; data deduplication;
D O I
10.1109/cloudnet47604.2019.9064140
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
This paper proposes generalized deduplication, a concept where similar data is systematically deduplicated by first transforming chunks of each file into two parts: a basis and a deviation. This increases the potential for compression as more chunks can have a common basis that can be deduplicated by the system. The deviation is kept small and stored together with an identifier to its chunk, e.g., hash of a chunk, in order to recover the original data without errors or distortions. This paper characterizes the performance of generalized deduplication using Golomb-Rice codes as a suitable data transform function to discover similarities across all files stored in the system. Considering different synthetic data distributions, we show in theory and simulations that generalized deduplication can result in compression factors of 300 (high compression), i.e., 300 times less storage space, and that this compression is achieved with 60, 000 times fewer data chunks inserted into the system compared to classic deduplication (compression gains start earlier). Finally, we show that the table/registry to recognize similar chunks is 10, 000 times smaller for generalized deduplication compared to the table in classic deduplication techniques, which will result in less RAM usage in the storage system.
引用
收藏
页数:4
相关论文
共 15 条
[1]  
Douglis F, 2003, USENIX ASSOCIATION PROCEEDINGS OF THE GENERAL TRACK, P113
[2]   OPTIMAL SOURCE CODES FOR GEOMETRICALLY DISTRIBUTED INTEGER ALPHABETS [J].
GALLAGER, RG ;
VANVOORHIS, DC .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1975, 21 (02) :228-230
[3]   RUN-LENGTH ENCODINGS [J].
GOLOMB, SW .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1966, 12 (03) :399-+
[4]  
Gonzalez C., 2013, ZFS DEDUPE NOT DEDUP
[5]   A METHOD FOR THE CONSTRUCTION OF MINIMUM-REDUNDANCY CODES [J].
HUFFMAN, DA .
PROCEEDINGS OF THE INSTITUTE OF RADIO ENGINEERS, 1952, 40 (09) :1098-1101
[6]  
Kulkarni P., USENIX A TECH C
[7]  
Liebchen T, 2004, IEEE DATA COMPR CONF, P439
[8]  
Quinlan S, 2002, USENIX ASSOCIATION PROCEEDINGS OF THE FAST'02 CONFERENCE ON FILE AND STORAGE TECHNOLOGIES, P89
[9]   ADAPTIVE VARIABLE-LENGTH CODING FOR EFFICIENT COMPRESSION OF SPACECRAFT TELEVISION DATA [J].
RICE, RF ;
PLAUNT, JR .
IEEE TRANSACTIONS ON COMMUNICATION TECHNOLOGY, 1971, CO19 (06) :889-&
[10]  
Sarama J, 2009, STUD MATH THINK LEAR, P101