Practical String Dictionary Compression Using String Dictionary Encoding

被引:23
作者
Kanda, Shunsuke [2 ]
Morita, Kazuhiro [1 ]
Fuketa, Masao [1 ]
机构
[1] Tokushima Univ, Grad Sch Adv Technol & Sci, Tokushima, Japan
[2] Japan Soc Promot Sci, Tokyo, Japan
来源
2017 3RD INTERNATIONAL CONFERENCE ON BIG DATA INNOVATIONS AND APPLICATIONS (INNOVATE-DATA) | 2017年
关键词
ALGORITHM;
D O I
10.1109/Innovate-Data.2017.9
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A string dictionary is a data structure for storing a set of strings that maps them to unique IDs. It can manage string data in compact space by encoding them into integers. However, instances have recently emerged in practice where the size of string dictionaries has become a critical problem for very large datasets in many applications. A number of compressed string dictionaries have been proposed as a solution. In particular, the application of Re-Pair, a powerful text compression technique, to tries and front coding can help to obtain compact string dictionaries that support fast dictionary operations. However, the cost of constructing such dictionaries using Re-Pair is impractical for large datasets. In this paper, we propose an alternative compression strategy using string dictionary encoding and develop several dictionary structures for it. We show that our string dictionaries can be constructed up to 422.5x faster than the Re-Pair versions with competitive space and operation speed, through experiments on real-world datasets.
引用
收藏
页码:1 / 8
页数:8
相关论文
共 31 条
[1]  
[Anonymous], 1998, SORTING SEARCHING
[2]   A trie compaction algorithm for a large set of keys [J].
Aoe, J ;
Morimoto, K ;
Shishibori, M ;
Park, KH .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1996, 8 (03) :476-491
[3]   AN EFFICIENT DIGITAL SEARCH ALGORITHM BY USING A DOUBLE-ARRAY STRUCTURE [J].
AOE, JI .
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1989, 15 (09) :1066-1077
[4]  
Arroyuelo D., 2010, P WORKSH ALG ENG EXP, P84
[5]   LZ-Compressed String Dictionaries [J].
Arz, Julian ;
Fiseher, Johannes .
2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014, :322-331
[6]   Representing trees of higher degree [J].
Benoit, D ;
Demaine, ED ;
Munro, JI ;
Raman, R ;
Raman, V ;
Rao, SS .
ALGORITHMICA, 2005, 43 (04) :275-292
[7]   UbiCrawler: a scalable fully distributed Web crawler [J].
Boldi, P ;
Codenotti, B ;
Santini, M ;
Vigna, S .
SOFTWARE-PRACTICE & EXPERIENCE, 2004, 34 (08) :711-726
[8]   The smallest grammar problem [J].
Charikar, M ;
Lehman, E ;
Liu, D ;
Panigrahy, R ;
Prabhakaran, M ;
Sahai, A ;
Shelat, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2005, 51 (07) :2554-2576
[9]   Fast and Compact Web Graph Representations [J].
Claude, Francisco ;
Navarro, Gonzalo .
ACM TRANSACTIONS ON THE WEB, 2010, 4 (04)
[10]  
Delpratt O, 2006, LECT NOTES COMPUT SC, V4007, P134