The smallest grammar problem

被引:224
作者
Charikar, M [1 ]
Lehman, E
Liu, D
Panigrahy, R
Prabhakaran, M
Sahai, A
Shelat, A
机构
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08540 USA
[2] MIT, CSAIL, Cambridge, MA 02139 USA
[3] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[4] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90095 USA
关键词
approximation algorithm; data compression; hardness of approximation; LONGEST MATCH; LZ77; LZ78; multilevel pattern matching (MPM); RE-PAIR; SEQUITUR; smallest grammar problem;
D O I
10.1109/TIT.2005.850116
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string sigma? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for. the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst case behavior) to establish provable performance guarantees. and to address shortcomings in the classical measure of redundancy in the literature. Our first results are concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log(3) n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.
引用
收藏
页码:2554 / 2576
页数:23
相关论文
共 40 条
  • [1] [Anonymous], THESIS MIT
  • [2] [Anonymous], 1988, DATA COMPRESSION MET
  • [3] Apostolico A., 2000, Proceedings DCC 2000. Data Compression Conference, P143, DOI 10.1109/DCC.2000.838154
  • [4] Off-line compression by greedy textual substitution
    Apostolico, A
    Lonardi, S
    [J]. PROCEEDINGS OF THE IEEE, 2000, 88 (11) : 1733 - 1744
  • [5] Some theory and practice of greedy off-line textual substitution
    Apostolico, A
    Lonardi, S
    [J]. DCC '98 - DATA COMPRESSION CONFERENCE, 1998, : 119 - 128
  • [6] BERMAN P, 1998, TR98065 EL C COMP CO
  • [7] BLUM A, 1991, P 23 ANN ACM S THEOR, P328
  • [8] BRAYTON RK, 1982, P IEEE INT S CIRC SY, P49
  • [9] BRAYTON RK, 1987, P INT C COMP AID DES, P66
  • [10] BRYANT RE, 1986, IEEE T COMPUT, V35, P677, DOI 10.1109/TC.1986.1676819