Improving semistatic compression via phrase-based modeling

Cited by: 2
Authors
Brisaboa, Nieves R. [1 ]
Farina, Antonio [1 ]
Navarro, Gonzalo [2 ]
Parama, Jose R. [1 ]
Affiliations
[1] Univ A Coruna, Database Lab, Fac Informat, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
Text compression; Direct search; ALGORITHM;
DOI
10.1016/j.ipm.2011.01.006
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step beyond by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms. (C) 2011 Elsevier Ltd. All rights reserved.
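The abstract names End-Tagged Dense Code as the coding scheme but does not spell it out. The following is a minimal sketch of how such a byte-oriented dense code is commonly described, not the authors' implementation: symbols are ranked by decreasing frequency, and the codeword for a rank is a byte sequence in which only the last byte has its high bit set. The function names (`etdc_encode`, `etdc_decode`, `build_codebook`) and the toy text are assumptions made for illustration. The sketch uses single words as source symbols; the pair and phrase selection performed by the PETDC and PhETDC modelers is not reproduced here.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Return the codeword for the symbol with 0-based frequency rank `rank`.

    The last byte is in 128..255 (high bit set, marking the codeword end);
    every earlier byte is in 0..127.
    """
    out = [128 + rank % 128]      # final byte carries the end tag
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense assignment: no byte combination is wasted
        out.append(rank % 128)    # continuation byte, high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode: recover the rank from one complete codeword."""
    rank = 0
    for b in code[:-1]:           # continuation bytes
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def build_codebook(tokens):
    """Semistatic first pass: rank symbols by frequency and assign codewords."""
    ranked = [sym for sym, _ in Counter(tokens).most_common()]
    return {sym: etdc_encode(i) for i, sym in enumerate(ranked)}


# Toy example (hypothetical input): second pass replaces each word by its codeword.
tokens = "to be or not to be".split()
codebook = build_codebook(tokens)
compressed = b"".join(codebook[t] for t in tokens)
```

Because the high bit marks the end of every codeword, codeword boundaries can be found from any position in the compressed stream, which is what makes random access and byte-wise pattern matching (for example with Boyer-Moore-type searches) practical over text compressed this way.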
Pages: 545-559
Number of pages: 15