A general-purpose compression scheme for large collections

被引:10
|
作者
Cannane, A [1 ]
Williams, HE [1 ]
机构
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
关键词
algorithms; performance; phrase-based compression; random access; sampling;
D O I
10.1145/568727.568730
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called XRAY that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of XRAY are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The XRAY scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, XRAY can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that XRAY is effective on both text and general-purpose collections. In general, XRAY is more effective than the popular GZIP and COMPRESS schemes, while being marginally less effective than BZIP2. We also show that XRAY is efficient: of the popular schemes we tested, it is typically only slower than GZIP in decompression. Moreover, the query evaluation costs of retrieval of documents from a large collection with our search engine is improved by more than 30% when XRAY is incorporated compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single pass compression of the collection. We conclude that with these novel approaches XRAY is a fast and effective scheme for compression and decompression of large general-purpose collections.
引用
收藏
页码:329 / 355
页数:27
相关论文
共 50 条
  • [1] A general-purpose compression scheme for databases
    Cannane, A
    Williams, HE
    Zobel, J
    DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS, 1999, : 519 - 519
  • [2] General-purpose compression for efficient retrieval
    Cannane, A
    Williams, HE
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2001, 52 (05): : 430 - 437
  • [3] AN ARRAY PROCESSOR FOR GENERAL-PURPOSE DIGITAL IMAGE COMPRESSION
    YATES, RB
    THACKER, NA
    EVANS, SJ
    WALKER, SN
    IVEY, PA
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 1995, 30 (03) : 244 - 250
  • [4] Using general-purpose compression algorithms for music analysis
    Louboutin, Corentin
    Meredith, David
    JOURNAL OF NEW MUSIC RESEARCH, 2016, 45 (01) : 1 - 16
  • [5] General-purpose detectors for the large hadron collider
    Froidevaux, Daniel
    Sphicas, Paris
    ANNUAL REVIEW OF NUCLEAR AND PARTICLE SCIENCE, 2006, 56 : 375 - 440
  • [6] FDup: a framework for general-purpose and efficient entity deduplication of record collections
    De Bonis, Michele
    Manghi, Paolo
    Atzori, Claudio
    PEERJ COMPUTER SCIENCE, 2022, 8
  • [7] FDup: a framework for general-purpose and efficient entity deduplication of record collections
    De Bonis M.
    Manghi P.
    Atzori C.
    PeerJ Computer Science, 2022, 8
  • [8] General-purpose definition
    Emerson, DM
    DATAMATION, 1995, 41 (23): : 14 - 14
  • [9] General-purpose cells?
    Solter, D
    Gearhart, J
    RECHERCHE, 1999, (320): : 32 - 34
  • [10] A GENERAL-PURPOSE MACROGENERATOR
    STRACHEY, C
    COMPUTER JOURNAL, 1965, 8 (03): : 225 - 241