Good to the Last Bit: Data-Driven Encoding with CodecDB

Cited by: 23
Authors
Jiang, Hao [1]
Liu, Chunwei [1]
Paparrizos, John [1]
Chien, Andrew A. [1]
Ma, Jihong [2]
Elmore, Aaron J. [1]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] Alibaba, Hangzhou, People's Republic of China
Funding
National Science Foundation (USA);
Keywords
COMPRESSION; ALGORITHM;
DOI
10.1145/3448016.3457283
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar databases are encoding-oblivious. When storing the data, these systems rely on a global understanding of the dataset or the data types to derive simple rules for encoding selection. Such rule-based selection leads to unsatisfactory performance. Specifically, when performing queries, the systems always decode data into memory, ignoring the possibility of optimizing access to encoded data. We develop CodecDB, an encoding-aware columnar database, to demonstrate the benefit of tightly-coupling the database design with the data encoding schemes. CodecDB chooses in a principled manner the most efficient encoding for a given data column and relies on encoding-aware query operators to optimize access to encoded data. Storage-wise, CodecDB achieves on average 90% accuracy for selecting the best encoding and improves the compression ratio by up to 40% compared to the state-of-the-art encoding selection solution. Query-wise, CodecDB is on average one order of magnitude faster than the latest open-source and commercial columnar databases on the TPC-H benchmark, and on average 3x faster than a recent research project on the Star-Schema Benchmark (SSB).
Pages: 843-856
Page count: 14