Good to the Last Bit: Data-Driven Encoding with CodecDB

Cited by: 23
Authors
Jiang, Hao [1]
Liu, Chunwei [1]
Paparrizos, John [1]
Chien, Andrew A. [1]
Ma, Jihong [2]
Elmore, Aaron J. [1]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] Alibaba, Hangzhou, Peoples R China
Funding
U.S. National Science Foundation
Keywords
COMPRESSION; ALGORITHM;
DOI
10.1145/3448016.3457283
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar databases are encoding-oblivious. When storing the data, these systems rely on a global understanding of the dataset or the data types to derive simple rules for encoding selection. Such rule-based selection leads to unsatisfactory performance. Specifically, when performing queries, the systems always decode data into memory, ignoring the possibility of optimizing access to encoded data. We develop CodecDB, an encoding-aware columnar database, to demonstrate the benefit of tightly coupling the database design with the data encoding schemes. CodecDB chooses, in a principled manner, the most efficient encoding for a given data column and relies on encoding-aware query operators to optimize access to encoded data. Storage-wise, CodecDB achieves on average 90% accuracy for selecting the best encoding and improves the compression ratio by up to 40% compared to the state-of-the-art encoding selection solution. Query-wise, CodecDB is on average one order of magnitude faster than the latest open-source and commercial columnar databases on the TPC-H benchmark, and on average 3x faster than a recent research project on the Star-Schema Benchmark (SSB).
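As a rough illustration of what "optimizing access to encoded data" can mean, the following minimal sketch evaluates an equality predicate directly on dictionary codes instead of decoding every value first. It is not CodecDB's implementation: the choice of dictionary encoding, the function names `dictionary_encode` and `filter_equals_encoded`, and the sample column are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Map each distinct value to a small integer code (hypothetical helper)."""
    dictionary: Dict[str, int] = {}
    codes: List[int] = []
    for value in column:
        # A value seen for the first time gets the next unused code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def filter_equals_encoded(
    dictionary: Dict[str, int], codes: List[int], target: str
) -> List[int]:
    """Return row indices where the column equals `target`.

    The target is translated to its code once; the scan then compares
    integers only and never materializes the decoded strings.
    """
    target_code = dictionary.get(target)
    if target_code is None:
        # The value does not occur in the column, so no row can match.
        return []
    return [row for row, code in enumerate(codes) if code == target_code]


# Made-up sample column, for illustration only.
column = ["US", "CN", "US", "DE", "US"]
dictionary, codes = dictionary_encode(column)
matching_rows = filter_equals_encoded(dictionary, codes, "US")
```

An encoding-oblivious operator would instead decode all five strings and compare each against the target; the sketch does one dictionary lookup followed by integer comparisons.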
Pages: 843-856
Page count: 14