Learning-based short text compression using BERT models

Cited by: 0
Authors
Ozturk, Emir [1 ]
Mesut, Altan [1 ]
Affiliations
[1] Trakya Univ, Dept Comp Engn, Edirne, Turkiye
Keywords
BERT; Fine tuning; Learning-based compression; Text compression;
DOI
10.7717/peerj-cs.2423
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios than traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective on short texts than on longer ones. This study introduces MLMCompress, a word-based text compression method that can utilize any BERT masked language model. The performance of MLMCompress is evaluated with four BERT models: two large models and two smaller models referred to as "tiny". The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieves 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst case; in the best case it improves compression speed by 20% and decompression speed by 180%. Furthermore, MLMCompress outperforms traditional compression methods such as Gzip and specialized short text compression methods such as Smaz and Shoco, particularly on short texts, even when using the smaller models.
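The core idea of masked-language-model compression can be illustrated with a toy sketch (this is not the authors' exact MLMCompress algorithm, whose details are in the paper): for each word, a predictor proposes a ranked candidate list given the preceding context; if the word appears in the list, its small integer rank is emitted instead of the word itself, which a subsequent entropy coder can store cheaply. A real system would plug a BERT masked language model into `predict`; here a deterministic stub stands in so the roundtrip is self-contained.

```python
# Toy rank-based predictive compression sketch. The predictor below is a
# stand-in for a BERT masked language model; `K` is an assumed parameter.
from typing import List, Union

K = 8  # size of the ranked candidate list (illustrative choice)

def predict(context: List[str]) -> List[str]:
    """Stand-in for an MLM: return K candidate words ranked by likelihood.
    A real implementation would mask the next position and query BERT."""
    return ["the", "of", "and", "to", "a", "in", "is", "text"][:K]

def encode(words: List[str]) -> List[Union[int, str]]:
    """Replace each word by its rank in the predictor's list when possible,
    otherwise keep the literal word."""
    out: List[Union[int, str]] = []
    for i, w in enumerate(words):
        ranked = predict(words[:i])
        out.append(ranked.index(w) if w in ranked else w)
    return out

def decode(symbols: List[Union[int, str]]) -> List[str]:
    """Rebuild the text: the decoder runs the same predictor on the same
    context, so a rank deterministically maps back to the original word."""
    words: List[str] = []
    for s in symbols:
        ranked = predict(words)
        words.append(ranked[s] if isinstance(s, int) else s)
    return words
```

Because encoder and decoder share the same deterministic predictor, no candidate lists need to be transmitted; only the ranks and the occasional literal word do. With a strong language model most words land at low ranks, which is where the compression gain comes from.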
Pages: 23