Learning-based short text compression using BERT models

Cited by: 0
Authors
Ozturk, Emir [1 ]
Mesut, Altan [1 ]
Affiliations
[1] Trakya Univ, Dept Comp Engn, Edirne, Turkiye
Keywords
BERT; Fine tuning; Learning-based compression; Text compression;
DOI
10.7717/peerj-cs.2423
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios than traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective on short texts than on longer ones. This study introduces MLMCompress, a word-based text compression method that can utilize any BERT masked language model. The performance of MLMCompress is evaluated with four BERT models: two large models and two smaller models referred to as "tiny". The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieves 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst case; in the best case it improves compression speed by 20% and decompression speed by 180%. Furthermore, MLMCompress outperforms traditional compression methods such as Gzip and specialized short text compression methods such as Smaz and Shoco, particularly on short texts, even when using the smaller models.
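The core idea of masked-language-model compression can be illustrated with a toy sketch (this is not the authors' exact MLMCompress algorithm, whose details are in the paper): for each word, a predictor proposes a ranked candidate list given the preceding context; if the word appears in the list, its small integer rank is emitted instead of the word itself, which a subsequent entropy coder can store cheaply. A real system would plug a BERT masked language model into `predict`; here a deterministic stub stands in so the roundtrip is self-contained.

```python
# Toy rank-based predictive compression sketch. The predictor below is a
# stand-in for a BERT masked language model; `K` is an assumed parameter.
from typing import List, Union

K = 8  # size of the ranked candidate list (illustrative choice)

def predict(context: List[str]) -> List[str]:
    """Stand-in for an MLM: return K candidate words ranked by likelihood.
    A real implementation would mask the next position and query BERT."""
    return ["the", "of", "and", "to", "a", "in", "is", "text"][:K]

def encode(words: List[str]) -> List[Union[int, str]]:
    """Replace each word by its rank in the predictor's list when possible,
    otherwise keep the literal word."""
    out: List[Union[int, str]] = []
    for i, w in enumerate(words):
        ranked = predict(words[:i])
        out.append(ranked.index(w) if w in ranked else w)
    return out

def decode(symbols: List[Union[int, str]]) -> List[str]:
    """Rebuild the text: the decoder runs the same predictor on the same
    context, so a rank deterministically maps back to the original word."""
    words: List[str] = []
    for s in symbols:
        ranked = predict(words)
        words.append(ranked[s] if isinstance(s, int) else s)
    return words
```

Because encoder and decoder share the same deterministic predictor, no candidate lists need to be transmitted; only the ranks and the occasional literal word do. With a strong language model most words land at low ranks, which is where the compression gain comes from.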
Pages: 23