A New Method for Short Text Compression

Times cited: 2
Authors
Aslanyurek, Murat [1]
Mesut, Altan [2]
Affiliations
[1] Kirklareli Univ, Pinarhisar Vocat Sch, Comp Programming Program, TR-39300 Kirklareli, Turkiye
[2] Trakya Univ, Comp Engn Dept, TR-22100 Edirne, Turkiye
Keywords
Machine learning; Text categorization; Text compression; k-means; Clustering; Language identification
DOI
10.1109/ACCESS.2023.3340436
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. To achieve high compression ratios, choosing a static dictionary suited to the text being compressed is an important problem that must be solved. This study proposes WSDC (Word-based Static Dictionary Compression), a method that can compress short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries used by this method. The number of static dictionaries created can vary, because the k-Means clustering algorithm is run iteratively according to a set of rules. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries can compress the source text at the best ratio. Wikipedia article abstracts in 6 different languages were used as the dataset in the experiments. WSDC is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special-purpose methods for compressing short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and better ratios than all other methods except Zstd for short texts smaller than 1000 bytes.
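The general idea of word-based compression with a selectable static dictionary can be illustrated with a toy sketch. This is not the authors' WSDC or DSWF implementation: the two tiny word lists, the byte layout (one dictionary-id byte, one byte per known word, an escape byte plus length for unknown words) and the coverage-count selection rule are all assumptions made for illustration, since the abstract does not specify these details.

```python
# Toy sketch only: dictionaries, byte layout, and scoring rule are assumptions,
# not the method described in the paper.
DICTIONARIES = [
    ["the", "of", "and", "is", "a", "in", "city", "was"],        # hypothetical English list
    ["der", "die", "und", "ist", "eine", "in", "stadt", "war"],  # hypothetical German list
]

ESCAPE = 255  # marks a word that is not in the chosen dictionary


def select_dictionary(words):
    """Return the index of the dictionary that contains the most words of the text."""
    scores = [sum(w in d for w in words) for d in DICTIONARIES]
    return scores.index(max(scores))  # ties go to the first dictionary


def compress(text):
    """Encode text as: dictionary id, then one code (or escape + literal) per word."""
    words = text.split(" ")
    dict_id = select_dictionary(words)
    index = {w: i for i, w in enumerate(DICTIONARIES[dict_id])}
    out = bytearray([dict_id])
    for w in words:
        if w in index:
            out.append(index[w])  # known word: a single byte
        else:
            raw = w.encode("utf-8")  # unknown word: escape, length, literal bytes
            out += bytes([ESCAPE, len(raw)]) + raw  # assumes words under 256 bytes
    return bytes(out)


def decompress(data):
    """Invert compress(): read the dictionary id, then decode each word."""
    dictionary = DICTIONARIES[data[0]]
    words, pos = [], 1
    while pos < len(data):
        code = data[pos]
        if code == ESCAPE:
            length = data[pos + 1]
            words.append(data[pos + 2 : pos + 2 + length].decode("utf-8"))
            pos += 2 + length
        else:
            words.append(dictionary[code])
            pos += 1
    return " ".join(words)


sample = "the city of Edirne is in Turkiye"
packed = compress(sample)
print(len(sample.encode("utf-8")), len(packed), decompress(packed) == sample)
```

In this sketch every dictionary word costs one byte and every unknown word costs two bytes more than its literal length, so the saving depends entirely on how well the selected dictionary covers the text. That dependence is the reason dictionary selection matters for short inputs.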
Pages: 141022-141035
Number of pages: 14
Related papers
50 in total
  • [1] Short Text Compression for Smart Devices
    Islam, Md. Rafiqul
    Rajon, S. A. Ahsan
    Podder, Anonda
    2008 11TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY: ICCIT 2008, VOLS 1 AND 2, 2008, : 126 - +
  • [2] Compression of Short Text on Embedded Systems
    Rein, Stephan
    Guhmann, Clemens
    Fitzek, Frank
    JOURNAL OF COMPUTERS, 2006, 1 (06) : 1 - 10
  • [3] Rapid lossless compression of short text messages
    Kalajdzic, Kenan
    Ali, Samaher Hussein
    Patel, Ahmed
    COMPUTER STANDARDS & INTERFACES, 2015, 37 : 53 - 59
  • [4] A New Feature Selection Method for Sentiment Analysis in Short Text
    Kumar, H. M. Keerthi
    Harish, B. S.
    JOURNAL OF INTELLIGENT SYSTEMS, 2020, 29 (01) : 1122 - 1134
  • [5] A New Classificaiton Method for Short Text Based on SLAS and CART
    Yin, Chunyong
    Xiang, Jun
    Zhang, Hui
    Wang, Jin
    2015 FIRST INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE THEORY, SYSTEMS AND APPLICATIONS (CCITSA 2015), 2015, : 133 - 135
  • [6] A synergistic text compression method - STCM
    Blandon, J
    Adjouadi, M
    Emami, S
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2773 - 2776
  • [7] BINARY-CODED TEXT - A TEXT-COMPRESSION METHOD
    TROPPER, R
    BYTE, 1982, 7 (04): : 398 - &
  • [8] An Enhanced Short Text Compression Scheme for Smart Devices
    Islam, Md. Rafiqul
    Rajon, S. A. Ahsan
    JOURNAL OF COMPUTERS, 2010, 5 (01) : 49 - 58
  • [9] A new method of chinese short text classification based on the domain ontology
    Yang, Fengqin
    Zhou, Xu
    Wu, Di
    Yang, Xiquan
    Sun, Tieli
    Sun, T. (suntl@nenu.edu.cn), 1600, ICIC Express Letters Office, Tokai University, Kumamoto Campus, 9-1-1, Toroku, Kumamoto, 862-8652, Japan (06): : 1399 - 1404
  • [10] An effective short text conceptualization based on new short text similarity
    Bekkali, Mohammed
    Lachkar, Abdelmonaime
    SOCIAL NETWORK ANALYSIS AND MINING, 2018, 9 (01)