Spam filtering using statistical data compression models

Cited by: 0
Authors
Department of Intelligent Systems, Jožef Stefan Institute, Jamova 39, Ljubljana, SI-1000, Slovenia [1]
Unknown [2]
Unknown [3]
Affiliations
Source
J. Mach. Learn. Res. | 2006 / pp. 2673-2698
Keywords
Adaptive filtering - Classification (of information) - Data compression - Electronic mail - Learning algorithms - Markov processes - Text processing
DOI
Not available
Abstract
Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
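The idea in the abstract of using a sequence model as a classifier can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a fixed-order character model with add-one smoothing as a simplified stand-in for dynamic Markov compression or prediction by partial matching, and the names `CharMarkovModel`, `code_length` and `classify` are hypothetical. A message is assigned to the class whose model would encode it in fewer bits.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Order-k character model with add-one smoothing.

    Assumption: a simplified stand-in for PPM/DMC, not either algorithm.
    """

    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        # Assumed alphabet size used only for smoothing.
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _pairs(self, text):
        # Pad the start so the first characters also have a full context.
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model."""
        for ctx, ch in self._pairs(text):
            self.counts[ctx][ch] += 1
            self.totals[ctx] += 1

    def code_length(self, text):
        """Bits an ideal coder driven by this model would need for text."""
        bits = 0.0
        for ctx, ch in self._pairs(text):
            seen = self.counts.get(ctx, {}).get(ch, 0)
            total = self.totals.get(ctx, 0)
            prob = (seen + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits


def classify(message, spam_model, ham_model):
    """Pick the class whose model compresses the message better."""
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"


spam_model = CharMarkovModel()
ham_model = CharMarkovModel()
spam_model.update("buy cheap viagra now, cheap pills online")
ham_model.update("meeting agenda for monday attached")
label = classify("cheap viagra", spam_model, ham_model)
```

Because `update` only increments counts, user feedback on a single message can be folded in without retraining, which matches the incremental setting the abstract describes. No tokenizer is involved: the model sees raw characters.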
Related papers
50 in total
  • [21] Using Probabilistic Models for Data Compression
    Iatan, Iuliana
    Dragan, Mihaita
    Dedu, Silvia
    Preda, Vasile
    MATHEMATICS, 2022, 10 (20)
  • [22] Spam filtering using Kolmogorov complexity analysis
    Richard, G.
    Doncescu, A.
    INTERNATIONAL JOURNAL OF WEB AND GRID SERVICES, 2008, 4 (01) : 136 - 148
  • [23] On the Study of Anomaly-based Spam Filtering Using Spam as Representation of Normality
    Laorden, Carlos
    Ugarte-Pedrero, Xabier
    Santos, Igor
    Sanz, Borja
    Nieves, Javier
    Bringas, Pablo G.
    2012 IEEE CONSUMER COMMUNICATIONS AND NETWORKING CONFERENCE (CCNC), 2012, : 693 - 695
  • [24] Lossy compression of statistical data using quantum annealer
    Boram Yoon
    Nga T. T. Nguyen
    Chia Cheng Chang
    Ermal Rrapaj
    Scientific Reports, 12
  • [25] Lossy compression of statistical data using quantum annealer
    Yoon, Boram
    Nguyen, Nga T. T.
    Chang, Chia Cheng
    Rrapaj, Ermal
    SCIENTIFIC REPORTS, 2022, 12 (01)
  • [26] Adaptive filtering of SPAM
    Pelletier, L
    Almhana, J
    Choulakian, V
    SECOND ANNUAL CONFERENCE ON COMMUNICATION NETWORKS AND SERVICES RESEARCH, PROCEEDINGS, 2004, : 218 - 224
  • [27] Spam filtering scheme
    Wang, Jing (wngjing@hotmail.com), 1600, Northeast University (35):
  • [28] Short Messages Spam Filtering Using Sentiment Analysis
    Ezpeleta, Enaitz
    Zurutuza, Urko
    Gomez Hidalgo, Jose Maria
    TEXT, SPEECH, AND DIALOGUE, 2016, 9924 : 142 - 153
  • [29] Adaptive spam mail filtering using genetic algorithm
    Sanpakdee, U
    Walairacht, A
    Walairacht, S
    8th International Conference on Advanced Communication Technology, Vols 1-3: TOWARD THE ERA OF UBIQUITOUS NETWORKS AND SOCIETIES, 2006, : U441 - U445
  • [30] Email Spam Filtering
    Puertas Sanz, Enrique
    Gomez Hidalgo, Jose Maria
    Cortizo Perez, Jose Carlos
    ADVANCES IN COMPUTERS, VOL 74: SOFTWARE DEVELOPMENT, 2008, 74 : 45 - 114