Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers

Cited by: 0
Authors
Halder, Soma [1]
Tiwari, Richa [1]
Sprague, Alan [1]
Affiliations
[1] Univ Alabama Birmingham, Birmingham, AL 35229 USA
Source
2011 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2011
Keywords
Spam; semantics; stylistics; natural language processing; IP address
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional anti-spam methods filter spam emails and keep them out of the inbox, but take no measures to trace spammers and penalize them. We use natural language processing techniques to cluster spam emails from the same spammer based on the content and the style of the email. Spam emails from different sources are studied using stylistic features, semantic features, and a combination of both. Three clusterings are performed: one based on stylistic features, one based on semantic features, and one based on the combined features. These clusters are then compared and evaluated. We observe that spam emails from the same source share similarities and cluster together. These emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the whois information for those IP addresses helps to obtain information about the source of the spam.
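The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the features or the clustering algorithm, so the stylistic cues (punctuation, capitalization, digit, and word-length ratios), the bag-of-words semantic vector, the greedy cosine-threshold clustering, and every function name below are assumptions chosen only for illustration. The sample emails are invented.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+")
WORD_RE = re.compile(r"[a-z']+")

# Invented sample messages, for illustration only.
SAMPLE_EMAILS = [
    "Cheap pills online! Buy cheap pills now at http://pills.example.com/buy",
    "Buy cheap pills online today! Visit http://pills.example.com/order",
    "Your bank account is suspended, verify your account at http://bank.example.net/login",
]

def strip_urls(text):
    return URL_RE.sub(" ", text)

def stylistic_features(text):
    """Assumed style cues: punctuation, capitalization, digits, word length."""
    body = strip_urls(text)
    words = body.split()
    n_chars = max(len(body), 1)
    n_words = max(len(words), 1)
    return {
        "style:exclaim": body.count("!") / n_chars,
        "style:upper": sum(c.isupper() for c in body) / n_chars,
        "style:digit": sum(c.isdigit() for c in body) / n_chars,
        "style:avg_word_len": sum(len(w) for w in words) / n_words / 10.0,
    }

def semantic_features(text):
    """Assumed content representation: lowercase bag-of-words counts."""
    counts = Counter(WORD_RE.findall(strip_urls(text).lower()))
    return {"sem:" + word: float(n) for word, n in counts.items()}

def unit(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

def combined_features(text):
    # Normalize each part first so neither feature family dominates.
    return {**unit(stylistic_features(text)), **unit(semantic_features(text))}

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(vectors, threshold):
    """Stand-in clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def url_hosts(text):
    """Host names of the URLs promoted in a message."""
    return sorted({urlparse(u).hostname for u in URL_RE.findall(text)})

def hosts_per_cluster(clusters, emails):
    """Map each cluster to the hosts its member emails point at."""
    return [sorted({h for i in members for h in url_hosts(emails[i])})
            for members in clusters]

semantic_clusters = greedy_cluster(
    [semantic_features(e) for e in SAMPLE_EMAILS], threshold=0.5)
cluster_hosts = hosts_per_cluster(semantic_clusters, SAMPLE_EMAILS)
```

Swapping `semantic_features` for `stylistic_features` or `combined_features` gives the other two clusterings that the abstract compares. The sketch stops at host names; resolving each host to an IP address (for example with `socket.gethostbyname`) and querying whois for that address would be the remaining steps, and both need network access.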
Pages: 104-107
Page count: 4