Unsupervised tweets categorization using semantic and statistical features

Authors
Maibam Debina Devi
Navanath Saharia
Affiliation
[1] IIIT Senapati, Data Engineering Lab
Source
Multimedia Tools and Applications, 2023, 82 (06)
Keywords
Unsupervised learning; Social blogging; Semantic similarity; tf-idf; DBSCAN
DOI
Not available
Abstract
Clustering is a widely used technique in information retrieval. This experiment categorizes Tweets by content, treating them as representative of social media and user-generated text, by exploiting statistical and semantic features. The widely adopted tf-idf measure is combined with a synonym-based weighting scheme. The tf-idf weight vector is passed to the next phase as input, where the system generates a second weighted vector based on word synonyms. Both vectors are used as features for clustering; the synonym-based feature adds semantic information to the formation of the clusters. Using a density-based categorical clustering algorithm (minpoints = 8, epsilon = 1.5), we grouped the Tweets into clusters. Six clusters were formed from 1K Tweets; manual evaluation found them cohesive. A Silhouette coefficient of 0.47 was used to validate the clusters.
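The pipeline described in the abstract (tf-idf vector, synonym-derived vector, density-based clustering, Silhouette validation) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The abstract does not specify the synonym weighting formula, so summing tf-idf weights over a synonym group is an assumption, as are the concatenation of the two vectors, the tiny hand-written synonym table, the toy documents, and all function names. The clustering step is plain DBSCAN, which the keywords name. The toy run uses eps = 2.4 and min_pts = 3 because six short documents cannot satisfy the reported minpoints = 8.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors): one tf-idf weight vector per tokenized document."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(len(docs) / df[w]) + 1.0 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

def synonym_vectors(vocab, vectors, synonyms):
    """ASSUMED scheme: a term's weight is the summed tf-idf weight of its synonym group."""
    index = {w: i for i, w in enumerate(vocab)}
    groups = [{index[w]} | {index[s] for s in synonyms.get(w, ()) if s in index}
              for w in vocab]
    return [[sum(vec[i] for i in group) for group in groups] for vec in vectors]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:        # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:         # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # only core points expand the cluster
                queue.extend(reachable)
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over clustered (non-noise) points."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for k, other in clusters.items() if k != lab)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: three "movie" posts and three "weather" posts.
docs = [["great", "film", "tonight"], ["good", "movie", "tonight"], ["great", "movie"],
        ["rain", "storm", "today"], ["storm", "flood", "today"], ["rain", "flood"]]
synonyms = {"film": {"movie"}, "movie": {"film"},
            "great": {"good"}, "good": {"great"}}

vocab, statistical = tfidf_vectors(docs)
semantic = synonym_vectors(vocab, statistical, synonyms)
features = [s + m for s, m in zip(statistical, semantic)]  # concatenation is an assumption
labels = dbscan(features, eps=2.4, min_pts=3)              # toy values, not the paper's
score = silhouette(features, labels)
print(labels, round(score, 2))
```

On this toy input the movie posts and the weather posts fall into two separate clusters. The synonym vector pulls "film"/"movie" and "great"/"good" documents closer together than tf-idf alone would, which is the effect the abstract attributes to the synonym-based feature.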
Pages: 9047-9064
Number of pages: 17
Related papers
  • [1] Unsupervised tweets categorization using semantic and statistical features
    Devi, Maibam Debina
    Saharia, Navanath
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (06) : 9047 - 9064
  • [2] Texture Categorization Using Statistical and Spectral Features
    Arivazhagan, S.
    Nidhyanandhan, S. Selva
    Shebiah, R. Newlin
    ICCN: 2008 INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING, 2008 : 366 - 373
  • [3] On the use of supervised features for unsupervised image categorization: An evaluation
    Ciocca, Gianluigi
    Cusano, Claudio
    Santini, Simone
    Schettini, Raimondo
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2014, 122 : 155 - 171
  • [4] Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features
    Al-Smadi, Mohammad
    Jaradat, Zain
    Al-Ayyoub, Mahmoud
    Jararweh, Yaser
    INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (03) : 640 - 652
  • [5] Unsupervised Software Categorization using Bytecode
    Escobar-Avila, Javier
    Linares-Vasquez, Mario
    Haiduc, Sonia
    2015 IEEE 23RD INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION ICPC 2015, 2015 : 229 - 239
  • [6] Disaster damage assessment from the tweets using the combination of statistical features and informative words
    Sreenivasulu Madichetty
    M. Sridevi
    Social Network Analysis and Mining, 2019, 9
  • [7] Disaster damage assessment from the tweets using the combination of statistical features and informative words
    Madichetty, Sreenivasulu
    Sridevi, M.
    SOCIAL NETWORK ANALYSIS AND MINING, 2019, 9 (01)
  • [8] Leveraging Sublanguage Features for the Semantic Categorization of Clinical Terms
    Gron, Leonie
    Bertels, Ann
    Heylen, Kris
    SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), 2019 : 211 - 216
  • [9] Tweets Clustering using Latent Semantic Analysis
    Rasidi, Norsuhaili Mahamed
    Abu Bakar, Sakhinah
    Razak, Fatimah Abdul
    4TH INTERNATIONAL CONFERENCE ON MATHEMATICAL SCIENCES (ICMS4): MATHEMATICAL SCIENCES: CHAMPIONING THE WAY IN A PROBLEM BASED AND DATA DRIVEN SOCIETY, 2017, 1830
  • [10] Sentiment Analysis of Tweets Using Semantic Analysis
    Kale, Snehal
    Padmadas, Vijaya
    2017 INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION, CONTROL AND AUTOMATION (ICCUBEA), 2017