HashCat: A Novel Approach for the Topic Classification of Multilingual Twitter Trends

Cited by: 2
Authors
Kausar, Soufia [1]
Tahir, Bilal [1]
Mehmood, Muhammad Amir [1]
Affiliations
[1] Univ Engn & Technol, Al Khawarizmi Inst Comp Sci, Lahore, Pakistan
Source
2021 INTERNATIONAL CONFERENCE ON FRONTIERS OF INFORMATION TECHNOLOGY (FIT 2021) | 2021
Keywords
Twitter hashtags; topic classification; hashtag segmentation
DOI
10.1109/FIT53504.2021.00047
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
With the growing use of online social networks, users generate an enormous amount of data every day. Twitter groups tweets that share a hashtag, which lets users easily extract the information they need for a target hashtag. However, understanding these hashtags is challenging because tweets contain short, multilingual content and non-standard vocabulary. In this article, we propose HashCat, a novel approach for the topic classification of multilingual Twitter trends. In addition, we present a technique for segmenting English and Urdu hashtags. First, we develop a labelled dataset, HT-Dat, containing 1,882 Urdu and English hashtags manually labelled into six broad categories. Next, we use the features of i) tweet text, ii) co-occurrence, and iii) segment similarity to classify hashtags. HashCat achieves an overall accuracy of 0.93 on the HT-Dat dataset. The classification results and Type-Token Ratio analysis for the hashtag categories reveal that categories with low lexical diversity are classified with higher accuracy by the HashCat classifier. We believe that our methodology can help social media analysts conduct research on domain-specific hashtags.
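The abstract refers to Type-Token Ratio (TTR) as its measure of lexical diversity. TTR is conventionally the number of distinct tokens divided by the total number of tokens, so a lower value indicates a more repetitive vocabulary. The following is a minimal sketch of that standard definition, not the authors' implementation: the function name `type_token_ratio`, the whitespace tokenization, and the lowercasing step are all assumptions made for illustration.

```python
from typing import Iterable


def type_token_ratio(texts: Iterable[str]) -> float:
    """Return distinct tokens / total tokens over a collection of texts.

    Assumption: tokens are whitespace-separated and lowercased. The paper's
    actual tokenization (especially for Urdu text) is not described in the
    abstract and may differ.
    """
    tokens = []
    for text in texts:
        tokens.extend(text.lower().split())
    if not tokens:
        # No tokens at all: define the ratio as 0.0 to avoid dividing by zero.
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Hypothetical tweets for one hashtag category (illustrative only).
    sample_tweets = ["match today", "match tonight", "match today"]
    print(type_token_ratio(sample_tweets))
```

Computed per category over the tweets attached to its hashtags, a value like this would allow the kind of comparison the abstract describes between lexical diversity and classification accuracy.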
Pages: 212-217
Number of pages: 6