Comparative analysis of TF-IDF and loglikelihood method for keywords extraction of twitter data

Cited by: 5

Authors:

Abid, Muhammad Adeel ^{[1]}

Mushtaq, Muhammad Faheem ^{[2]}

Akram, Urooj ^{[2]}

Abbasi, Mateen Ahmed ^{[1]}

Rustam, Furqan ^{[1]}

Affiliations:

[1] Khwaja Fareed Univ Engn & Informat Technol, Fac Informat Technol, Rahim Yar Khan 64200, Pakistan

[2] Islamia Univ Bahawalpur, Dept Artificial Intelligence, Bahawalpur 63100, Pakistan

Source:

MEHRAN UNIVERSITY RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY | 2023 / Vol. 42 / Issue 01

Keywords:

Twitter; Social Media; Classification; Loglikelihood Methods; Term Frequency-Inverse Document Frequency

DOI:

10.22581/muet1982.2301.09

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08 ;

Abstract:

Twitter has become one of the foremost social media platforms in today's world. Over 335 million users are online monthly, and about 80% of them access it from mobile devices. Twitter now supports 35+ languages, which further broadens its usage by serving people who speak different languages. About 21% of all users are from the US and 79% are from outside the US. A tweet is restricted to 140 characters, so the information it carries is concise and valuable. An estimated 500 million tweets are sent per day by many categories of people, including teachers, students, celebrities, officers, and musicians. The result is a huge, daily growing volume of data that needs to be categorized. A key step is to find the keywords in this data that are helpful for classifying Twitter data. For this purpose, the Term Frequency-Inverse Document Frequency (TF-IDF) and Loglikelihood methods are used to extract keywords from the music field, and a comparative analysis is performed on both sets of results. Finally, the relevance of the extracted keywords is judged by 5 users, so that a decision can be made, on the basis of the experiments, about which method is best. This analysis is valuable because it gives a more accurate estimate of which method's results are more reliable.
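
As a rough illustration of the two scoring methods named in the abstract, the following Python sketch scores terms with TF-IDF and with a log-likelihood keyness statistic. This record does not give the authors' exact formulas, preprocessing, or reference corpus, so everything here is an assumption for illustration: the TF-IDF variant (relative term frequency times log(N/df)), the Dunning-style two-corpus log-likelihood statistic, the function names `tf_idf` and `log_likelihood`, and the toy token lists.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term of every tokenized document as tf * log(N / df).

    Assumed variant: tf is the relative frequency within the document and
    idf is the unsmoothed natural log of (number of docs / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def log_likelihood(target, reference):
    """Dunning-style G2 keyness of each target term against a reference corpus.

    Both arguments are flat token lists. A higher score means the term's
    frequency differs more strongly between the two corpora.
    """
    a_counts, b_counts = Counter(target), Counter(reference)
    n_a, n_b = len(target), len(reference)
    scores = {}
    for term, a in a_counts.items():
        b = b_counts.get(term, 0)
        # Expected counts if the term were equally likely in both corpora.
        e_a = n_a * (a + b) / (n_a + n_b)
        e_b = n_b * (a + b) / (n_a + n_b)
        g2 = a * math.log(a / e_a)
        if b > 0:  # a zero count contributes nothing to the sum
            g2 += b * math.log(b / e_b)
        scores[term] = 2 * g2
    return scores


# Hypothetical toy data, not taken from the paper.
music_tweets = [["new", "album", "tonight"], ["love", "the", "album"]]
general_tokens = ["the", "weather", "the", "phone", "tonight", "game"]

tfidf_scores = tf_idf(music_tweets)
music_tokens = [tok for tweet in music_tweets for tok in tweet]
ll_scores = log_likelihood(music_tokens, general_tokens)
top_ll = max(ll_scores, key=ll_scores.get)
```

Note that this log-likelihood statistic measures how strongly a term's frequency differs between the two corpora, not the direction of the difference, so a practical keyword extractor would typically keep only terms that are over-represented in the target corpus.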


Pages: 88-94

Page count: 7

