Short Text Clustering Algorithms, Application and Challenges: A Survey

Cited by: 23
Authors
Ahmed, Majid Hameed [1 ,2 ]
Tiun, Sabrina [1 ]
Omar, Nazlia [1 ]
Sani, Nor Samsiah [1 ]
Affiliations
[1] Univ Kebangsaan Malaysia, Fac Informat Sci & Technol, CAIT, Bangi 43600, Selangor, Malaysia
[2] Minist Higher Educ & Sci Res, Baghdad 10065, Iraq
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 01
Keywords
short text; text representation; dimensionality reduction; clustering techniques; short text clustering; INDEPENDENT COMPONENT ANALYSIS; CONVOLUTIONAL NEURAL-NETWORKS; SIMILARITY MEASURE; TWITTER; REPRESENTATION; SEARCH;
DOI
10.3390/app13010342
Chinese Library Classification
O6 [Chemistry]
Subject classification code
0703
Abstract
The number of online documents has rapidly grown, and with the expansion of the Web, document analysis, or text analysis, has become an essential task for preparing, storing, visualizing and mining documents. The texts generated daily on social media platforms such as Twitter, Instagram and Facebook are vast and unstructured. Most of these generated texts come in the form of short text and need special analysis because short text suffers from lack of information and sparsity. Thus, this topic has attracted growing attention from researchers in the data storing and processing community for knowledge discovery. Short text clustering (STC) has become a critical task for automatically grouping various unlabelled texts into meaningful clusters. STC is a necessary step in many applications, including Twitter personalization, sentiment analysis, spam filtering, customer reviews and many other social network-related applications. In the last few years, the natural-language-processing research community has concentrated on STC and attempted to overcome the problems of sparseness, dimensionality, and lack of information. We comprehensively review various STC approaches proposed in the literature. Providing insights into the technological component should assist researchers in identifying the possibilities and challenges facing STC. To gain such insights, we review various literature, journals, and academic papers focusing on STC techniques. The contents of this study are prepared by reviewing, analysing and summarizing diverse types of journals and scholarly articles with a focus on the STC techniques from five authoritative databases: IEEE Xplore, Web of Science, Science Direct, Scopus and Google Scholar. This study focuses on STC techniques: text clustering, challenges to short texts, pre-processing, document representation, dimensionality reduction, similarity measurement of short text and evaluation.
Pages: 38