Nonparametric method of topic identification using granularity concept and graph-based modeling

Cited by: 4
Authors
Ganguli, Isha [1]
Sil, Jaya [1]
Sengupta, Nandita [2]
Affiliations
[1] Indian Inst Engn Sci & Technol, Dept Comp Sci & Technol, Sibpur, Howrah, India
[2] Univ Coll Bahrain, Dept Informat Technol, Janabiyah, Bahrain
Keywords
Granularity; Point-wise mutual information; Graph-based modeling; Hierarchical structure; Computationally efficient algorithm; DOCUMENT; CLASSIFICATION
DOI
10.1007/s00521-020-05662-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper aims to classify large unstructured documents into topics without requiring extensive computational resources or a priori knowledge. The concept of granularity is employed to extract contextual information from the documents by hierarchically generating granules of words (GoWs). The proposed granularity-based word grouping (GBWG) algorithm groups words at different layers in a computationally efficient way, using a co-occurrence measure between the words of different granules. The GBWG algorithm terminates when no new GoW is generated at any layer of the hierarchical structure. Multiple GoWs are thus obtained, each containing contextually related words, which represent different topics. However, the GoWs may share common words, which creates ambiguity in topic identification. The Louvain graph clustering algorithm is therefore employed to automatically identify topics containing unique words, using mutual information as the association measure between the words (nodes) of each GoW. A test document is classified into a particular topic based on the probability that its unique words belong to the different topics. The performance of the proposed method is compared with that of unsupervised, semi-supervised, and supervised topic modeling algorithms. Experiments show that the proposed method is comparable to or better than state-of-the-art topic modeling algorithms, a result further verified statistically with the Wilcoxon rank-sum test.
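The abstract names two ingredients that a small example can make concrete: a mutual-information association measure between words, and classification of a test document by how its unique words distribute over topics. The following is a minimal Python sketch under stated assumptions, not the authors' GBWG algorithm. The function names `pmi_scores` and `classify`, the document-level co-occurrence window, the toy corpus, and the word-share scoring rule are all hypothetical choices made for illustration; the record does not specify these details.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs):
    """Point-wise mutual information for word pairs that co-occur in a document.

    Assumption: co-occurrence is counted once per document (not per window).
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        vocab = sorted(set(doc))  # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def classify(doc, topics):
    """Pick the topic covering the largest share of the document's unique words.

    Assumption: this share is a simple stand-in for the probability-based
    rule mentioned in the abstract.
    """
    words = set(doc)
    if not words:
        return None, {}
    shares = {name: len(words & vocab) / len(words) for name, vocab in topics.items()}
    return max(shares, key=shares.get), shares


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["graph", "node", "edge"],
    ["graph", "node", "cluster"],
    ["topic", "word", "document"],
    ["topic", "word", "corpus"],
]
scores = pmi_scores(docs)

# Hypothetical topics with disjoint (unique) word sets.
topics = {
    "graphs": {"graph", "node", "edge", "cluster"},
    "text": {"topic", "word", "document", "corpus"},
}
label, shares = classify(["graph", "cluster", "word"], topics)
```

In this toy corpus, "graph" and "node" appear together in two of four documents and each appears in two, so their PMI is log2(0.5 / 0.25) = 1. In a fuller pipeline, scores of this kind could serve as edge weights in a word graph that a community-detection step such as Louvain then partitions into topics.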
Pages: 1055-1075
Number of pages: 21