The approach of using ontology as a pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph

Cited by: 0
Authors
Pham P. [1]
Do P. [1]
Affiliation
[1] Faculty of Information Science and Engineering, University of Information Technology (UIT), VNU, HCM, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City
Keywords
Dependency graph; Labelled LDA; Labelled topic modelling; Latent Dirichlet allocation; LDA; LLDA; Ontology-driven topic labelling; Topic identification
DOI
10.1504/ijbidm.2021.115477
Abstract
Discovering multiple topics in text is an important task in text mining. Earlier supervised approaches fail to explore multiple topics in a text. Topic modelling approaches such as LSI, pLSI, and LDA are unsupervised methods that discover distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method that integrates human-labelled topics with the given text corpus during topic modelling. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches that take advantage of the dependency graph-of-words (GOW) representation in text analysis. The GOW approach uses a frequent sub-graph mining (FSM) technique to extract graph-based concepts from the text. Our first approach, the GC2Onto model, uses graph-based concepts to construct a domain-specific ontology. Our second approach, the LLDA-GOW model, applies the graph-based concepts to improve the quality of traditional LLDA. We combine the GC2Onto and LLDA-GOW models to enhance multiple topic identification as well as other text mining tasks. © 2021 Inderscience Enterprises Ltd.
Pages: 488-523
Page count: 35