Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering

Cited by: 208
Authors
Onan, Aytug [1]
Affiliations
[1] Izmir Katip Celebi Univ, Comp Engn Dept, TR-35620 Izmir, Turkey
Keywords
Topic extraction; machine learning; cluster analysis; text mining; SCIENCE; ENSEMBLE
DOI
10.1109/ACCESS.2019.2945911
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Topic extraction is an essential task in bibliometric data analysis, data mining and knowledge discovery, which seeks to identify significant topics from text collections. Conventional topic extraction schemes require human intervention and also involve comprehensive pre-processing tasks to represent text collections in an appropriate way. In this paper, we present a two-stage framework for topic extraction from scientific literature, in which word embedding schemes are used in conjunction with cluster analysis. To extract significant topics from text collections, we propose an improved word embedding scheme, which incorporates word vectors obtained by the word2vec, POS2vec, word-position2vec and LDA2vec schemes. In the clustering phase, we present an improved clustering ensemble framework, which combines conventional clustering methods (i.e., k-means, k-modes, k-means++, self-organizing maps and the DIANA algorithm) by means of iterative voting consensus. In the empirical analysis, we analyze a corpus containing 160,424 abstracts of articles from various disciplines, including agricultural engineering, economics, engineering and computer science. The performance of the proposed scheme is compared to conventional baseline clustering methods (such as k-means, k-modes and k-means++), LDA-based topic modelling and conventional word embedding schemes. The empirical analysis reveals that the ensemble word embedding scheme yields better predictive performance than the baseline word vectors for topic extraction, and that the ensemble clustering framework outperforms the baseline clustering methods. The results obtained by the proposed framework show an improvement in Jaccard coefficient, Fowlkes-Mallows measure and F1 score.
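The three evaluation measures named at the end of the abstract can be illustrated with a minimal sketch. It assumes the common pair-counting definitions for comparing a clustering against reference labels; this record does not state which definitions the paper uses, so the function names (`pair_counts`, `pair_metrics`) and the toy labels are purely illustrative.

```python
from itertools import combinations
from math import sqrt


def pair_counts(truth, pred):
    """Count item pairs by agreement between reference classes and clusters.

    Returns (tp, fp, fn):
      tp = pairs that share both a class and a cluster
      fp = pairs that share a cluster but not a class
      fn = pairs that share a class but not a cluster
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    return tp, fp, fn


def pair_metrics(truth, pred):
    """Pair-counting Jaccard, Fowlkes-Mallows and F1 (assumed definitions)."""
    tp, fp, fn = pair_counts(truth, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    # Fowlkes-Mallows is the geometric mean of pairwise precision and recall.
    fowlkes_mallows = sqrt(precision * recall)
    # F1 is the harmonic mean of pairwise precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"jaccard": jaccard, "fowlkes_mallows": fowlkes_mallows, "f1": f1}


if __name__ == "__main__":
    # Hypothetical toy data: six documents, two reference topics.
    truth = [0, 0, 0, 1, 1, 1]
    pred = [0, 0, 1, 1, 1, 1]
    print(pair_metrics(truth, pred))
```

All three measures lie in [0, 1] and equal 1 only when the clustering reproduces the reference partition exactly, which is why an increase in each is reported as an improvement.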
Pages: 145614-145633
Number of pages: 20