Phrase2Vec: Phrase embedding based on parsing

Cited by: 42
Authors
Wu, Yongliang [1]
Zhao, Shuliang [2, 3, 4]
Li, Wenbin [5]
Affiliations
[1] Hebei Normal Univ, Coll Math & Informat Sci, Shijiazhuang 050024, Hebei, Peoples R China
[2] Hebei Normal Univ, Coll Comp & Cyber Secur, Shijiazhuang 050024, Hebei, Peoples R China
[3] Hebei Prov Key Lab Network & Informat Secur, Shijiazhuang 050024, Hebei, Peoples R China
[4] Hebei Prov Engn Res Ctr Supply Chain Big Data Ana, Shijiazhuang 050024, Hebei, Peoples R China
[5] Hebei GEO Univ, Coll Informat Engn, Shijiazhuang 050024, Hebei, Peoples R China
Keywords
Text representation; Phrase mining; Phrase embedding; Parsing; Text classification; Text clustering; DOCUMENT; REPRESENTATIONS; WORD
DOI
10.1016/j.ins.2019.12.031
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text is one of the most common forms of unstructured data, and the primary task in text mining is usually to transform text into a structured representation. However, existing text representation models split complete semantic units and neglect word order, which ultimately leads to biased understanding. In this paper, we propose a novel phrase-based text representation method that takes into account the integrity of semantic units and uses vectors to represent the similarity relationships between texts. First, we propose HPMBP (Hierarchical Phrase Mining Based on Parsing), which mines hierarchical phrases by parsing and uses BOP (Bag Of Phrases) to represent text. Then, we put forward three phrase embedding models, collectively called Phrase2Vec: Skip-Phrase, CBOP (Continuous Bag Of Phrases), and GloVeFP (Global Vectors For Phrase Representation). They learn phrase vectors that capture semantic similarity and, from these, obtain the vector representation of the text. Based on Phrase2Vec, we propose PETC (Phrase Embedding based Text Classification) and PETCLU (Phrase Embedding based Text Clustering). PETC uses the phrase embeddings to obtain the text vector, which is fed to a neural network for text classification. PETCLU obtains the vector representations of the texts and the cluster centers via Phrase2Vec and extends the K-means model for text clustering. To the best of our knowledge, this is the first work that focuses on phrase-based English text representation. Experiments show that Phrase2Vec outperforms state-of-the-art phrase embedding models in the similarity task and the analogical reasoning task on the Enwiki, DBLP, and Yelp datasets. PETC is superior to the baseline text classification methods in F1 score by about 4%. PETCLU also outperforms prevalent text clustering methods on the entropy and purity indicators. In summary, Phrase2Vec is a promising approach to text mining. © 2020 Elsevier Inc. All rights reserved.
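As a rough illustration of the pipeline the abstract outlines (Bag Of Phrases, then a text vector, then assignment to a cluster center), the sketch below may help. It is not the authors' implementation. The abstract does not say how phrase vectors are pooled into a text vector, so the frequency-weighted mean used here is an assumption, as are cosine similarity for the assignment step, the toy vectors in `PHRASE_VECTORS`, and all function names.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy phrase vectors; in practice these would come from a
# trained phrase embedding model, not from hand-written values.
PHRASE_VECTORS = {
    "text mining": [1.0, 0.0],
    "phrase embedding": [0.8, 0.2],
    "neural network": [0.0, 1.0],
}


def bag_of_phrases(phrases):
    """Count phrase occurrences, giving a Bag-Of-Phrases view of a text."""
    return Counter(phrases)


def text_vector(phrases, vectors):
    """Frequency-weighted mean of the known phrase vectors.

    Mean pooling is an assumption made for this sketch; phrases without
    a vector are ignored.
    """
    bag = bag_of_phrases(phrases)
    known = {p: c for p, c in bag.items() if p in vectors}
    total = sum(known.values())
    if total == 0:
        raise ValueError("no phrase in the text has a known vector")
    dim = len(next(iter(vectors.values())))
    pooled = [0.0] * dim
    for phrase, count in known.items():
        for i, x in enumerate(vectors[phrase]):
            pooled[i] += count * x / total
    return pooled


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_center(vec, centers):
    """Index of the center most similar to vec (a K-means-style assignment)."""
    return max(range(len(centers)), key=lambda i: cosine(vec, centers[i]))


if __name__ == "__main__":
    doc = ["text mining", "text mining", "neural network", "unknown phrase"]
    vec = text_vector(doc, PHRASE_VECTORS)
    centers = [[1.0, 0.0], [0.0, 1.0]]
    print(vec, nearest_center(vec, centers))
```

In a classification setting, the same pooled vector would be the input to a classifier rather than to the nearest-center step.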
Pages: 100-127
Number of pages: 28