Phrase2Vec: Phrase embedding based on parsing

Cited by: 38
Authors
Wu, Yongliang [1]
Zhao, Shuliang [2, 3, 4]
Li, Wenbin [5]
Affiliations
[1] Hebei Normal Univ, Coll Math & Informat Sci, Shijiazhuang 050024, Hebei, Peoples R China
[2] Hebei Normal Univ, Coll Comp & Cyber Secur, Shijiazhuang 050024, Hebei, Peoples R China
[3] Hebei Prov Key Lab Network & Informat Secur, Shijiazhuang 050024, Hebei, Peoples R China
[4] Hebei Prov Engn Res Ctr Supply Chain Big Data Ana, Shijiazhuang 050024, Hebei, Peoples R China
[5] Hebei GEO Univ, Coll Informat Engn, Shijiazhuang 050024, Hebei, Peoples R China
Keywords
Text representation; Phrase mining; Phrase embedding; Parsing; Text classification; Text clustering; DOCUMENT; REPRESENTATIONS; WORD
DOI
10.1016/j.ins.2019.12.031
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text is one of the most common forms of unstructured data, and the primary task in text mining is usually to transform text into a structured representation. However, existing text representation models split complete semantic units and neglect word order, which leads to biased understanding. In this paper, we propose a novel phrase-based text representation method that takes into account the integrity of semantic units and uses vectors to represent the similarity relationships between texts. First, we propose HPMBP (Hierarchical Phrase Mining Based on Parsing), which mines hierarchical phrases by parsing and uses BOP (Bag Of Phrases) to represent text. Then, we put forward three phrase embedding models, collectively called Phrase2Vec: Skip-Phrase, CBOP (Continuous Bag Of Phrases), and GloVeFP (Global Vectors For Phrase Representation). They learn phrase vectors that capture semantic similarity, from which the vector representation of a text is obtained. Based on Phrase2Vec, we propose PETC (Phrase Embedding based Text Classification) and PETCLU (Phrase Embedding based Text Clustering). PETC uses the phrase embeddings to obtain the text vector, which is fed to a neural network for text classification. PETCLU obtains the vector representations of texts and cluster centers through Phrase2Vec and extends the K-means model for text clustering. To the best of our knowledge, this is the first work that focuses on phrase-based English text representation. Experiments show that Phrase2Vec outperforms state-of-the-art phrase embedding models in the similarity task and the analogical reasoning task on the Enwiki, DBLP, and Yelp datasets. PETC is superior to the baseline text classification methods by about 4% in F1 score. PETCLU also outperforms prevalent text clustering methods on the entropy and purity indicators. In summary, Phrase2Vec is a promising approach to text mining. (C) 2020 Elsevier Inc. All rights reserved.
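The step from phrase vectors to a text vector can be illustrated with a minimal sketch. The abstract does not say how phrase vectors are pooled, so the frequency-weighted mean used here is an assumption, and the vectors, phrase lists, and function names (`bag_of_phrases`, `text_vector`, `cosine`) are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy phrase vectors. In the paper these would be learned by
# Skip-Phrase, CBOP, or GloVeFP; the values below are made up for illustration.
PHRASE_VECTORS = {
    "text mining": [0.9, 0.1, 0.0],
    "phrase embedding": [0.8, 0.2, 0.1],
    "neural network": [0.1, 0.9, 0.2],
}


def bag_of_phrases(phrases):
    """Count phrase occurrences, treating each phrase as one indivisible unit."""
    return Counter(phrases)


def text_vector(phrases, vectors=PHRASE_VECTORS):
    """Frequency-weighted mean of known phrase vectors (an assumed pooling rule)."""
    bag = bag_of_phrases(p for p in phrases if p in vectors)
    total = sum(bag.values())
    if total == 0:
        raise ValueError("text contains no phrase with a known vector")
    dim = len(next(iter(vectors.values())))
    return [
        sum(vectors[p][i] * count for p, count in bag.items()) / total
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Phrase lists as a phrase miner such as HPMBP might emit them (hypothetical).
doc_a = ["text mining", "phrase embedding", "text mining"]
doc_b = ["neural network"]
vec_a = text_vector(doc_a)
vec_b = text_vector(doc_b)
similarity = cosine(vec_a, vec_b)
```

Text vectors built this way could then be passed to a classifier or to K-means, which is the role the abstract assigns to PETC and PETCLU respectively.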
Pages: 100-127
Number of pages: 28