Phrase2Vec: Phrase embedding based on parsing

Cited by: 38
Authors
Wu, Yongliang [1]
Zhao, Shuliang [2, 3, 4]
Li, Wenbin [5]
Affiliations
[1] Hebei Normal Univ, Coll Math & Informat Sci, Shijiazhuang 050024, Hebei, Peoples R China
[2] Hebei Normal Univ, Coll Comp & Cyber Secur, Shijiazhuang 050024, Hebei, Peoples R China
[3] Hebei Prov Key Lab Network & Informat Secur, Shijiazhuang 050024, Hebei, Peoples R China
[4] Hebei Prov Engn Res Ctr Supply Chain Big Data Ana, Shijiazhuang 050024, Hebei, Peoples R China
[5] Hebei GEO Univ, Coll Informat Engn, Shijiazhuang 050024, Hebei, Peoples R China
Keywords
Text representation; Phrase mining; Phrase embedding; Parsing; Text classification; Text clustering; DOCUMENT; REPRESENTATIONS; WORD
DOI
10.1016/j.ins.2019.12.031
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text is one of the most common forms of unstructured data, and the primary task in text mining is usually to convert text into a structured representation. However, existing text representation models split complete semantic units and neglect word order, which ultimately leads to biased understanding. In this paper, we propose a novel phrase-based text representation method that takes into account the integrity of semantic units and uses vectors to represent the similarity relationships between texts. First, we propose HPMBP (Hierarchical Phrase Mining Based on Parsing), which mines hierarchical phrases by parsing and uses BOP (Bag Of Phrases) to represent text. Then, we put forward three phrase embedding models, collectively called Phrase2Vec: Skip-Phrase, CBOP (Continuous Bag Of Phrases), and GloVeFP (Global Vectors For Phrase Representation). They learn phrase vectors that capture semantic similarity and, from these, obtain the vector representation of the text. Based on Phrase2Vec, we propose PETC (Phrase Embedding based Text Classification) and PETCLU (Phrase Embedding based Text Clustering). PETC uses the phrase embeddings to obtain the text vector, which is fed to a neural network for text classification. PETCLU obtains the vector representations of the texts and the cluster centers via Phrase2Vec, and extends the K-means model for text clustering. To the best of our knowledge, this is the first work that focuses on phrase-based English text representation. Experiments show that Phrase2Vec outperforms state-of-the-art phrase embedding models on the similarity task and the analogical reasoning task on the Enwiki, DBLP, and Yelp datasets. PETC is superior to the baseline text classification methods in F1 value by about 4%. PETCLU also outperforms the prevalent text clustering methods on the entropy and purity indicators. In summary, Phrase2Vec is a promising approach to text mining. (C) 2020 Elsevier Inc. All rights reserved.
Pages: 100-127
Page count: 28
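
The abstract describes representing a text as a bag of phrases and deriving a text vector from phrase vectors. The sketch below illustrates that idea only in outline; it is not the authors' implementation. It assumes the text has already been segmented into phrases (HPMBP itself is not reproduced), that phrase vectors are given (the values in `PHRASE_VECTORS` are hypothetical toy numbers, not learned by Phrase2Vec), and that pooling is a frequency-weighted average, which the abstract does not specify. All function names are illustrative.

```python
from collections import Counter
import math

# Hypothetical toy phrase vectors; a real system would learn these.
PHRASE_VECTORS = {
    "text mining": [1.0, 0.0],
    "phrase embedding": [0.0, 1.0],
    "neural network": [1.0, 1.0],
}


def bag_of_phrases(phrases):
    """Count how often each phrase occurs in one text."""
    return Counter(phrases)


def text_vector(phrases, vectors=PHRASE_VECTORS):
    """Pool phrase vectors into one text vector.

    Assumption: frequency-weighted average over phrases that have a vector;
    phrases without a vector are ignored.
    """
    bag = bag_of_phrases(phrases)
    known = {p: c for p, c in bag.items() if p in vectors}
    total = sum(known.values())
    if total == 0:
        raise ValueError("no phrase in the text has a vector")
    dim = len(next(iter(vectors.values())))
    pooled = [0.0] * dim
    for phrase, count in known.items():
        for i, value in enumerate(vectors[phrase]):
            pooled[i] += count * value / total
    return pooled


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# A text already segmented into phrases (segmentation is assumed, not shown).
doc = ["text mining", "text mining", "phrase embedding", "unseen phrase"]
vec = text_vector(doc)
similarity = cosine(vec, text_vector(["text mining"]))
```

A vector obtained this way could then be passed to a classifier or compared against cluster centers, in the spirit of PETC and PETCLU, although the paper's actual models may differ.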