Graph-based clustering of extracted paraphrases for labelling crime reports

Cited by: 24
Authors
Das, Priyanka [1]
Das, Asit Kumar [1]
Affiliations
[1] Indian Inst Engn Sci & Technol, Dept Comp Sci & Technol, Sibpur 711103, Howrah, India
Keywords
Crime analysis; Text mining; Entity recognition; Graph clustering; Clustering coefficient; Paraphrase extraction; Sparse graph; Sparsity measure; Edge density; Gini index; Community structure
DOI
10.1016/j.knosys.2019.05.004
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Paraphrases are commonly understood as expressions that convey the same meaning in different wording. Extracting paraphrases from a large text corpus is a challenging task in Natural Language Processing applications. The present work proposes a graph-based clustering technique for discovering labels of crime reports from paraphrases extracted from large untagged crime corpora. Initially, the entity pairs are represented as shallow parse trees, where the headword of each tree reflects the actual meaning of the phrase between the entities. Although phrases with similar headwords are collected together, many phrases between the entities express a similar context without sharing the same headword. Therefore, clustering is performed to create groups of phrases with similar meaning, termed paraphrases. A complete weighted graph is constructed with the phrases as nodes and the cosine similarity between each pair of phrases as the weight of the edge joining them. The graph is made sparse by removing edges with weights below a threshold value, and the clustering coefficient is calculated for each node. The subgraph(s) comprising the node(s) with the highest clustering coefficient are extracted together with their adjacent edges. The remaining nodes, with their adjacent edges, are added one at a time to an extracted subgraph if and only if the average clustering coefficient of the resulting subgraph increases; an agglomerative merging technique is then applied to merge the extracted subgraphs until no further merging takes place. Finally, each subgraph represents a cluster of phrases that yields one aspect of crime. Based on the extracted paraphrases, the reports can be easily labelled. The proposed work deals with crime reports from the United States of America (USA), the United Arab Emirates (UAE) and India, and the evaluation is performed in terms of various supervised and unsupervised techniques. (C) 2019 Elsevier B.V. All rights reserved.
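
The graph steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: phrases are vectorised as plain bag-of-words counts, the seed subgraph is taken to be the highest-coefficient node plus its neighbours, ties are broken by node order, and the threshold of 0.5 is arbitrary. The function names (`sparse_graph`, `clustering_coefficient`, `seed_and_grow`) and the sample phrases are hypothetical, and the final agglomerative merging step is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sparse_graph(phrases, threshold):
    """Score every phrase pair (the complete graph) and keep edges >= threshold."""
    vecs = [Counter(p.lower().split()) for p in phrases]  # assumed vectorisation
    adj = {i: set() for i in range(len(phrases))}
    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vecs[i], vecs[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj, node, members=None):
    """Local clustering coefficient of `node`, optionally restricted to `members`."""
    nbrs = adj[node] if members is None else adj[node] & members
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_coefficient(adj, members):
    """Average clustering coefficient of the subgraph induced by `members`."""
    return sum(clustering_coefficient(adj, n, members) for n in members) / len(members)


def seed_and_grow(adj):
    """Seed with the highest-coefficient node and its neighbours (assumption),
    then add a remaining node only if the average coefficient increases."""
    seed = max(adj, key=lambda n: clustering_coefficient(adj, n))
    cluster = {seed} | adj[seed]
    for node in sorted(set(adj) - cluster):
        if average_coefficient(adj, cluster | {node}) > average_coefficient(adj, cluster):
            cluster.add(node)
    return cluster


# Hypothetical phrases standing in for text found between entity pairs.
phrases = [
    "shot and killed",
    "shot and wounded",
    "shot and injured",
    "was arrested by",
    "was detained by",
]
adj = sparse_graph(phrases, threshold=0.5)
cluster = seed_and_grow(adj)
print(sorted(phrases[i] for i in cluster))
```

In this toy input the three "shot and ..." phrases form a triangle after thresholding, so each has a clustering coefficient of 1 and together they form the extracted subgraph; adding either of the remaining phrases would lower the average coefficient, so they are left out.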
Pages: 55-76
Number of pages: 22