A Heuristic Based Pre-processing Methodology for Short Text Similarity Measures in Microblogs

Cited by: 3
Authors
Alnajran, Noufa [1]
Crockett, Keeley [1]
McLean, David [1]
Latham, Annabel [1]
Affiliations
[1] Manchester Metropolitan Univ, Sch Comp Math & Digital Technol, Manchester, Lancs, England
Source
IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS) | 2018
Keywords
Twitter; Short Text Similarity; Text Mining; Natural Language Processing
DOI
10.1109/HPCC/SmartCity/DSS.2018.00265
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Short text similarity measures have many applications in online social networks (OSNs), where they are being integrated into machine learning algorithms. However, data quality is a major challenge in most OSNs, particularly Twitter. The sparse, ambiguous, informal, and unstructured nature of the medium makes it difficult to capture the underlying semantics of the text. Text pre-processing is therefore a crucial phase in applications that rely on similarity identification, such as clustering and classification, because selecting appropriate data processing methods helps increase the correlations of the similarity measure. This research proposes a novel heuristic-driven pre-processing methodology for enhancing the performance of similarity measures in the context of Twitter tweets. The components of the proposed pre-processing methodology are discussed and evaluated on an annotated dataset that was published as part of a SemEval-2014 shared task. An experimental analysis was conducted using the cosine angle as the similarity measure to assess the effect of our method against a baseline (C-Method). Experimental results indicate that our approach outperforms the baseline in terms of correlations and error rates.
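For readers unfamiliar with the evaluation setup, the following is a minimal sketch of how a cosine similarity score between two pre-processed tweets might be computed. It is an illustration only: the abstract does not specify the paper's heuristics, so the `preprocess` function below (lowercasing, URL removal, stripping the `@` and `#` signs) is a hypothetical stand-in and not the authors' methodology, and the bag-of-words term-count vectors are likewise an assumption.

```python
import math
import re
from collections import Counter


def preprocess(tweet):
    """Hypothetical cleanup for illustration; NOT the heuristics proposed in the paper."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)    # keep the word, drop the @ or # sign
    return re.findall(r"[a-z0-9']+", text)      # split into simple word tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors (assumed representation)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tweet has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    t1 = preprocess("Loving the new #NLP course! http://example.com/a")
    t2 = preprocess("@friend the NLP course is great")
    print(round(cosine_similarity(t1, t2), 3))
```

In an evaluation such as the one the abstract describes, scores like these would be compared against human similarity annotations, so any change in pre-processing shows up as a change in correlation and error rate.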
Pages: 1627-1633
Page count: 7