Linguistic analysis of datasets for semantic textual similarity

Cited by: 1
Authors
Wang, Chunlin [1]
Castellon, Irene [2]
Comelles, Elisabet [3]
Affiliations
[1] Artificial Solut Iberia SL, Carrer Calabria 169, Barcelona, Catalonia, Spain
[2] Univ Barcelona, Dept Filol Catalana & Linguist Gen, Barcelona, Spain
[3] Univ Barcelona, Dept Lenguas & Literaturas Modernas & Estudios In, Barcelona, Spain
DOI
10.1093/llc/fqy076
Chinese Library Classification number
C [Social Sciences, General]
Discipline classification code
03; 0303
Abstract
Semantic Textual Similarity (STS), which measures the degree of meaning equivalence between two text segments, is an important task in Natural Language Processing. In this article, we analyze the datasets provided by the Semantic Evaluation (SemEval) 2012-2014 campaigns for this task in order to identify appropriate linguistic features for each dataset, taking into account the influence that linguistic features at different levels (e.g. syntactic constituents and lexical semantics) might have on sentence similarity. Results indicate that a linguistic feature may have different effects on different corpora because sentence structure and vocabulary differ greatly between datasets. Thus, we conclude that selecting linguistic features according to the genre of the text might be a good strategy for obtaining better results in the STS task. This analysis could serve as a useful reference for building similarity-measuring systems and tuning linguistic features.
Pages: 471-484
Page count: 14