Semantic similarity of short texts in languages with a deficient natural language processing support

Cited by: 29
Authors
Furlan, Bojan [1]
Batanovic, Vuk [1]
Nikolic, Bosko [1]
Affiliations
[1] Univ Belgrade, Sch Elect Engn, Dept Comp Engn & Informat Theory, Belgrade 11120, Serbia
Keywords
Linguistic tools for IS modeling; Text DBs; Semantic similarity of words; Similarity of short texts; Corpus-based measures; Paraphrase corpora construction
DOI
10.1016/j.dss.2013.02.002
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Measuring the semantic similarity of short texts is a noteworthy problem since short texts are widely used on the Internet, in the form of product descriptions or captions, image and webpage tags, news headlines, etc. This paper describes a methodology which can be used to create a software system capable of determining the semantic similarity of two given short texts. The proposed LInSTSS approach is particularly suitable for application in situations when no large, publicly available, electronic linguistic resources can be found for the desired language. We describe the basic working principles of the system architecture we propose, as well as the stages of its construction and use. Also, we explain the procedure used to generate a paraphrase corpus which is then utilized in the evaluation process. Finally, we analyze the evaluation results obtained from a system created for the Serbian language, and we discuss possible improvements which would increase system accuracy. (C) 2013 Elsevier B.V. All rights reserved.
Pages: 710-719
Page count: 10