A resource-light method for cross-lingual semantic textual similarity

Cited by: 26
Authors
Glavas, Goran [1]
Franco-Salvador, Marc [2, 3]
Ponzetto, Simone P. [1]
Rosso, Paolo [3]
Affiliations
[1] Univ Mannheim, Sch Business Informat & Math, Data & Web Sci Grp, B6 26, DE-68159 Mannheim, Germany
[2] Symanto Res, Pretzfelder Str 15, DE-90425 Nurnberg, Germany
[3] Univ Politecn Valencia, Pattern Recognit & Human Language Technol Res Ctr, Camino Vera S-N, ES-46022 Valencia, Spain
Keywords
Semantic textual similarity; Cross-lingual; Word embeddings; Word alignment; Parallel sentences alignment; Plagiarism detection;
DOI
10.1016/j.knosys.2017.11.041
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that do not exist for many languages (or language pairs). In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks. © 2017 Published by Elsevier B.V.
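The pipeline described in the abstract (learn a linear map from word translation pairs, project one language's vectors into the other's space, align words by vector similarity, score the alignment) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the least-squares fit, the greedy one-to-one alignment, the averaging in `alignment_similarity`, all function names, and the random toy vectors that stand in for real monolingual embeddings.

```python
import numpy as np


def fit_translation_matrix(src_vecs, tgt_vecs):
    """Least-squares W minimising ||src_vecs @ W - tgt_vecs|| (assumed fitting method)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def _unit_rows(m):
    """Scale every row to unit length so dot products become cosines."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def greedy_align(a, b):
    """Greedily pair rows of a with rows of b, one-to-one, by cosine similarity."""
    sim = _unit_rows(a) @ _unit_rows(b).T
    pairs = []
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        pairs.append((int(i), int(j), float(sim[i, j])))
        sim[i, :] = -np.inf  # each word is used in at most one pair
        sim[:, j] = -np.inf
    return pairs


def alignment_similarity(a, b):
    """Sum of aligned-pair cosines, normalised by the longer text (hypothetical measure)."""
    pairs = greedy_align(a, b)
    return sum(s for _, _, s in pairs) / max(len(a), len(b))


# Toy data: random vectors play the role of embeddings, and an orthogonal
# matrix R plays the role of the unknown relation between the two spaces.
rng = np.random.default_rng(0)
dim = 4
R = np.linalg.qr(rng.normal(size=(dim, dim)))[0]

src_dict = rng.normal(size=(20, dim))   # source side of 20 translation pairs
tgt_dict = src_dict @ R                 # target side of the same pairs
W = fit_translation_matrix(src_dict, tgt_dict)

src_sentence = rng.normal(size=(3, dim))            # three source-language words
tgt_sentence = (src_sentence @ R)[[2, 0, 1]]        # their translations, reordered

pairs = greedy_align(src_sentence @ W, tgt_sentence)
score = alignment_similarity(src_sentence @ W, tgt_sentence)
print(pairs, score)
```

On this noise-free toy data the fitted map recovers `R` and the reordered translation scores close to 1. With real embeddings the map is only approximate, so scores between true translations would be lower.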
Pages: 1-9
Page count: 9