A deep network model for paraphrase detection in short text messages

Cited by: 82
Authors
Agarwal, Basant [1, 2]
Ramampiaro, Heri [1]
Langseth, Helge [1]
Ruocco, Massimiliano [1, 3]
Affiliations
[1] Norwegian Univ Sci & Technol, Dept Comp Sci, Trondheim, Norway
[2] Swami Keshvanand Inst Technol Management & Gramot, Dept Comp Sci & Engn, Jaipur, Rajasthan, India
[3] Telenor Res, Trondheim, Norway
Keywords
Paraphrase detection; Sentence similarity; Deep learning; RNN; CNN; Plagiarism
DOI
10.1016/j.ipm.2018.06.005
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as Twitter messages, which often contain language irregularities and noise, and do not necessarily carry as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN) model, combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using a CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying an RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation has shown that the proposed DeepParaphrase-based approach achieves good results on both types of texts, thus making it more robust and generic than the existing approaches.
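The two levels described in the abstract can be illustrated with a small toy sketch. It is not the DeepParaphrase architecture and contains no trained parameters: the function names, the seeded pseudo-random "embeddings", the window averaging that stands in for the CNN, the fixed tanh recurrence that stands in for the RNN, and the 50/50 weighting of the two scores are all assumptions made purely for illustration.

```python
import math
import random


def embed(word, dim=8):
    """Hypothetical toy embedding: a fixed pseudo-random vector per word."""
    rng = random.Random(word)  # string seed gives a repeatable vector
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def conv_ngrams(vectors, width=2):
    """CNN stand-in: average each window of `width` adjacent word vectors."""
    if len(vectors) < width:
        return [list(v) for v in vectors]
    return [
        [sum(column) / width for column in zip(*vectors[i:i + width])]
        for i in range(len(vectors) - width + 1)
    ]


def recurrent_encode(features, decay=0.5):
    """RNN stand-in: h_t = tanh(decay * h_{t-1} + x_t), returning the last state."""
    state = [0.0] * len(features[0])
    for x in features:
        state = [math.tanh(decay * h + xi) for h, xi in zip(state, x)]
    return state


def sentence_vector(tokens):
    """Coarse-grained view: n-gram features followed by a sequential pass."""
    return recurrent_encode(conv_ngrams([embed(t) for t in tokens]))


def word_similarity_matrix(tokens_a, tokens_b):
    """Fine-grained view: cosine similarity for every word pair."""
    return [[cosine(embed(a), embed(b)) for b in tokens_b] for a in tokens_a]


def paraphrase_score(sentence_a, sentence_b):
    """Combine both views with an assumed equal weighting."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    if not tokens_a or not tokens_b:
        raise ValueError("both sentences must contain at least one token")
    coarse = cosine(sentence_vector(tokens_a), sentence_vector(tokens_b))
    matrix = word_similarity_matrix(tokens_a, tokens_b)
    fine = sum(max(row) for row in matrix) / len(matrix)
    return 0.5 * coarse + 0.5 * fine


if __name__ == "__main__":
    print(paraphrase_score("the cat sat down", "the cat sat down"))
    print(paraphrase_score("the cat sat down", "stocks fell sharply today"))
```

Because the embeddings here are random rather than learned, the scores only show how a sentence-level comparison and a word-level comparison can be combined; they say nothing about the accuracy reported in the paper.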
Pages: 922-937
Number of pages: 16