Sentence similarity based on semantic nets and corpus statistics

Cited by: 450

Authors:

Li, Yuhua [1]

McLean, David

Bandar, Zuhair A.

O'Shea, James D.

Crockett, Keeley

Affiliations:

[1] Univ Ulster, Sch Comp & Intelligent Syst, Coleraine BT48 7JL, Londonderry, North Ireland

[2] Manchester Metropolitan Univ, Dept Comp & Math, Manchester M1 5GD, Lancs, England

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2006 / Vol. 18 / No. 08

Keywords:

sentence similarity; semantic nets; corpus; natural language processing; word similarity

DOI:

10.1109/TKDE.2006.130

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.

Pages: 1138-1150

Number of pages: 13

