Wikipedia-based Semantic Interpretation for Natural Language Processing

Cited by: 204
Authors
Gabrilovich, Evgeniy [1]
Markovitch, Shaul [1]
Affiliation
[1] Technion Israel Inst Technol, Dept Comp Sci, IL-32000 Haifa, Israel
Keywords
TEXT CATEGORIZATION; FEATURE GENERATION; SIMILARITY; CLASSIFIERS; WORDNET;
DOI
10.1613/jair.2669
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
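The idea of representing a text as a weighted vector over concepts can be illustrated with a minimal sketch. It is not the authors' implementation: the three toy "concepts" stand in for Wikipedia articles, and the TF-IDF weighting and cosine comparison are assumptions made for this illustration, since the abstract does not specify them. All names (`CONCEPTS`, `build_index`, `interpret`, `relatedness`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy concepts standing in for Wikipedia articles.
CONCEPTS = {
    "Computer": "computer software hardware program processor memory",
    "Cat": "cat feline pet animal kitten whiskers",
    "Dog": "dog canine pet animal puppy bark",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def build_index(concepts):
    """Map each word to {concept name: TF-IDF weight} (assumed weighting)."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    # Document frequency: number of concepts whose text contains the word.
    df = Counter(word for counts in tf.values() for word in counts)
    index = defaultdict(dict)
    for name, counts in tf.items():
        for word, count in counts.items():
            index[word][name] = count * math.log(n / df[word])
    return index


def interpret(text, index):
    """Concept vector of a text: the sum of its words' concept weights."""
    vector = defaultdict(float)
    for word, count in Counter(tokenize(text)).items():
        for concept, weight in index.get(word, {}).items():
            vector[concept] += count * weight
    return dict(vector)


def relatedness(text_a, text_b, index):
    """Cosine similarity between the concept vectors of two texts."""
    a, b = interpret(text_a, index), interpret(text_b, index)
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    index = build_index(CONCEPTS)
    print(interpret("kitten pet", index))
    print(relatedness("kitten pet", "puppy pet", index))
```

Because every dimension of the vector is a named concept, the nonzero entries returned by `interpret` can be read directly, which is the sense in which such a representation is easy to explain to human users.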
Pages: 443-498
Page count: 56