Entity Synonyms for Structured Web Search

Cited by: 22
Authors
Cheng, Tao [1]
Lauw, Hady W. [2]
Paparizos, Stelios [3]
Affiliations
[1] Microsoft Res Redmond, Redmond, WA 98052 USA
[2] Inst Infocomm Res, Singapore, Singapore
[3] Microsoft Res Silicon Valley, Mountain View, CA 94043 USA
Keywords
Entity synonym; fuzzy matching; structured data; web query; query log;
DOI
10.1109/TKDE.2011.168
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many queries issued to search engines aim to find values from structured data (e.g., movie showtimes for a specific location). In such scenarios, there is often a mismatch between the values in the structured data (how content creators describe entities) and the web queries (how different users try to retrieve them). Recognizing the alternative ways people refer to an entity is therefore crucial for structured web search. In this paper, we study the problem of automatically generating entity synonyms over structured data, with the goal of closing the gap between users and structured data. We propose an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages. In this way, given a set of strings that reference entities, we generate an expanded set of equivalent strings (entity synonyms) for each entity. Our framework consists of three modules: candidate generation, candidate selection, and noise cleaning. We further study the cause of the problem by identifying different classes of entity synonyms. The proposed method is verified with experiments on real-life data sets, which show that it can significantly increase the coverage of structured web queries with good precision.
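The abstract gives only a high-level description of the approach, so the following is a minimal sketch of the general idea rather than the authors' method. It assumes a toy click log of (query, clicked URL) pairs and treats queries that lead to the same pages as the entity string as synonym candidates. The function name `mine_synonyms`, the overlap score, the `min_overlap` threshold, and all sample data are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def mine_synonyms(click_log, entity, min_overlap=0.5):
    """Return (query, score) pairs for queries that click the same pages as `entity`.

    Illustrative only: the scoring and threshold are assumptions,
    not the selection or cleaning rules used in the paper.
    """
    # Group clicked URLs by lowercased query string.
    pages_by_query = defaultdict(set)
    for query, url in click_log:
        pages_by_query[query.lower()].add(url)

    entity_key = entity.lower()
    entity_pages = pages_by_query.get(entity_key, set())
    if not entity_pages:
        return []

    synonyms = []
    for query, pages in pages_by_query.items():
        if query == entity_key:
            continue
        # Candidate generation: the query must share at least one clicked page.
        shared = pages & entity_pages
        if not shared:
            continue
        # Candidate selection: fraction of the query's pages that belong to the entity.
        overlap = len(shared) / len(pages)
        # Noise cleaning (simplified): drop candidates below the threshold.
        if overlap >= min_overlap:
            synonyms.append((query, overlap))
    # Highest score first, ties broken alphabetically.
    return sorted(synonyms, key=lambda item: (-item[1], item[0]))


# Hypothetical click log; URLs and queries are made up for illustration.
CLICK_LOG = [
    ("indiana jones and the kingdom of the crystal skull", "example.com/movie/1"),
    ("indiana jones and the kingdom of the crystal skull", "example.com/wiki/1"),
    ("indiana jones 4", "example.com/movie/1"),
    ("indiana jones 4", "example.com/wiki/1"),
    ("indy 4", "example.com/movie/1"),
    ("harrison ford", "example.com/movie/1"),
    ("harrison ford", "example.com/actor/7"),
    ("harrison ford", "example.com/actor/8"),
    ("movie showtimes", "example.com/showtimes"),
]

result = mine_synonyms(CLICK_LOG, "Indiana Jones and the Kingdom of the Crystal Skull")
print(result)
```

In this toy log, a related but non-equivalent query such as the actor's name shares one page with the entity yet falls below the threshold, which hints at why a separate noise-cleaning step is useful.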
Pages: 1862-1875
Page count: 14