Concordance-based entity-oriented search

Cited by: 2
Authors
Bautin, Mikhail [1]
Skiena, Steven [1]
Affiliation
[1] Department of Computer Science, Stony Brook University, Stony Brook
Source
Web Intelligence and Agent Systems | 2009, Vol. 7, No. 4
Keywords
Entity search; Natural language processing; News and blog analysis; Text processing; Web search
DOI
10.3233/WIA-2009-0170
Abstract
We consider the problem of finding relevant named entities in response to a search query over a given text corpus. Entity search can readily augment conventional web search engines in a variety of applications. We use entity concordance documents to generate lists of relevant entities for arbitrary text queries. To assess the significance of entity search, we analyzed the AOL dataset of 36 million web search queries with respect to two sets of entities: (a) 2.3 million distinct entities extracted from a news text corpus and (b) 2.9 million Wikipedia article titles. The results clearly indicate that search engines should be aware of entities: depending on the matching criterion, 18-39% of all web search queries can be recognized as specifically searching for entities, while 73-87% of all queries contain entities. Our entity search engine creates a concordance document for each entity, consisting of all the sentences in the corpus that contain that entity. We then index and search these documents using open-source search software, which yields a ranked list of entities as the search result. Visit http://www.textmap.com for a demonstration of our entity search engine over a large news corpus. When the query is itself a named entity, we evaluate our system by comparing its results to the list of entities that have the highest statistical juxtaposition scores with the queried entity. The juxtaposition score measures how strongly two entities are related in terms of a probabilistic upper bound. The results show excellent performance, particularly over well-characterized classes of entities such as people. © 2009 - IOS Press and the authors. All rights reserved.
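The concordance idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the open-source search software used, so a simple TF-IDF score stands in for the index, entity mentions are found by naive case-insensitive substring matching, and the function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_concordances(sentences, entities):
    """Map each entity to the sentences that mention it.

    Assumption: a mention is a case-insensitive substring match; a real
    system would rely on a named-entity recognizer instead.
    """
    concordances = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for entity in entities:
            if entity.lower() in lowered:
                concordances[entity].append(sentence)
    return dict(concordances)


def search_entities(query, concordances):
    """Rank entities by a TF-IDF score of the query against each concordance.

    Assumption: TF-IDF is a stand-in for whatever ranking the search
    software in the paper applies.
    """
    term_counts = {
        entity: Counter(tokenize(" ".join(sents)))
        for entity, sents in concordances.items()
    }
    n_docs = len(term_counts)
    scores = {}
    for entity, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            # Number of concordance documents containing the term.
            df = sum(1 for c in term_counts.values() if term in c)
            score += tf * math.log(1 + n_docs / df)
        if score > 0:
            scores[entity] = score
    # Highest-scoring entities first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus and entity list.
sentences = [
    "Steven Skiena teaches algorithms at Stony Brook University.",
    "Stony Brook University is on Long Island.",
    "Steven Skiena wrote a book on algorithms.",
]
entities = ["Steven Skiena", "Stony Brook University"]

concordances = build_concordances(sentences, entities)
print(search_entities("algorithms book", concordances))
```

Because each entity is represented by one document, any ordinary document ranker returns a ranked list of entities directly, which is the property the abstract relies on.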
Pages: 303-319
Number of pages: 16