A new Approach for Similarity Search based on Textual Content

Cited by: 0
Authors
Uwimana, Clotilde [1]
Wu, Renyong [1]
Affiliations
[1] Hunan Univ, Changsha 410082, Hunan, Peoples R China
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION MANAGEMENT AND EVALUATION | 2010
Keywords
Web searching; search engines; similarity search
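DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract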
Today, search engines are the primary way for people to search for and locate information. Although web searching has been a great success and is an effective means of retrieving information, different applications still need methods for retrieving different kinds of information. In each application, search can be based on the hyperlink structure or on the textual content. To save the user from the tedious work of browsing or reading through an entire collection when looking for objects similar to the query object, our work analyzes the content of pages to retrieve similar objects. Similarity search refers to searching for objects similar to a query object. Given a user query, which is an object, the system searches the web for similar objects that are relevant, meaning objects that share attributes or properties with the query object. The challenge is how to determine those attributes from a large collection of documents whose information is not well structured. We evaluate the importance of a word to a document in a collection using the well-known term frequency-inverse document frequency (tf-idf) weight. This weight allows us to select the top k terms, which are deemed to be the common attributes. Those terms are then used as a subsequent query that the system performs to obtain the final results. However, term weighting alone is not enough to obtain similar objects: we also need to check the results returned by the k-terms query in order to eliminate documents that are more relevant to the initial query object, since we are looking for results about similar objects rather than results about the initial query object. This paper presents a new approach for similarity search with two algorithms that obtain similarity search results based on the textual content of search results, through the selection of the top k terms that represent the common attributes of the initial query and its similar objects.
The proposed approach is therefore used as an "in-between" processing step that gives the user a direct way to get the similarity search results.
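The two steps outlined in the abstract (selecting the top k terms by tf-idf weight, then checking the results of the k-terms query) can be illustrated with a minimal Python sketch. This record does not give the paper's two algorithms, so everything here is an assumption made for illustration: the function names, the whitespace tokenizer, the choice to rank terms by their summed tf-idf weight, and the simple rule that drops any result mentioning a term of the initial query.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def top_k_terms(docs, k, exclude=()):
    """Rank terms by summed tf-idf weight (hypothetical aggregation rule)."""
    skip = set(exclude)
    totals = {}
    for doc_weights in tfidf_weights(docs):
        for term, w in doc_weights.items():
            if term not in skip:
                totals[term] = totals.get(term, 0.0) + w
    # Sort by descending weight, then alphabetically so ties are stable.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]


def drop_initial_query_results(results, query_terms):
    """Hypothetical stand-in for the result check: drop any document
    that mentions a term of the initial query object."""
    query = set(query_terms)
    return [r for r in results if not query & set(tokenize(r))]
```

In this sketch, the terms returned by `top_k_terms` would be joined into the subsequent query, and `drop_initial_query_results` would be applied to whatever that query returns. A real system would likely need a softer relevance test than exact term matching.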
Pages: 399-404
Number of pages: 6