Embedding based learning for collection selection in federated search

Cited by: 6

Authors:

Garba, Adamu [1]

Khalid, Shah [2,3]

Ullah, Irfan [4]

Khusro, Shah [4]

Mumin, Diyawu [3,5]

Affiliations:

[1] Jiangsu Univ, Sch Comp & Commun Engn, Zhenjiang, Jiangsu, Peoples R China

[2] Natl Univ Sci & Technol, Sch Elect Engn & Comp Sci SEECS, Islamabad, Pakistan

[3] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang, Jiangsu, Peoples R China

[4] Univ Peshawar, Dept Comp Sci, Peshawar, Pakistan

[5] Tamale Tech Univ, Comp Sci, Tamale, Ghana

Source:

DATA TECHNOLOGIES AND APPLICATIONS | 2020 / Vol. 54 / Issue 05

Keywords:

Federated search; Distributed information retrieval; Collection selection; Word embedding; Word2vec;

DOI:

10.1108/DTA-01-2019-0005

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Purpose: Search engines have difficulty crawling deep-web databases because of their proprietary nature or dynamic content. Distributed Information Retrieval (DIR) addresses this problem by providing a unified search interface to these databases. Because a DIR system must search across many databases, selecting the databases to search for a given user query is challenging. The challenge can be addressed by considering users' past queries, in combination with word embedding techniques, when selecting collections to search. Combining the two would support a better-performing collection selection method and improve the retrieval performance of DIR solutions.

Design/methodology/approach: The authors propose a collection selection model based on word embeddings, using the Word2Vec approach, that learns the similarity between the current query and past queries. Cosine and transformed cosine similarity models are used to compute the similarities among queries. The experiments are conducted on three standard TREC testbeds created for federated search.

Findings: The results show significant improvements over the baseline models.

Originality/value: Although lexical matching models for collection selection that use similarity to past queries exist, to the best of the authors' knowledge, the proposed work is the first to use word embeddings for collection selection by learning from past queries.
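The idea summarized in the abstract, ranking collections by how similar the current query is to past queries in an embedding space, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy vectors stand in for trained Word2Vec embeddings, the query log, the collection names, the word-vector averaging, and the max-similarity aggregation are all hypothetical choices made for illustration, and the transformed cosine variant is left out because the abstract does not define it.

```python
import math
from collections import defaultdict

# Hypothetical toy vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.8, 0.2, 0.1],
    "treatment": [0.7, 0.3, 0.0],
    "football": [0.0, 0.9, 0.2],
    "league": [0.1, 0.8, 0.3],
    "scores": [0.1, 0.7, 0.4],
}

# Hypothetical query log: past query -> collections that served it well.
PAST_QUERIES = {
    "tumor treatment": ["medline"],
    "football league": ["sports_news"],
    "cancer": ["medline", "health_forum"],
}


def embed_query(query):
    """Average the vectors of known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in query.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_collections(query, top_k=2):
    """Score each collection by its most similar past query (an assumption)."""
    q_vec = embed_query(query)
    if q_vec is None:
        return []
    scores = defaultdict(float)
    for past_query, collections in PAST_QUERIES.items():
        p_vec = embed_query(past_query)
        if p_vec is None:
            continue
        sim = cosine(q_vec, p_vec)
        for name in collections:
            scores[name] = max(scores[name], sim)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    for name, score in rank_collections("cancer treatment"):
        print(f"{name}: {score:.3f}")
```

In this sketch a query such as "cancer treatment" shares no exact phrase with the logged query "tumor treatment", yet the two land close together in the embedding space, which is the kind of match a purely lexical model would miss.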

Pages: 703-717

Page count: 15
