HSS: A Hierarchical Semantic Similarity Hard Negative Sampling Method for Dense Retrievers

Cited: 0
Authors
Xie, Xinjia [1]
Liu, Feng [1]
Gai, Shun [1]
Huang, Zhen [1]
Hu, Minghao [2]
Wang, Ankun [1]
Affiliations
[1] Natl Univ Def Technol, Natl Key Lab Parallel & Distributed Proc, Changsha 410000, Peoples R China
[2] Informat Res Ctr Mil Sci, Beijing 100000, Peoples R China
Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023 / Vol. 13834
Keywords
Negative sampling; Dense Retriever; Semantic Similarity; Open-domain question answering;
DOI
10.1007/978-3-031-27818-1_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense Retrievers (DR) for open-domain textual question answering (OpenQA), which aim to retrieve relevant passages from large data sources like Wikipedia or Google, have gained wide attention in recent years. Although DR models continually refresh state-of-the-art performance, their improvement relies on negative sampling during training. Existing sampling strategies mainly focus on designing complex algorithms and ignore the abundant semantic features of datasets. We observe that there are clear variations in semantic similarity and present a three-level hierarchy of semantic similarity: same topic, same class, and other class, whose rationality is further demonstrated by an ablation study. Based on this hierarchy, we propose a hard negative sampling strategy named Hierarchical Semantic Similarity (HSS). Our HSS model performs negative sampling at the topic and class semantic levels, and experimental results on four datasets show that it achieves comparable or better retrieval performance than existing competitive baselines. The code is available at https://github.com/redirecttttt/HSS.
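The three-level hierarchy the abstract describes (same topic, same class, other class) can be sketched as a simple sampling routine. This is a minimal illustration only: the record field names (`topic`, `cls`), the per-level sample counts, and the flat-list corpus representation are assumptions, not the paper's actual implementation.

```python
import random

def sample_hss_negatives(positive, corpus, n_topic=2, n_class=2, n_other=1, seed=0):
    """Draw hard negatives for `positive` at three semantic levels:
    same topic (hardest), same class but different topic, and other class.

    `corpus` is assumed to be a list of dicts with "id", "topic", and "cls"
    keys; these names and the level counts are illustrative choices."""
    rng = random.Random(seed)
    # Level 1: passages sharing the positive's topic (excluding the positive itself).
    same_topic = [p for p in corpus
                  if p["id"] != positive["id"] and p["topic"] == positive["topic"]]
    # Level 2: same class as the positive, but a different topic.
    same_class = [p for p in corpus
                  if p["topic"] != positive["topic"] and p["cls"] == positive["cls"]]
    # Level 3: a different class entirely (the easiest negatives).
    other_class = [p for p in corpus if p["cls"] != positive["cls"]]
    negatives = []
    for pool, k in ((same_topic, n_topic), (same_class, n_class), (other_class, n_other)):
        negatives += rng.sample(pool, min(k, len(pool)))
    return negatives
```

Drawing more negatives from the same-topic pool than from the other-class pool reflects the intuition that semantically closer passages make harder, more informative negatives for contrastive training.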
Pages: 301-312
Page count: 12