HSS: A Hierarchical Semantic Similarity Hard Negative Sampling Method for Dense Retrievers

Cited: 0
Authors
Xie, Xinjia [1]
Liu, Feng [1]
Gai, Shun [1]
Huang, Zhen [1]
Hu, Minghao [2]
Wang, Ankun [1]
Affiliations
[1] Natl Univ Def Technol, Natl Key Lab Parallel & Distributed Proc, Changsha 410000, Peoples R China
[2] Informat Res Ctr Mil Sci, Beijing 100000, Peoples R China
Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023 / Vol. 13834
Keywords
Negative sampling; Dense Retriever; Semantic Similarity; Open-domain question answering;
DOI
10.1007/978-3-031-27818-1_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense Retrievers (DR) for open-domain textual question answering (OpenQA), which aim to retrieve relevant passages from large data sources like Wikipedia or Google, have gained wide attention in recent years. Although DR models continually refresh state-of-the-art performance, their improvement relies on negative sampling during training. Existing sampling strategies mainly focus on designing complex algorithms and ignore the abundant semantic features of datasets. We observe that there are clear variations in semantic similarity and present a three-level hierarchy of semantic similarity: same topic, same class, and other class, whose rationality is further demonstrated by an ablation study. Based on this hierarchy, we propose a hard negative sampling strategy named Hierarchical Semantic Similarity (HSS). Our HSS model performs negative sampling at the topic and class semantic levels, and experimental results on four datasets show that it achieves comparable or better retrieval performance than existing competitive baselines. The code is available at https://github.com/redirecttttt/HSS.
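The three-level hierarchy the abstract describes (same topic, same class, other class) can be sketched as a simple sampling routine. This is a minimal illustration only: the record field names (`topic`, `cls`), the per-level sample counts, and the flat-list corpus representation are assumptions, not the paper's actual implementation.

```python
import random

def sample_hss_negatives(positive, corpus, n_topic=2, n_class=2, n_other=1, seed=0):
    """Draw hard negatives for `positive` at three semantic levels:
    same topic (hardest), same class but different topic, and other class.

    `corpus` is assumed to be a list of dicts with "id", "topic", and "cls"
    keys; these names and the level counts are illustrative choices."""
    rng = random.Random(seed)
    # Level 1: passages sharing the positive's topic (excluding the positive itself).
    same_topic = [p for p in corpus
                  if p["id"] != positive["id"] and p["topic"] == positive["topic"]]
    # Level 2: same class as the positive, but a different topic.
    same_class = [p for p in corpus
                  if p["topic"] != positive["topic"] and p["cls"] == positive["cls"]]
    # Level 3: a different class entirely (the easiest negatives).
    other_class = [p for p in corpus if p["cls"] != positive["cls"]]
    negatives = []
    for pool, k in ((same_topic, n_topic), (same_class, n_class), (other_class, n_other)):
        negatives += rng.sample(pool, min(k, len(pool)))
    return negatives
```

Drawing more negatives from the same-topic pool than from the other-class pool reflects the intuition that semantically closer passages make harder, more informative negatives for contrastive training.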
Pages: 301-312
Page count: 12