Sensitive Topics Retrieval in Digital Libraries: A Case Study of Hadit Collections

Cited by: 0
Authors
Sullutrone, Giovanni [1 ,2 ]
Vigliermo, Riccardo Amerigo [1 ,3 ]
Sala, Luca [1 ,2 ]
Bergamaschi, Sonia [1 ,2 ]
Affiliations
[1] Univ Modena Reggio Emilia UNIMORE, Modena, Italy
[2] UNIMORE, DBGrp, Modena, Italy
[3] Fdn Sci Religiose FSCIRE, Bologna, Italy
Source
LINKING THEORY AND PRACTICE OF DIGITAL LIBRARIES, PT II, TPDL 2024 | 2024, Vol. 15178
Keywords
Retrieval-Augmented Generation; Bias; Digital Libraries; Sensitive Topics; Islamic studies; hadit collections;
DOI
10.1007/978-3-031-72440-4_5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG) to incorporate query-specific knowledge at inference time. In this paper, the trustworthiness of RAG systems is investigated, focusing in particular on the performance of their retrieval phase when dealing with sensitive topics. This issue is particularly relevant because it could hinder a user's ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library that may contain sensitive topics, a hadit dataset was curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses document-retrieval performance in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications, this would mean relevant documents being ranked lower in the retrieval list, leading to the inclusion of irrelevant information or the omission of relevant information when a lower cut-off is applied.
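To illustrate the kind of retrieval-phase evaluation the QCR framework describes, the following is a minimal sketch in Python. It assumes a toy corpus of passages already labelled as sensitive or not, questions each generated from a single relevant passage, the sentence-transformers model all-MiniLM-L6-v2 as the embedding model, and mean reciprocal rank as the metric; these choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: measuring a retrieval gap between sensitive and
# non-sensitive questions with a sentence-embedding retriever.
# The model name, toy data, and use of reciprocal rank are assumptions
# for illustration, not the QCR implementation from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus: each passage is tagged as sensitive or not (Passage Classification).
passages = [
    {"id": 0, "text": "Passage discussing a neutral everyday ruling.", "sensitive": False},
    {"id": 1, "text": "Passage touching on a sensitive topic.", "sensitive": True},
    {"id": 2, "text": "Another neutral passage about daily practice.", "sensitive": False},
]

# Toy questions, each generated from exactly one passage (Question Generation),
# so that passage is the single relevant document for the question.
questions = [
    {"text": "What does the text say about the everyday ruling?", "relevant_id": 0},
    {"text": "What does the text say about the sensitive topic?", "relevant_id": 1},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
passage_emb = model.encode([p["text"] for p in passages], normalize_embeddings=True)
question_emb = model.encode([q["text"] for q in questions], normalize_embeddings=True)

def reciprocal_rank(q_vec, relevant_id):
    """Rank passages by cosine similarity and return 1 / rank of the relevant one."""
    scores = passage_emb @ q_vec          # cosine similarity (embeddings are normalized)
    order = np.argsort(-scores)           # passage indices, best first
    rank = int(np.where(order == relevant_id)[0][0]) + 1
    return 1.0 / rank

# Group results by whether the relevant passage is sensitive (Passage Retrieval step).
by_group = {True: [], False: []}
for q, q_vec in zip(questions, question_emb):
    sensitive = passages[q["relevant_id"]]["sensitive"]
    by_group[sensitive].append(reciprocal_rank(q_vec, q["relevant_id"]))

for sensitive, rrs in by_group.items():
    label = "sensitive" if sensitive else "non-sensitive"
    print(f"MRR ({label}): {np.mean(rrs):.3f}")
```

Comparing the per-group MRR (or recall at a fixed cut-off) makes the abstract's observation concrete: a lower score for the sensitive group means relevant passages sit further down the ranked list and may fall below the retrieval cut-off entirely.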
Pages: 51-62
Number of pages: 12