Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections

Cited by: 0
Authors
Sullutrone, Giovanni [1, 2]
Vigliermo, Riccardo Amerigo [1, 3]
Sala, Luca [1, 2]
Bergamaschi, Sonia [1, 2]
Affiliations
[1] Univ Modena Reggio Emilia UNIMORE, Modena, Italy
[2] UNIMORE, DBGrp, Modena, Italy
[3] Fdn Sci Religiose FSCIRE, Bologna, Italy
Source
LINKING THEORY AND PRACTICE OF DIGITAL LIBRARIES, PT II, TPDL 2024 | 2024 / Vol. 15178
Keywords
Retrieval-Augmented Generation; Bias; Digital Libraries; Sensitive Topics; Islamic studies; ḥadīṯ collections
DOI
10.1007/978-3-031-72440-4_5
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG), which incorporate query-specific knowledge at inference time. In this paper, the trustworthiness of RAG systems is investigated, focusing on the performance of their retrieval phase when dealing with sensitive topics. This issue is particularly relevant because it could hinder a user's ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library that may contain sensitive topics, a ḥadīṯ dataset has been curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses the performance of document retrieval in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications, this would mean relevant documents being placed lower in the retrieval list, leading to the presence of irrelevant information or, with a lower cut-off, the absence of relevant information.
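The cut-off effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' QCR implementation: the function names, passage identifiers, group labels, and toy rankings below are all hypothetical, and recall at a cut-off k is used here only as one plausible way to compare retrieval quality across two groups of queries.

```python
from statistics import mean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_by_group(runs, k):
    """Average recall@k per group label.

    runs: iterable of (group, ranked_ids, relevant_ids) tuples.
    """
    per_group = {}
    for group, ranked_ids, relevant_ids in runs:
        score = recall_at_k(ranked_ids, relevant_ids, k)
        per_group.setdefault(group, []).append(score)
    return {group: mean(scores) for group, scores in per_group.items()}


# Hypothetical toy rankings: the relevant passage sits near the top for
# non-sensitive queries and lower in the list for sensitive ones.
runs = [
    ("non_sensitive", ["p1", "p2", "p3", "p4"], ["p1"]),
    ("non_sensitive", ["p7", "p5", "p6", "p8"], ["p5"]),
    ("sensitive", ["p9", "p10", "p11", "p12"], ["p11"]),
    ("sensitive", ["p13", "p14", "p15", "p16"], ["p16"]),
]

low_cutoff = mean_recall_by_group(runs, k=2)
high_cutoff = mean_recall_by_group(runs, k=4)
print(low_cutoff)
print(high_cutoff)
```

In this toy data, a low cut-off drops the relevant passages of the sensitive group entirely, while a higher cut-off recovers them at the cost of returning more passages, most of which are irrelevant.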
Pages: 51-62
Number of pages: 12