CoQUAD: a COVID-19 question answering dataset system, facilitating research, benchmarking, and practice

Citations: 20
Authors
Raza, Shaina [1, 2]
Schwartz, Brian [1, 2]
Rosella, Laura C. [1, 2]
Affiliations
[1] Public Health Ontario (PHO), Toronto, ON, Canada
[2] University of Toronto, Dalla Lana School of Public Health, Toronto, ON, Canada
Funding
Canadian Institutes of Health Research;
Keywords
COVID-19; Transformer model; Question answering system; Pipeline; CORD-19; LitCOVID; Long-COVID; Post-COVID-19;
DOI
10.1186/s12859-022-04751-6
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The volume of COVID-19 research literature is growing so quickly that medical experts, clinical scientists, and researchers frequently struggle to stay up to date on the most recent findings. There is a pressing need to help researchers and practitioners mine this literature and answer COVID-19-related questions in a timely manner.
Methods: This paper introduces CoQUAD, a question-answering system that efficiently extracts answers to COVID-19-related questions. Two datasets are provided in this work: a reference-standard dataset built using the CORD-19 and LitCOVID initiatives, and a gold-standard dataset prepared by public health domain experts. CoQUAD has a Retriever component, based on the BM25 algorithm, that searches the reference-standard dataset for documents relevant to a COVID-19-related question. It also has a Reader component, consisting of a Transformer-based model (MPNet), that reads the paragraphs of the retrieved documents and finds the answers to the question. In contrast to previous works, CoQUAD can answer questions on early, mid, and post-COVID-19 topics.
Results: Extensive experiments on the CoQUAD Retriever and Reader modules show that CoQUAD can provide effective and relevant answers to COVID-19-related questions posed in natural language. Compared with state-of-the-art baselines, CoQUAD outperforms the previous models, achieving an exact match ratio of 77.50% and an F1 score of 77.10%.
Conclusion: CoQUAD is a question-answering system that mines COVID-19 literature using natural language processing techniques to help the research community find the most recent findings and answer related questions.
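The Retriever step described in the abstract ranks documents against a question with BM25. The sketch below illustrates the general Okapi BM25 scoring function in plain Python; it is not the authors' implementation. The function names (`tokenize`, `bm25_rank`), the whitespace tokenizer, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score.

    k1 and b are commonly used default values, assumed here; they are
    not taken from the CoQUAD paper.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for index, doc in enumerate(tokenized):
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            # Length normalisation: longer documents are penalised via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-corpus standing in for the reference-standard dataset.
documents = [
    "long covid symptoms include fatigue and brain fog",
    "vaccines reduce the risk of severe covid illness",
    "influenza season peaks in winter",
]
ranking = bm25_rank("what are long covid symptoms", documents)
```

In a retriever-reader pipeline of the kind the abstract outlines, the top-ranked documents from a step like this would then be passed to the Reader model, which extracts the answer span.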
Pages: 28