Poisoning medical knowledge using large language models

Cited: 7
Authors
Yang, Junwei [1 ]
Xu, Hanwen [2 ]
Mirzoyan, Srbuhi [1 ]
Chen, Tong [2 ]
Liu, Zixuan [2 ]
Liu, Zequn [1 ]
Ju, Wei [1 ]
Liu, Luchen [1 ]
Xiao, Zhiping [2 ]
Zhang, Ming [1 ]
Wang, Sheng [2 ]
Affiliations
[1] Peking Univ, Sch Comp Sci, Anker Embodied AI Lab, State Key Lab Multimedia Informat Proc, Beijing, Peoples R China
[2] Univ Washington, Paul G Allen Sch Comp Sci & Engn, Seattle, WA 98195 USA
Funding
National Natural Science Foundation of China
DOI
10.1038/s42256-024-00899-3
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Biomedical knowledge graphs (KGs) constructed from the medical literature have been widely used to validate biomedical discoveries and generate new hypotheses. Recently, large language models (LLMs) have demonstrated a strong ability to generate human-like text. Although most of this generated text is useful, LLMs can also be used to produce malicious content. Here, we investigate whether a malicious actor can use an LLM to generate a malicious paper that poisons medical KGs and, in turn, affects downstream biomedical applications. As a proof of concept, we develop Scorpius, a conditional text-generation model that produces a malicious paper abstract conditioned on a promoted drug and a target disease. The goal is to fool a medical KG constructed from a mixture of this malicious abstract and millions of real papers, so that KG consumers misidentify the promoted drug as relevant to the target disease. We evaluated Scorpius on a KG constructed from 3,818,528 papers and found that it can raise the relevance of 71.3% of drug-disease pairs from the top 1,000 to the top ten by adding only one malicious abstract. Moreover, Scorpius's generations achieve better perplexity than ChatGPT's, suggesting that such malicious abstracts cannot be efficiently detected by humans. Collectively, Scorpius demonstrates that LLMs can be used to poison medical KGs and manipulate downstream applications, underscoring the importance of accountable and trustworthy medical knowledge discovery in the era of LLMs.
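The poisoning idea in the abstract can be illustrated with a toy sketch. This is not the paper's Scorpius pipeline (which attacks a KG built by relation extraction over millions of papers); the corpus, the drug names (e.g. "drugx"), and the sentence-level co-occurrence score below are all invented here, only to show how a single crafted abstract can flip a relevance ranking:

```python
# Toy illustration (NOT the paper's method): a fabricated abstract that
# packs several drug-disease sentences can promote an irrelevant drug
# in a naive co-occurrence-based relevance ranking.

def pair_score(corpus, drug, disease):
    """Count sentences in which the drug and disease co-occur."""
    score = 0
    for abstract in corpus:
        for sentence in abstract.lower().split("."):
            words = set(sentence.split())
            if drug in words and disease in words:
                score += 1
    return score

def rank_drugs(corpus, drugs, disease):
    """Rank candidate drugs by descending co-occurrence score."""
    return sorted(drugs, key=lambda d: -pair_score(corpus, d, disease))

drugs = ["aspirin", "metformin", "drugx"]
corpus = [
    "aspirin reduces arthritis pain. aspirin improves arthritis outcomes.",
    "metformin was tested in arthritis patients.",
]

before = rank_drugs(corpus, drugs, "arthritis")  # drugx ranks last

# Inject one malicious abstract with several relation-bearing sentences.
malicious = ("drugx cures arthritis. drugx outperforms existing drugs "
             "in arthritis. drugx is safe for arthritis patients.")
after = rank_drugs(corpus + [malicious], drugs, "arthritis")

print(before)  # ['aspirin', 'metformin', 'drugx']
print(after)   # ['drugx', 'aspirin', 'metformin']
```

The real attack is harder (the KG builder extracts typed relations and the ranking uses graph reasoning), but the mechanism is the same: one adversarial document shifts the evidence the downstream consumer trusts.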
Pages: 1156-1168 (13 pages)