Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning-Based Information Extraction: Development of a Natural Language Processing Algorithm

Authors
Gendrin, Aline [1, 5]
Souliotis, Leonidas [1 ]
Loudon-Griffiths, James [1 ]
Aggarwal, Ravisha [2 ]
Amoako, Daniel [3 ]
Desouza, Gregory [1 ]
Dimitrievska, Sashka [4 ]
Metcalfe, Paul [1 ]
Louvet, Emilie [1 ]
Sahni, Harpreet [3 ]
Affiliations
[1] AstraZeneca, Cambridge, England
[2] AstraZeneca, Bangalore, India
[3] AstraZeneca, Wilmington, DE USA
[4] AstraZeneca, Gaithersburg, MD USA
[5] AstraZeneca, City House, 126-130 Hills Rd, Cambridge CB2 1RY, England
Keywords
algorithm; artificial intelligence; BERT; cancer; classification; data extraction; data mining; deep-learning; development; drug approval; free text; information retrieval; line of therapy; machine learning; natural language processing; NLP; oncology; pharmaceutic; pharmacology; pharmacy; stage of cancer; text extraction; text mining; unstructured data
DOI
10.2196/44876
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)]
Abstract
Background: New drug treatments are regularly approved, and it is challenging to remain up-to-date in this rapidly changing environment. Fast and accurate visualization is important to allow a global understanding of the drug market. Automation of this information extraction provides a helpful starting point for the subject matter expert, helps to mitigate human errors, and saves time.

Objective: We aimed to semiautomate disease population extraction from the free text of oncology drug approval descriptions from the BioMedTracker database for 6 selected drug targets. More specifically, we intended to extract (1) line of therapy, (2) stage of cancer of the patient population described in the approval, and (3) the clinical trials that provide evidence for the approval. We aimed to use these results in downstream applications, aiding the searchability of relevant content against related drug project sources.

Methods: We fine-tuned a state-of-the-art deep learning model, Bidirectional Encoder Representations from Transformers, for each of the 3 desired outputs. We independently applied rule-based text mining approaches. We compared the performances of deep learning and rule-based approaches and selected the best method, which was then applied to new entries. The results were manually curated by a subject matter expert and then used to train new models.

Results: The training data set is currently small (433 entries) and will enlarge over time when new approval descriptions become available or if a choice is made to take another drug target into account. The deep learning models achieved 61% and 56% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively, which were treated as classification tasks. Trial identification is treated as a named entity recognition task, and the 5-fold cross-validated F1-score is currently 87%. Although the scores of the classification tasks could seem low, the models comprise 5 classes each, and such scores are a marked improvement when compared to random classification. Moreover, we expect improved performance as the input data set grows, since deep learning models need to be trained on a large enough amount of data to be able to learn the task they are taught. The rule-based approach achieved 60% and 74% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively. No attempt was made to define a rule-based approach for trial identification.

Conclusions: We developed a natural language processing algorithm that is currently assisting subject matter experts in disease population extraction, which supports health authority approvals. This algorithm achieves semiautomation, enabling subject matter experts to leverage the results for deeper analysis and to accelerate information retrieval in a crowded clinical environment such as oncology.
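
As a rough illustration of the kind of rule-based text mining the abstract mentions for line of therapy, the Python sketch below matches keyword patterns against an approval description. The label names, the regular expressions, and the helper functions `classify_line_of_therapy`, `accuracy`, and `chance_accuracy` are all hypothetical: the abstract does not list the 5 classes or the actual rules, so this should be read as a minimal sketch under those assumptions and not as the authors' method.

```python
import re

# Hypothetical labels and patterns: the abstract does not specify the
# 5 classes or the rules, so these are illustrative assumptions only.
RULES = [
    ("first-line", re.compile(
        r"\b(first[- ]line|1st[- ]line|previously untreated|treatment[- ]naive)\b",
        re.IGNORECASE)),
    ("second-line or later", re.compile(
        r"\b(second[- ]line|2nd[- ]line|previously treated|relapsed|refractory)\b",
        re.IGNORECASE)),
    ("adjuvant", re.compile(r"\b(neo)?adjuvant\b", re.IGNORECASE)),
]


def classify_line_of_therapy(text: str) -> str:
    """Return the label of the first rule whose pattern matches, else 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that equal the curated gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / n_classes


if __name__ == "__main__":
    # Made-up example sentences, not taken from BioMedTracker.
    texts = [
        "Approved for the first-line treatment of metastatic disease.",
        "Approved for patients with relapsed or refractory disease.",
        "Approved as adjuvant therapy after tumor resection.",
    ]
    gold = ["first-line", "second-line or later", "adjuvant"]
    predictions = [classify_line_of_therapy(t) for t in texts]
    print(predictions, accuracy(predictions, gold), chance_accuracy(5))
```

Under the simplifying assumption of uniform random guessing over 5 balanced classes, the chance level is 20%, which is the comparison point for the 56% to 74% accuracies reported in the abstract.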
Pages: 12