SUSIE: Pharmaceutical CMC ontology-based information extraction for drug development using machine learning

Cited by: 6
Authors
Mann, Vipul [1]
Viswanath, Shekhar [2]
Vaidyaraman, Shankar [2]
Balakrishnan, Jeya [2]
Venkatasubramanian, Venkat [1]
Affiliations
[1] Columbia Univ, Dept Chem Engn, New York, NY 10027 USA
[2] Eli Lilly & Co, Lilly Corp Ctr, Indianapolis, IN USA
Keywords
Ontology; Pharmaceutical drug development; Information extraction; Hybrid machine learning; Chemistry manufacturing and control; SYSTEM; RECOGNITION
DOI
10.1016/j.compchemeng.2023.108446
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Automatically extracting information from unstructured text in pharmaceutical documents is important for drug discovery and development. This information can be integrated with structured datasets to ultimately accelerate pharmaceutical product development. To this end, we report an end-to-end information extraction framework based on a custom-built pharmaceutical drug development ontology, a weak supervision framework, contextualization algorithms, and a fine-tuned BioBERT model (adaptation of BERT or Bidirectional Encoder Representations from Transformers for biomedical text). The proposed framework, SUSIE (Schema-based Un-supervised Semantic Information Extraction), was trained on ICH (International Conference on Harmonization) documents to identify important entities and relations from unstructured text and auto-generate knowledge graphs representing crucial information in a structured format. On the entity identification task, the framework achieves a test accuracy and F1-score of 96% and 88%, respectively, on out-of-sample documents. A major contribution of this work is to build an automated, unsupervised information extraction framework around a domain-specific, custom-built pharmaceutical drug development ontology without the need for manual curation of training datasets for specific tasks. The efficacy of the approach was tested on out-of-sample documents including an internal Eli Lilly technical document.
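The abstract names ontology-driven weak supervision but does not describe how labels are produced. The following is a minimal sketch of one common approach, dictionary matching of ontology terms to generate BIO token tags without manual annotation. It is an illustration under stated assumptions, not the authors' method: the `ONTOLOGY` classes and terms and the `weak_label` function are hypothetical and not taken from the paper.

```python
import re

# Hypothetical mini-ontology: class name -> surface terms.
# Illustrative only; not taken from the paper's ontology.
ONTOLOGY = {
    "Material": ["drug substance", "excipient"],
    "Process": ["granulation", "crystallization"],
    "Attribute": ["particle size", "purity"],
}


def weak_label(sentence, ontology=ONTOLOGY):
    """Return (token, tag) pairs, with BIO tags from dictionary matches."""
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)

    # Try longer terms first so multi-word terms take precedence.
    terms = sorted(
        ((term.split(), cls) for cls, ts in ontology.items() for term in ts),
        key=lambda pair: -len(pair[0]),
    )
    for words, cls in terms:
        n = len(words)
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and span_free:
                tags[i] = "B-" + cls
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + cls
    return list(zip(tokens, tags))


labeled = weak_label(
    "Granulation affects the particle size of the drug substance."
)
for token, tag in labeled:
    print(f"{token}\t{tag}")
```

Tags produced this way are noisy, so in a weak-supervision setting they would typically serve as training targets for a token classifier (the abstract mentions a fine-tuned BioBERT model) rather than as final output.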
Pages: 15