Large-scale extraction of drug-disease pairs from the medical literature

Cited by: 17
Authors
Wang, Pengwei [1]
Hao, Tianyong [2]
Yan, Jun [3]
Jin, Lianwen [1]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Univ Foreign Studies, Cisco Sch Informat, Guangzhou, Guangdong, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Keywords
KNOWLEDGE; ACQUISITION;
DOI
10.1002/asi.23876
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Automatically extracting accurate drug-disease pairs at large scale from the medical literature plays an important role in drug repurposing. However, most existing extraction methods are supervised, and manually labeling drug-disease pair datasets is costly and time-consuming, even though many drug-disease pairs are buried in free text. In this work, we first use a pattern-based method to automatically extract drug-disease pairs with treatment and inducement relationships from free text. Then, to reflect a drug-disease relation, we propose a network embedding algorithm that calculates the degree of correlation of a drug-disease pair. In the experiments, we apply the method to 27 million medical abstracts and titles available on PubMed and extract 138,318 unique treatment pairs and 75,396 unique inducement pairs. Our algorithm achieves a precision of 0.912 and a recall of 0.898 in extracting frequent treatment drug-disease pairs, and a precision of 0.923 and a recall of 0.833 in extracting frequent inducement drug-disease pairs. In addition, the proposed information network embedding algorithm efficiently reflects the degree of correlation of drug-disease pairs. Our algorithm achieves a precision of 0.802 and a recall of 0.783 in the fine-grained evaluation of extracting frequent pairs.
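As a rough illustration of what a pattern-based extraction step and its precision/recall evaluation can look like, the following minimal Python sketch matches two relation types with regular expressions. The lexicons, the patterns, and the function names (`extract_pairs`, `precision_recall`) are hypothetical and chosen for this example only; they are not the patterns or code used in the paper, and the network embedding step is not shown.

```python
import re
from collections import Counter

# Hypothetical toy lexicons; a real system would use large drug and disease vocabularies.
DRUGS = ["aspirin", "metformin", "cisplatin"]
DISEASES = ["headache", "type 2 diabetes", "nephrotoxicity"]


def alternation(terms):
    """Build a regex alternation, longest terms first so multi-word names win."""
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


DRUG = rf"\b(?P<drug>{alternation(DRUGS)})\b"
DISEASE = rf"\b(?P<disease>{alternation(DISEASES)})\b"

# Illustrative surface patterns only (assumption, not taken from the paper).
PATTERNS = {
    "treatment": [
        re.compile(rf"{DRUG} (?:is used to treat|for the treatment of) {DISEASE}", re.I),
    ],
    "inducement": [
        re.compile(rf"{DRUG}[- ]induced {DISEASE}", re.I),
    ],
}


def extract_pairs(sentences):
    """Count (relation, drug, disease) triples matched by any pattern."""
    counts = Counter()
    for sentence in sentences:
        for relation, patterns in PATTERNS.items():
            for pattern in patterns:
                for m in pattern.finditer(sentence):
                    counts[(relation, m["drug"].lower(), m["disease"].lower())] += 1
    return counts


def precision_recall(predicted, gold):
    """Precision and recall of a predicted set of triples against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    sentences = [
        "Metformin is used to treat type 2 diabetes.",
        "Cisplatin-induced nephrotoxicity was observed in some patients.",
    ]
    for triple, count in extract_pairs(sentences).items():
        print(triple, count)
```

Counting how often each triple is matched makes it possible to separate frequent pairs from rare ones, which is the kind of subset the reported precision and recall figures refer to.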
Pages: 2649-2661
Page count: 13