Large-scale extraction of drug-disease pairs from the medical literature

Cited by: 20
Authors
Wang, Pengwei [1]
Hao, Tianyong [2]
Yan, Jun [3]
Jin, Lianwen [1]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Univ Foreign Studies, Cisco Sch Informat, Guangzhou, Guangdong, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Keywords
KNOWLEDGE; ACQUISITION;
DOI
10.1002/asi.23876
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Automatic, accurate extraction of drug-disease pairs at large scale from the medical literature plays an important role in drug repurposing. However, most existing extraction methods are supervised, and manually labeling drug-disease pair datasets is costly and time-consuming, even though many drug-disease pairs remain buried in free text. In this work, we first use a pattern-based method to automatically extract drug-disease pairs with treatment and inducement relationships from free text. Then, to reflect the strength of a drug-disease relation, we propose a network embedding algorithm that calculates the degree of correlation of a drug-disease pair. In the experiments, we apply the method to 27 million medical abstracts and titles available on PubMed and extract 138,318 unique treatment pairs and 75,396 unique inducement pairs. Our algorithm achieves a precision of 0.912 and a recall of 0.898 in extracting frequent treatment drug-disease pairs, and a precision of 0.923 and a recall of 0.833 in extracting frequent inducement drug-disease pairs. In addition, the proposed information network embedding algorithm can efficiently reflect the degree of correlation of drug-disease pairs. Our algorithm achieves a precision of 0.802 and a recall of 0.783 in the fine-grained evaluation of extracting frequent pairs.
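As a rough illustration of the pattern-based step and the precision/recall figures described in the abstract, the following minimal sketch matches two hand-written surface patterns against sentences and scores the extracted pairs against a small gold set. The patterns, function names, and example sentences are hypothetical and chosen only for illustration; the abstract does not specify the authors' actual pattern set, and the network embedding step is not sketched here.

```python
import re
from collections import Counter

# Hypothetical surface patterns (assumptions, not the paper's pattern set).
# Each pattern captures a single-word drug and a single-word disease.
PATTERNS = {
    "treatment": re.compile(
        r"\b(?P<drug>[a-z]+) (?:is|was) used to treat (?P<disease>[a-z]+)\b"
    ),
    "inducement": re.compile(
        r"\b(?P<disease>[a-z]+) (?:is|was) induced by (?P<drug>[a-z]+)\b"
    ),
}


def extract_pairs(sentences):
    """Count (relation, drug, disease) triples matched by the patterns."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for relation, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                key = (relation, match.group("drug"), match.group("disease"))
                counts[key] += 1
    return counts


def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example sentences, not drawn from PubMed.
    sentences = [
        "Metformin is used to treat diabetes.",
        "Hepatotoxicity was induced by isoniazid.",
        "Metformin was used to treat diabetes in adults.",
    ]
    for triple, frequency in extract_pairs(sentences).items():
        print(triple, frequency)
```

Counting how often each pair is matched gives one simple way to separate frequent pairs from rare ones, which is the distinction the reported precision and recall figures rely on.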
Pages: 2649-2661
Page count: 13