Large-scale extraction of drug-disease pairs from the medical literature

Cited by: 20

Authors:

Wang, Pengwei [1]

Hao, Tianyong [2]

Yan, Jun [3]

Jin, Lianwen [1]

Affiliations:

[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou, Guangdong, Peoples R China

[2] Guangdong Univ Foreign Studies, Cisco Sch Informat, Guangzhou, Guangdong, Peoples R China

[3] Microsoft Res Asia, Beijing, Peoples R China

Source:

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY | 2017 / Vol. 68 / Issue 11

Keywords:

KNOWLEDGE; ACQUISITION;

DOI:

10.1002/asi.23876

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Automatically extracting accurate drug-disease pairs at large scale from the medical literature plays an important role in drug repurposing. However, most existing extraction methods are supervised, and manually labeling drug-disease pair datasets is costly and time-consuming, even though many drug-disease pairs remain buried in free text. In this work, we first use a pattern-based method to automatically extract drug-disease pairs with treatment and inducement relationships from free text. Then, to quantify the strength of each drug-disease relation, we propose a network embedding algorithm that calculates the degree of correlation of a drug-disease pair. In the experiments, we apply the method to 27 million medical abstracts and titles available on PubMed and extract 138,318 unique treatment pairs and 75,396 unique inducement pairs. Our algorithm achieves a precision of 0.912 and a recall of 0.898 in extracting frequent treatment drug-disease pairs, and a precision of 0.923 and a recall of 0.833 in extracting frequent inducement drug-disease pairs. In addition, the proposed information network embedding algorithm efficiently reflects the degree of correlation of drug-disease pairs. Our algorithm achieves a precision of 0.802 and a recall of 0.783 in the fine-grained evaluation of extracting frequent pairs.
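The abstract does not list the actual patterns, so the following is only a minimal sketch of what pattern-based extraction of treatment and inducement pairs could look like. The lexicons (`DRUGS`, `DISEASES`), the four regular expressions, and the function name `extract_pairs` are all hypothetical and chosen for illustration; they are not the authors' patterns or code.

```python
import re
from collections import Counter

# Hypothetical toy lexicons; a real system would use a large drug/disease vocabulary.
DRUGS = ["aspirin", "metformin", "cisplatin"]
DISEASES = ["headache", "type 2 diabetes", "nephrotoxicity"]


def _alt(terms):
    # Build a regex alternation, longest terms first so multi-word names match whole.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


DRUG, DISEASE = _alt(DRUGS), _alt(DISEASES)

# Hypothetical lexical patterns for the two relation types named in the abstract.
PATTERNS = {
    "treatment": [
        rf"\b(?P<drug>{DRUG})\b (?:in|for) the treatment of \b(?P<disease>{DISEASE})\b",
        rf"\b(?P<disease>{DISEASE})\b (?:was|were|is|are) treated with \b(?P<drug>{DRUG})\b",
    ],
    "inducement": [
        rf"\b(?P<drug>{DRUG})\b[- ]induced \b(?P<disease>{DISEASE})\b",
        rf"\b(?P<disease>{DISEASE})\b (?:induced|caused) by \b(?P<drug>{DRUG})\b",
    ],
}


def extract_pairs(sentences):
    """Count (relation, drug, disease) triples matched by any pattern."""
    counts = Counter()
    for sentence in sentences:
        for relation, patterns in PATTERNS.items():
            for pattern in patterns:
                for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                    counts[(relation, m["drug"].lower(), m["disease"].lower())] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "Metformin for the treatment of type 2 diabetes remains first-line.",
        "Cisplatin-induced nephrotoxicity limits dosing.",
        "Headache was treated with aspirin.",
    ]
    for triple, n in extract_pairs(demo).items():
        print(triple, n)
```

Counting how often each triple is matched gives a simple frequency signal, which is one plausible way to separate "frequent" pairs from rare ones. The abstract does not say how the authors define that threshold.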


Pages: 2649-2661

Page count: 13

