Hybrid ontology-learning materials engineering system for pharmaceutical products: Multi-label entity recognition and concept detection

Cited by: 14
Authors
Remolona, Miguel Francisco M. [1, 2]
Conway, Matthew F. [1]
Balasubramanian, Sriram [1]
Fan, Linxi [1]
Feng, Ziyan [1]
Gu, Tianhao [1]
Kim, Hyungtae [1]
Nirantar, Prasad M. [1]
Panda, Sarah [1]
Ranabothu, Nithin R. [1]
Rastogi, Neha [1]
Venkatasubramanian, Venkat [1]
Affiliations
[1] Columbia Univ, Dept Chem Engn, Complex Resilient Intelligent Syst Lab, New York, NY 10027 USA
[2] Univ Philippines, Chem Engn Dept, Coll Engn, Quezon City, Philippines
Keywords
Natural language processing; Entity recognition; Ontology; Machine learning; Concept detection; MODELING KNOWLEDGE MANAGEMENT;
DOI
10.1016/j.compchemeng.2017.03.012
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The dawn of a new era in knowledge management due to information explosion is making old habits of modeling knowledge and decision-making inadequate. In the search for new modeling paradigms, we expect ontologies to play a big role. One of the critical challenges we face is the scarcity of semantically rich, properly populated ontologies in most application domains in chemical and materials engineering. Developing such ontologies is a very challenging task requiring considerable investment in time, effort, and expert knowledge. One needs automation tools that can assist an ontology engineer to quickly develop and curate domain-specific ontologies. We consider our conceptual framework in this paper, a general approach for populating scientific ontologies, and its implementation as the prototype HOLMES, as an early attempt towards such an automated knowledge management environment. Our approach integrates a variety of machine learning and natural language processing methods to extract information from journal articles and store it semantically in an ontology. In this work, identification of key terms (such as chemicals, drugs, processes, anatomical entities, etc.) from abstracts, and the classification of these terms into 25 classes are presented. Two methods, a multi-class classifier (SVM) and a multi-label classifier (HOMER), were tested on an annotated data set for the pharmaceutical industry. The test was done using two different versions of the same data set, one using the BIO notation and the other not. The F1 scores for HOMER were better in the BIO notation (63.6% vs 48.5%) while SVM performed better in the non-BIO version (54.1% vs 53.2%). However, the standard metrics did not consider the effect of the multiple answers that the multi-label classifier is allowed to obtain.
As the results of our computational experiments show, while the performance of the multi-label classifier is encouraging, much more remains to be done in order to develop a practically viable automated ontology-based knowledge management system. (C) 2017 Elsevier Ltd. All rights reserved.
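As an illustration of the BIO notation mentioned in the abstract, the following minimal Python sketch converts labeled token spans into B-/I-/O tags and then collapses them into a non-BIO version. The function name `to_bio`, the example sentence, and the class labels `Drug` and `Process` are assumptions made for illustration; they are not taken from the paper or from HOLMES.

```python
def to_bio(tokens, spans):
    """Tag each token as B-<label>, I-<label>, or O.

    spans: list of (start, end, label) token-index triples, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence and annotations (not from the paper's data set).
tokens = ["Acetylsalicylic", "acid", "is", "processed", "by", "wet", "granulation"]
spans = [(0, 2, "Drug"), (5, 7, "Process")]

bio_tags = to_bio(tokens, spans)

# Non-BIO version: drop the B-/I- prefix so each token carries only its class.
non_bio_tags = [t if t == "O" else t.split("-", 1)[1] for t in bio_tags]

print(list(zip(tokens, bio_tags)))
print(list(zip(tokens, non_bio_tags)))
```

In the non-BIO version each token carries only its class, so the boundary between two adjacent entities of the same class cannot be recovered from the tags alone.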
Pages: 49-60
Page count: 12