Multi-domain evaluation framework for named entity recognition tools

Cited by: 10
Authors
Abdallah, Zahraa S. [1]
Carman, Mark [1]
Haffari, Gholamreza [1]
Affiliations
[1] Monash Univ, Sch Informat Technol, Clayton, Vic, Australia
Keywords
Named entity recognition; Multi-domain evaluation; Qualitative data analysis; Benchmark evaluation
DOI
10.1016/j.csl.2016.10.003
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Extracting structured information from unstructured text is important for qualitative data analysis. Leveraging NLP techniques for qualitative data analysis can accelerate the annotation process, allow for large-scale analysis, and provide more insight into the text to improve performance. The first step in gaining insight from text is Named Entity Recognition (NER). A significant challenge that directly affects NER performance is the domain diversity of qualitative data: text varies by domain in many respects, including taxonomies, length, formality and format. In this paper we discuss and analyse the performance of state-of-the-art tools across domains to assess their robustness and reliability. To do so, we developed a standard, expandable and flexible framework for analysing and testing tool performance using corpora that represent text from various domains. We performed an extensive analysis and comparison of the tools across these domains and from several perspectives. The resulting comparison and analysis provide a holistic picture of the state-of-the-art tools. © 2016 Elsevier Ltd. All rights reserved.
Pages: 34-55
Number of pages: 22