A categorical analysis of coreference resolution errors in biomedical texts

Cited by: 6
Authors
Choi, Miji [1 ,2 ]
Zobel, Justin [1 ]
Verspoor, Karin [1 ]
Affiliations
[1] Univ Melbourne, Dept Comp & Informat Syst, Melbourne, Vic, Australia
[2] Natl ICT Australia NICTA, Victoria Res Lab, Sydney, NSW, Australia
Funding
Australian Research Council
Keywords
Coreference resolution; Natural language processing; Text mining; Error analysis; Event extraction; Network
DOI
10.1016/j.jbi.2016.02.015
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Background: Coreference resolution is an essential task in information extraction from the published biomedical literature. It supports the discovery of complex information by linking referring expressions such as pronouns and appositives to their referents, which are typically entities that play a central role in biomedical events. Correctly establishing these links allows a detailed understanding of all the participants in events and makes it possible to connect events through their shared participants.
Results: As an initial step towards the development of a novel coreference resolution system for the biomedical domain, we have categorised the characteristics of coreference relations by type of anaphor as well as broader syntactic and semantic characteristics, and have compared the performance of a domain adaptation of a state-of-the-art general system to published results from domain-specific systems in terms of this categorisation. We have also developed a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems. Our results show that the domain-specific systems outperform the general system overall. Whilst this result is unsurprising, our proposed categorisation enables a detailed quantitative analysis of system performance. We identify limitations of each system and find that there remain important gaps in the state-of-the-art systems, which are clearly identifiable with respect to the categorisation.
Conclusion: We have analysed in detail the performance of existing coreference resolution systems for the biomedical literature and have demonstrated that there are clear gaps in their coverage. The approach developed in the general domain needs to be tailored for portability to the biomedical domain. The framework for class-based error analysis of existing systems that we propose helps to identify specific limitations of those systems, which in turn provides insights for further system development.
(C) 2016 Elsevier Inc. All rights reserved.
Pages: 309-318
Page count: 10