Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation

Cited by: 21
Authors
Ferraro, Jeffrey P. [1, 2]
Daume, Hal, III [3]
DuVall, Scott L. [4, 5]
Chapman, Wendy W. [6]
Harkema, Henk [7]
Haug, Peter J. [1, 2]
Affiliations
[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT USA
[2] Intermt Healthcare, Homer Warner Ctr Informat Res, Salt Lake City, UT USA
[3] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[4] Univ Utah, Dept Internal Med, Salt Lake City, UT USA
[5] VA Salt Lake City Healthcare Syst, Salt Lake City, UT USA
[6] Univ Calif San Diego, Dept Biomed Informat, La Jolla, CA 92093 USA
[7] Nuance Commun, Pittsburgh, PA USA
Keywords
Natural Language Processing; NLP; POS Tagging; Domain Adaptation; Clinical Narratives; SAMPLE SELECTION; SYSTEM; TEXT; CORPUS
DOI
10.1136/amiajnl-2012-001453
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives.
Methods: Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt.
Results: The evaluated POS taggers drop in accuracy by 8.5-15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3-91.0% on clinical texts. ClinAdapt reports 93.2-93.9%.
Conclusions: ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
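The abstract names Easy Adapt as the baseline but does not describe how it works. As commonly described in the domain adaptation literature, it is a feature-augmentation scheme: every feature is duplicated into a shared copy and a domain-specific copy, so a standard learner can weight general and domain-specific evidence separately. The sketch below illustrates that idea only; the function name `easy_adapt`, the `general::`/`source::`/`target::` prefixes, and the toy token features are illustrative assumptions, not details taken from this paper.

```python
from typing import Dict


def easy_adapt(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return an augmented feature map with a shared and a domain-specific copy.

    `domain` is a label such as "source" (e.g. general English) or
    "target" (e.g. clinical narratives). Naming is hypothetical.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general::{name}"] = value   # copy shared by all domains
        augmented[f"{domain}::{name}"] = value  # copy seen only in this domain
    return augmented


# Toy token features (made up for illustration).
source_token = {"word=discharge": 1.0, "suffix=ge": 1.0}
target_token = {"word=discharge": 1.0, "prev=pt": 1.0}

src = easy_adapt(source_token, "source")
tgt = easy_adapt(target_token, "target")

# Only the shared copy of a feature present in both domains overlaps.
shared = set(src) & set(tgt)
print(sorted(shared))
```

A classifier trained on the augmented maps can learn one weight for `general::word=discharge` from both corpora while letting `target::word=discharge` capture clinical-specific usage from the smaller annotated target set.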
Pages: 931-939
Page count: 9
Related papers
50 in total
  • [31] Improving Part-of-Speech Tagging of Historical Text by First Translating to Modern Text
    Sang, Erik Tjong Kim
    COMPUTATIONAL HISTORY AND DATA-DRIVEN HUMANITIES, CHDDH 2016, 2016, 482 : 54 - 64
  • [32] Joint Part-of-Speech and Language ID Tagging for Code-Switched Data
    Soto, Victor
    Hirschberg, Julia
    COMPUTATIONAL APPROACHES TO LINGUISTIC CODE-SWITCHING, 2018, : 1 - 10
  • [33] Time Series Neural Network Model for Part-of-Speech Tagging Indonesian Language
    Tanadi, Theo
    INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND DIGITAL APPLICATIONS (ICITDA 2017), 2018, 325
  • [34] The Use of Part-of-Speech Tagging on E-Newspaper in Improving Grammar Teaching Pedagogy
    Omar, Ruzana
    Yusoff, Sarah
    Ab Rashid, Radzuwan
    Mohamad, Azweed
    Yunus, Kamariah
    INTERNATIONAL JOURNAL OF ENGLISH LINGUISTICS, 2018, 8 (07) : 1 - 6
  • [35] Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches
    Dalai, Tusarkanta
    Mishra, Tapas Kumar
    Sa, Pankaj K.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
  • [36] Exploring and categorising the Arabic copula and auxiliary kana through enhanced part-of-speech tagging
    Hardie, Andrew
    Ibrahim, Wesam
    CORPORA, 2021, 16 (03) : 305 - 335
  • [37] Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
    Ball, Kelsey
    Garrette, Dan
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3084 - 3089
  • [38] Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging
    Sajjad, Hassan
    Dalvi, Fahim
    Durrani, Nadir
    Abdelali, Ahmed
    Belinkov, Yonatan
    Vogel, Stephan
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 601 - 607
  • [39] A TENGRAM method based part-of-speech tagging of multi-category words in Hindi language
    Gupta, J. P.
    Tayal, Devendra K.
    Gupta, Arti
    EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (12) : 15084 - 15093
  • [40] A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging
    Sawalha, Majdi
    Atwell, Eric
    WORD STRUCTURE, 2013, 6 (01) : 43 - 99