A Comparative Analysis of Active Learning for Biomedical Text Mining

Cited by: 29
Authors
Naseem, Usman [1]
Khushi, Matloob [1]
Khan, Shah Khalid [2]
Shaukat, Kamran [3]
Moni, Mohammad Ali [4]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
[2] RMIT Univ, Sch Engn, Carlton, Vic 3053, Australia
[3] Univ Newcastle, Sch Elect Engn & Comp, Newcastle, NSW 2308, Australia
[4] Univ New South Wales, Fac Med, WHO Ctr eHlth, UNSW Digital Hlth, Sydney, NSW 2052, Australia
Keywords
active learning; machine learning; biomedical natural language processing; named entity recognition; information extraction; classification; management; CRF
DOI
10.3390/asi4010023
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
An enormous amount of clinical free-text information, such as pathology reports, progress reports, clinical notes and discharge summaries, has been collected at hospitals and medical care clinics. These data provide an opportunity to develop many useful machine learning applications if they can be transformed into a learnable structure with appropriate labels for supervised learning. The annotation of these data has to be performed by qualified clinical experts, which limits their use because of the high cost of annotation. Active learning (AL), an underutilised machine learning technique that can label new data, is a promising candidate to address the high cost of labelling. AL has been successfully applied to labelling for speech recognition and text classification; however, there is a lack of literature investigating its use for clinical purposes. We performed a comparative investigation of various AL techniques using machine learning (ML) and deep learning (DL)-based strategies on three unique biomedical datasets. We investigated the random sampling (RS), least confidence (LC), informative diversity and density (IDD), margin, and maximum representativeness-diversity (MRD) AL query strategies. Our experiments show that AL has the potential to significantly reduce the cost of manual labelling. Furthermore, pre-labelling performed using AL expedites the labelling process by reducing the time required for labelling.
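As a rough illustration of three of the query strategies named in the abstract, the sketch below scores an unlabelled pool with random sampling (RS), least confidence (LC) and margin sampling, using their common textbook definitions. It is not the authors' implementation: the function names, the `select_queries` helper and the toy probabilities are all hypothetical, and IDD and MRD are left out because the abstract does not define them.

```python
import random


def least_confidence(probs):
    """Uncertainty as 1 minus the top class probability (higher = more uncertain)."""
    return 1.0 - max(probs)


def margin(probs):
    """Negated gap between the two most probable classes (higher = more uncertain).

    Assumes at least two classes.
    """
    top = sorted(probs, reverse=True)
    return -(top[0] - top[1])


def select_queries(pool_probs, k, strategy="lc", seed=0):
    """Return the indices of k pool items to send to the annotator.

    pool_probs: one list of predicted class probabilities per unlabelled item.
    strategy:   "rs" (random sampling), "lc" (least confidence) or "margin".
    """
    indices = list(range(len(pool_probs)))
    if strategy == "rs":
        # Baseline: ignore the model and pick items uniformly at random.
        return sorted(random.Random(seed).sample(indices, k))
    if strategy == "lc":
        score = least_confidence
    elif strategy == "margin":
        score = margin
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Most uncertain items first.
    ranked = sorted(indices, key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:k]


# Toy pool of four items with three-class predictions (made-up numbers).
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.60, 0.30, 0.10],
]
lc_picks = select_queries(pool, 2, "lc")
margin_picks = select_queries(pool, 2, "margin")
rs_picks = select_queries(pool, 2, "rs")
```

On this toy pool the two uncertainty strategies choose the same two items but rank them differently: LC looks only at the top probability, while margin looks at how close the runner-up class is.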
Pages: 18