Data Sampling and Supervised Learning for HIV Literature Screening

Cited by: 17
Authors
Almeida, Hayda [1]
Meurs, Marie-Jean [2,3]
Kosseim, Leila [4]
Tsang, Adrian [1]
Affiliations
[1] Concordia Univ, CSFG, Montreal, PQ, Canada
[2] Univ Quebec, Dept Comp Sci, Montreal, PQ, Canada
[3] CSFG, Montreal, PQ, Canada
[4] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada
Keywords
Artificial intelligence; health information management; HIV; machine learning; text classification; triage; systematic reviews; imbalanced data
DOI
10.1109/TNB.2016.2565481
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
This paper presents a supervised learning approach to support the screening of HIV literature. The manual screening of biomedical literature is an important task in the process of systematic reviews. Researchers and curators have the very demanding, time-consuming, and error-prone task of manually identifying documents that should be included in a systematic review concerning a specific problem. We developed a supervised learning approach to support screening tasks, by automatically flagging potentially relevant documents from a list retrieved by a literature database search. To overcome the main issues associated with the automatic literature screening task, we evaluated the use of data sampling, feature combinations, and feature selection methods, generating a total of 105 classification models. The models yielding the best results were composed of a Logistic Model Trees classifier, a fairly balanced training set, and feature combination of Bag-Of-Words and MeSH terms. According to our results, the system correctly labels the great majority of relevant documents, making it usable to support HIV systematic reviews to allow researchers to assess a greater number of documents in less time.
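The sampling and feature-combination steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes random undersampling of the non-relevant class as one way to obtain a fairly balanced training set (the abstract does not say which sampling method was used), and it assumes each document is a hypothetical `(text, mesh_terms)` pair. The function names, the `ratio` parameter, and the `bow=`/`mesh=` feature prefixes are all illustrative. Classifier training is omitted.

```python
import random
from collections import Counter


def undersample(docs, labels, ratio=1.0, seed=0):
    """Randomly drop non-relevant (label 0) documents until at most
    `ratio` of them remain per relevant (label 1) document.

    Assumption: random undersampling is only one possible sampling
    strategy; the abstract does not specify the method used.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, keep_neg))
    return [docs[i] for i in kept], [labels[i] for i in kept]


def combined_features(text, mesh_terms):
    """Combine Bag-of-Words counts with one binary feature per MeSH term.

    The prefixes keep the two feature spaces from colliding when a word
    in the text is also a MeSH term.
    """
    feats = Counter("bow=" + tok for tok in text.lower().split())
    for term in mesh_terms:
        feats["mesh=" + term] = 1
    return dict(feats)


# Toy, made-up data: one relevant document among five.
docs = [
    ("hiv treatment trial", ["HIV Infections"]),
    ("soil bacteria survey", ["Soil Microbiology"]),
    ("fungal enzyme assay", ["Fungi"]),
    ("plant genome study", ["Genome, Plant"]),
    ("protein folding model", ["Protein Folding"]),
]
labels = [1, 0, 0, 0, 0]

train_docs, train_labels = undersample(docs, labels, ratio=1.0)
train_features = [combined_features(text, mesh) for text, mesh in train_docs]
```

The resulting feature dictionaries could then be vectorized and passed to any classifier. Logistic Model Trees, the classifier named in the abstract, is available in Weka.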
Pages: 354-361
Number of pages: 8