Large-scale online semantic indexing of biomedical articles via an ensemble of multi-label classification models

Cited by: 12
Authors
Papanikolaou, Yannis [1]
Tsoumakas, Grigorios [1]
Laliotis, Manos [2]
Markantonatos, Nikos [3]
Vlahavas, Ioannis [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Dept Comp Sci, Thessaloniki 54124, Greece
[2] Atypon, 5201 Great America Pkwy Suite 510, Santa Clara, CA 95054 USA
[3] Atypon Hellas, Dimitrakopoulou 7, Athens 15341, Greece
Source
JOURNAL OF BIOMEDICAL SEMANTICS | 2017 / Vol. 8
Keywords
Semantic indexing; Multi-label ensemble; Machine learning; BioASQ; Supervised learning; Multi-label learning;
DOI
10.1186/s13326-017-0150-0
Chinese Library Classification number
Q [Biological sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Background: In this paper we present the approach that we employed to deal with large-scale multi-label semantic indexing of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge (2013-2017), a challenge concerned with biomedical semantic indexing and question answering. Methods: Our main contribution is a MUlti-Label Ensemble method (MULE) that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Secondary contributions include a study of the temporal aspects of the BioASQ corpus (the observations also apply to BioASQ's superset, the PubMed article collection) and the proper parametrization of the algorithms used to deal with this challenging classification task. Results: The ensemble method that we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus, giving positive results. In our participation in the BioASQ challenge we obtained first place in 2013 and second place in the four following years, steadily outperforming MTI, the indexing system of the National Library of Medicine (NLM). Conclusions: The results of our experimental comparisons suggest that employing a statistical significance test to validate the ensemble method's choices is the optimal approach for ensembling multi-label classifiers, especially in contexts with many rare labels.
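As a rough illustration of the kind of check the abstract describes, the sketch below applies a McNemar test to the paired predictions of two classifiers for a single label and switches away from a default model only when the difference is significant. It is a minimal sketch under stated assumptions, not the authors' MULE implementation: the function names, the exact binomial variant of the test, the per-label selection rule, and the 0.05 threshold are all hypothetical choices made for this example.

```python
from math import comb


def mcnemar_exact_p(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same examples."""
    # Only discordant pairs matter: b = A right and B wrong, c = A wrong and B right.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(b, c)
    # Under the null hypothesis the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pick_model_for_label(y_true, pred_default, pred_candidate, alpha=0.05):
    """Hypothetical rule: use the candidate only if it is significantly better."""
    correct_default = sum(1 for t, p in zip(y_true, pred_default) if p == t)
    correct_candidate = sum(1 for t, p in zip(y_true, pred_candidate) if p == t)
    p_value = mcnemar_exact_p(y_true, pred_default, pred_candidate)
    if correct_candidate > correct_default and p_value < alpha:
        return "candidate"
    return "default"


if __name__ == "__main__":
    # Toy validation data for one label (made up for illustration).
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    pred_default = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    pred_candidate = list(y_true)
    print(pick_model_for_label(y_true, pred_default, pred_candidate))
```

With rare labels the number of discordant pairs is small, so a candidate that looks better on a validation set often fails the significance check and the default model is kept; this is the intuition behind using a test rather than raw validation scores.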
Pages: 13