Biomedical-domain pre-trained language model for extractive summarization

Cited by: 35
Authors
Du, Yongping [1 ]
Li, Qingxiao [1 ]
Wang, Lulin [1 ]
He, Yanqing [2 ]
Affiliations
[1] Beijing Univ Technol, Fac Informat Technol, Beijing 100124, Peoples R China
[2] Inst Sci & Tech Informat China, Beijing 100038, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Extractive biomedical summarization; Document representation; Pre-trained language model; Fine-tuning; TEXT;
DOI
10.1016/j.knosys.2020.105964
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, the performance of deep neural networks on the extractive summarization task has improved significantly over traditional methods. In biomedical extractive summarization, however, existing methods make poor use of domain-aware external knowledge, and existing deep neural network models ignore the structural features of the document. In this paper, we propose a novel model, BioBERTSum, to better capture token-level and sentence-level contextual representations. It uses a domain-aware bidirectional language model pre-trained on large-scale biomedical corpora as the encoder, and further fine-tunes the language model for the extractive summarization task on single biomedical documents. In particular, we adopt a sentence position embedding mechanism that enables the model to learn the position of each sentence and thereby capture the structural features of the document. To the best of our knowledge, this is the first work to apply a pre-trained language model with a fine-tuning strategy to extractive summarization in the biomedical domain. Experiments on the PubMed dataset show that the proposed model outperforms the recent state-of-the-art (SOTA) model on ROUGE-1, ROUGE-2 and ROUGE-L. (C) 2020 Elsevier B.V. All rights reserved.
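To make the described architecture concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the BioBERTSum idea from the abstract: a biomedical pre-trained encoder yields one vector per sentence, a learned sentence position embedding injects document structure, and a linear scorer ranks sentences for extraction. The checkpoint name (dmis-lab/biobert-base-cased-v1.1), the use of per-sentence [CLS]-style tokens, and all tensor shapes are assumptions for illustration.

```python
# Hypothetical sketch of a BioBERTSum-style extractive summarizer,
# assuming a BioBERT checkpoint from Hugging Face and one leading
# [CLS]-style token inserted before every sentence of the document.
import torch
import torch.nn as nn
from transformers import AutoModel


class BioBERTSumSketch(nn.Module):
    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 max_sentences=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Sentence position embedding: encodes where each sentence sits
        # in the document (the structural feature noted in the abstract).
        self.sent_pos_emb = nn.Embedding(max_sentences, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # input_ids / attention_mask: (batch, seq_len) token inputs.
        # cls_positions: (batch, n_sents) indices of each sentence's
        # leading [CLS]-style token within input_ids.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        sent_vecs = hidden[batch_idx, cls_positions]            # (batch, n_sents, hidden)
        pos = torch.arange(cls_positions.size(1), device=hidden.device)
        sent_vecs = sent_vecs + self.sent_pos_emb(pos)          # add structural signal
        return self.scorer(sent_vecs).squeeze(-1)               # (batch, n_sents) logits
```

Training such a scorer would typically minimize binary cross-entropy against oracle sentence labels and select the top-scoring sentences as the summary; those details follow the paper itself rather than this sketch.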
Pages: 9