Biomedical-domain pre-trained language model for extractive summarization

Cited by: 35
Authors
Du, Yongping [1 ]
Li, Qingxiao [1 ]
Wang, Lulin [1 ]
He, Yanqing [2 ]
Affiliations
[1] Beijing Univ Technol, Fac Informat Technol, Beijing 100124, Peoples R China
[2] Inst Sci & Tech Informat China, Beijing 100038, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Extractive biomedical summarization; Document representation; Pre-trained language model; Fine-tuning; TEXT;
DOI
10.1016/j.knosys.2020.105964
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, the performance of deep neural networks on the extractive summarization task has improved significantly over traditional methods. In biomedical extractive summarization, however, existing methods make poor use of domain-aware external knowledge, and existing deep neural network models ignore the structural features of the document. In this paper, we propose a novel model, BioBERTSum, to better capture token-level and sentence-level contextual representations. It uses a domain-aware bidirectional language model pre-trained on large-scale biomedical corpora as the encoder, and further fine-tunes the language model for the extractive summarization task on single biomedical documents. In particular, we adopt a sentence position embedding mechanism that enables the model to learn the position of each sentence and thereby capture the structural features of the document. To the best of our knowledge, this is the first work to apply a pre-trained language model with a fine-tuning strategy to extractive summarization in the biomedical domain. Experiments on the PubMed dataset show that the proposed model outperforms the recent state-of-the-art (SOTA) model on ROUGE-1, ROUGE-2 and ROUGE-L. (C) 2020 Elsevier B.V. All rights reserved.
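To make the described architecture concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the BioBERTSum idea from the abstract: a biomedical pre-trained encoder yields one vector per sentence, a learned sentence position embedding injects document structure, and a linear scorer ranks sentences for extraction. The checkpoint name (dmis-lab/biobert-base-cased-v1.1), the use of per-sentence [CLS]-style tokens, and all tensor shapes are assumptions for illustration.

```python
# Hypothetical sketch of a BioBERTSum-style extractive summarizer,
# assuming a BioBERT checkpoint from Hugging Face and one leading
# [CLS]-style token inserted before every sentence of the document.
import torch
import torch.nn as nn
from transformers import AutoModel


class BioBERTSumSketch(nn.Module):
    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 max_sentences=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Sentence position embedding: encodes where each sentence sits
        # in the document (the structural feature noted in the abstract).
        self.sent_pos_emb = nn.Embedding(max_sentences, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # input_ids / attention_mask: (batch, seq_len) token inputs.
        # cls_positions: (batch, n_sents) indices of each sentence's
        # leading [CLS]-style token within input_ids.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        sent_vecs = hidden[batch_idx, cls_positions]            # (batch, n_sents, hidden)
        pos = torch.arange(cls_positions.size(1), device=hidden.device)
        sent_vecs = sent_vecs + self.sent_pos_emb(pos)          # add structural signal
        return self.scorer(sent_vecs).squeeze(-1)               # (batch, n_sents) logits
```

Training such a scorer would typically minimize binary cross-entropy against oracle sentence labels and select the top-scoring sentences as the summary; those details follow the paper itself rather than this sketch.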
Pages: 9