Unsupervised statistical text simplification using pre-trained language modeling for initialization

Cited by: 9
Authors
Qiang, Jipeng [1 ]
Zhang, Feng [1 ]
Li, Yun [1 ]
Yuan, Yunhao [1 ]
Zhu, Yi [1 ]
Wu, Xindong [2 ,3 ]
Affiliations
[1] Yangzhou Univ, Dept Comp Sci, Yangzhou 225127, Jiangsu, Peoples R China
[2] Hefei Univ Technol, Minist Educ, Key Lab Knowledge Engn Big Data, Hefei 23009, Peoples R China
[3] Mininglamp Acad Sci, Mininglamp, Beijing 100089, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
text simplification; pre-trained language modeling; BERT; word embeddings;
DOI
10.1007/s11704-022-1244-0
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline classification code
0812;
Abstract
Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. A recent unsupervised statistical text simplification system based on phrase-based machine translation (UnsupPBMT) achieved good performance by initializing its phrase tables with similar words obtained from word embedding modeling. However, because word embedding modeling only captures relatedness between words, the phrase tables in UnsupPBMT contain many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base to predict similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks and even outperforms some supervised baselines.
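The core idea in the abstract, using BERT's masked language modeling to propose similar-word candidates that can then seed phrase tables, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes the Hugging Face transformers library, the bert-base-uncased checkpoint, and an example sentence and top_k value chosen purely for illustration.

```python
from transformers import pipeline

# Minimal sketch: use BERT's masked-LM head to propose context-aware
# substitutes for a complex word. Model name, sentence, and top_k are
# illustrative assumptions, not the paper's actual configuration.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The committee will scrutinize the proposal before voting."
complex_word = "scrutinize"

# Mask the complex word in its sentence context; BERT then ranks
# replacement tokens by how well they fit the surrounding words,
# rather than by embedding relatedness alone.
masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)

for cand in fill_mask(masked, top_k=10):
    token = cand["token_str"].strip()
    if token.lower() != complex_word.lower():
        print(f"{token}\t{cand['score']:.4f}")
```

Because the predictions are conditioned on the full sentence, the candidate list tends to contain words that are substitutable in context, which is the property the paper exploits when replacing embedding-based initialization with BERT-based initialization.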
Pages: 10
Related papers
50 records in total
  • [21] A Pre-trained Clinical Language Model for Acute Kidney Injury
    Mao, Chengsheng
    Yao, Liang
    Luo, Yuan
    2020 8TH IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI 2020), 2020, : 531 - 532
  • [22] The Impact of Training Methods on the Development of Pre-Trained Language Models
    Uribe, Diego
    Cuan, Enrique
    Urquizo, Elisa
    COMPUTACION Y SISTEMAS, 2024, 28 (01): : 109 - 124
  • [23] Aspect Based Sentiment Analysis by Pre-trained Language Representations
    Liang, Tianxin
    Yang, Xiaoping
    Zhou, Xibo
    Wang, Bingqian
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019, : 1262 - 1265
  • [24] SsciBERT: a pre-trained language model for social science texts
    Shen, Si
    Liu, Jiangfeng
    Lin, Litao
    Huang, Ying
    Zhang, Lin
    Liu, Chang
    Feng, Yutong
    Wang, Dongbo
    SCIENTOMETRICS, 2023, 128 (02) : 1241 - 1263
  • [25] Impact of data quality for automatic issue classification using pre-trained language models
    Colavito, Giuseppe
    Lanubile, Filippo
    Novielli, Nicole
    Quaranta, Luigi
    JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 210
  • [26] Identifying Valid User Stories Using BERT Pre-trained Natural Language Models
    Scoggin, Sandor Borges
    Marques-Neto, Humberto Torres
    INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 3, WORLDCIST 2023, 2024, 801 : 167 - 177
  • [27] Quantifying Gender Bias in Arabic Pre-Trained Language Models
    Alrajhi, Wafa
    Al-Khalifa, Hend S.
    Al-Salman, Abdulmalik S.
    IEEE ACCESS, 2024, 12 : 77406 - 77420
  • [28] Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code
    Tagarelli, Andrea
    Simeri, Andrea
    ARTIFICIAL INTELLIGENCE AND LAW, 2022, 30 (03) : 417 - 473
  • [30] Text based personality prediction from multiple social media data sources using pre-trained language model and model averaging
    Christian, Hans
    Suhartono, Derwin
    Chowanda, Andry
    Zamli, Kamal Z.
    JOURNAL OF BIG DATA, 2021, 8 (01)