Unsupervised statistical text simplification using pre-trained language modeling for initialization

Cited: 9
Authors
Qiang, Jipeng [1]
Zhang, Feng [1]
Li, Yun [1]
Yuan, Yunhao [1]
Zhu, Yi [1]
Wu, Xindong [2,3]
Affiliations
[1] Yangzhou Univ, Dept Comp Sci, Yangzhou 225127, Jiangsu, Peoples R China
[2] Hefei Univ Technol, Minist Educ, Key Lab Knowledge Engn Big Data, Hefei 230009, Peoples R China
[3] Mininglamp Acad Sci, Mininglamp, Beijing 100089, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
text simplification; pre-trained language modeling; BERT; word embeddings
DOI
10.1007/s11704-022-1244-0
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. Recently, an unsupervised statistical text simplification method based on a phrase-based machine translation system (UnsupPBMT) achieved good performance; it initializes the phrase tables with similar words obtained from word embedding models. Because word embedding models capture only the relatedness between words, the phrase tables in UnsupPBMT contain many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base to predict similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks and even outperforms some supervised baselines.
Pages: 10
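The abstract's core idea, using a masked language model to propose context-aware similar words rather than relying on context-free word embeddings, can be illustrated with a short sketch. The snippet below is a minimal illustration and not the authors' implementation: the model name bert-base-uncased, the helper similar_words, the example sentence, and the top-k cutoff are assumptions chosen purely for demonstration.

```python
# Minimal sketch: querying BERT's masked-LM head for context-aware
# substitution candidates. Not the authors' implementation; the model
# name, example sentence, and top_k value are illustrative assumptions.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def similar_words(sentence, target, top_k=10):
    """Mask `target` in `sentence` and return BERT's top-k in-context predictions."""
    masked = sentence.replace(target, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
    # Find the [MASK] position and rank the vocabulary by its logits.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    tokens = tokenizer.convert_ids_to_tokens(top_ids)
    # Keep whole-word candidates and drop the original target word itself.
    return [t for t in tokens if not t.startswith("##") and t.lower() != target.lower()]

print(similar_words("The cat perched on the mat.", "perched"))
# Possible output (model dependent): ['sat', 'lay', 'stood', 'rested', ...]
```

Unlike a context-free embedding neighbourhood, which also returns related but non-substitutable words, the masked-LM predictions above are constrained to fit the surrounding sentence, which is the motivation the abstract gives for the BERT-based initialization.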