Text Simplification Using Transformer and BERT

Cited by: 4

Authors:

Alissa, Sarah [1]

Wald, Mike [2]

Affiliations:

[1] Imam Abdulrahman Bin Faisal Univ, Coll Comp Sci & Informat Technol, Dammam, Saudi Arabia

[2] Univ Southampton, Sch Elect & Comp Sci, Southampton, England

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2023, Vol. 75, No. 2

Keywords:

Text simplification; neural machine translation; transformer

DOI:

10.32604/cmc.2023.033647

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Reading and writing are the main ways of interacting with web content. Text simplification tools help people with cognitive impairments, new language learners, and children, who may find complex web content difficult to understand. Text simplification is the process of changing complex text into more readable and understandable text. Recent approaches to text simplification adopt the machine translation concept to learn simplification rules from a parallel corpus of complex and simple sentences. In this paper, we propose two models based on the transformer, an encoder-decoder structure that achieves state-of-the-art (SOTA) results in machine translation. The training process for our models includes three steps: preprocessing the data using a subword tokenizer, training the model and optimizing it using the Adam optimizer, and then using the model to decode the output. The first model uses the transformer only; the second integrates Bidirectional Encoder Representations from Transformers (BERT) as the encoder to improve training time and results. The transformer-only model was evaluated using the Bilingual Evaluation Understudy (BLEU) score and achieved 53.78 on the WikiSmall dataset. In the experiment on the second model, the validation loss decreased much faster than for the model without BERT. However, its BLEU score was lower (44.54), which could be due to the small size of the dataset, causing the model to overfit and generalize poorly. Therefore, future work on the second model could involve experimenting with a larger dataset such as WikiLarge. In addition, further analysis of the models' results and of the dataset was carried out using different evaluation metrics to understand their performance.
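To make the BLEU figures in the abstract easier to interpret, the following is a minimal sketch of corpus-level BLEU in plain Python. It is not the authors' evaluation script: the function names `ngrams` and `corpus_bleu` are hypothetical, whitespace tokenization is assumed, only one reference per output is used, and no smoothing is applied. Standard tools may tokenize differently, so scores from this sketch are not directly comparable to the reported 53.78 and 44.54.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (illustrative only): one reference per hypothesis,
    pre-tokenized inputs, uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize outputs shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    simplified = ["the cat sat on the mat".split()]
    reference = ["the cat sat on a mat".split()]
    print(round(corpus_bleu(simplified, reference), 2))
```

Because BLEU rewards n-gram overlap with the reference, a simplification system that copies most of its input can still score well when the reference is close to the source, which is one reason to pair BLEU with other evaluation metrics.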


Pages: 3479-3495

Number of pages: 17
