Arabic abstractive text summarization using RNN-based and transformer-based architectures

Cited by: 33
Authors
Bani-Almarjeh, Mohammad [1]
Kurdy, Mohamad-Bassam [1,2,3]
Affiliations
[1] Syrian Virtual Univ, Damascus, Syria
[2] ESC Rennes Sch Business, Rennes, France
[3] Burgundy Sch Business Dijon, Dijon, France
Keywords
Natural language processing; Deep learning; Transfer learning; Text summarization
DOI
10.1016/j.ipm.2022.103227
CLC number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Recently, the Transformer model architecture and pre-trained Transformer-based language models have shown impressive performance on both natural language understanding and text generation tasks. Nevertheless, little research has been done on using these models for text generation in Arabic. This research aims at leveraging and comparing the performance of different model architectures, including RNN-based and Transformer-based ones, and different pre-trained language models, including mBERT, AraBERT, AraGPT2, and AraT5, for Arabic abstractive summarization. We first built an Arabic summarization dataset of 84,764 high-quality text-summary pairs. To use mBERT and AraBERT in the context of text summarization, we employed a BERT2BERT-based encoder-decoder model in which both the encoder and decoder are initialized with the respective model weights. The proposed models were evaluated using ROUGE metrics and manual human evaluation, and we also compared their performance on out-of-domain data. Our pre-trained Transformer-based models give a large improvement in performance with ~79% less data. We found that AraT5 scores ~3 ROUGE points higher than a BERT2BERT-based model initialized with AraBERT, indicating that a pre-trained encoder-decoder Transformer is more suitable for summarizing Arabic text. Both of these models also outperform AraGPT2 by a clear margin, which we found to produce summaries with high readability but relatively lower quality. On the other hand, we found that both AraT5 and AraGPT2 are better at summarizing out-of-domain text. We released our models and dataset publicly.
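As a concrete illustration of the warm-start strategy the abstract describes, the sketch below builds a BERT2BERT encoder-decoder from AraBERT weights using the Hugging Face Transformers library. This is a minimal sketch under stated assumptions, not the authors' released code: the exact AraBERT checkpoint name, input/output lengths, and generation settings are assumptions not taken from the paper.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Assumed checkpoint; the paper uses AraBERT, but the exact variant is not stated here.
ARABERT = "aubmindlab/bert-base-arabertv2"

tokenizer = AutoTokenizer.from_pretrained(ARABERT)

# Warm-start: both encoder and decoder are initialized from the same AraBERT
# weights; the decoder's cross-attention layers are newly initialized and must
# be learned during fine-tuning on the text-summary pairs.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ARABERT, ARABERT)

# Wire up the special tokens BERT lacks for sequence-to-sequence generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# After fine-tuning, summarization is a standard generate() call.
article = "..."  # an Arabic source document
batch = tokenizer(article, max_length=512, truncation=True, return_tensors="pt")
summary_ids = model.generate(batch.input_ids,
                             attention_mask=batch.attention_mask,
                             max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

The generated summaries can then be scored against the reference summaries with any standard ROUGE implementation, matching the automatic evaluation the abstract mentions.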
Pages: 18