Turkish abstractive text summarization using pretrained sequence-to-sequence models

Cited by: 10
Authors
Baykara, Batuhan [1 ]
Gungor, Tunga [1 ]
Affiliations
[1] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
Keywords
Abstractive text summarization; News title generation; Pretrained sequence-to-sequence models
DOI
10.1017/S1351324922000195
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The tremendous increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study, gaining significant attention from researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models, such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements address certain shortcomings of neural summarization with respect to challenges such as saliency, fluency, and semantics, enabling the generation of higher-quality summaries. Unfortunately, these research attempts have been mostly limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently, providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. We then utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the choice of input to the models is of substantial importance for the success of such tasks. Additionally, we provide an extensive analysis of the models, including cross-dataset evaluations, various text generation options, and the effect of preprocessing on ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all datasets. Lastly, qualitative evaluations of the summaries and titles generated by the models are provided.
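The abstract does not give implementation details, so the snippet below is a minimal sketch only: it shows how a multilingual pretrained Seq2Seq checkpoint can be used to generate a Turkish summary with Hugging Face Transformers and how ROUGE can be computed with a simple Turkish-aware preprocessing step. The checkpoint name google/mt5-base, the normalize_tr helper, and the decoding parameters are illustrative assumptions, not the authors' setup; in practice the model must first be fine-tuned on a dataset such as TR-News or MLSum. The preprocessing example reflects the fact that naive ASCII lowercasing mishandles Turkish dotted İ and dotless I, which can change token matches and hence ROUGE scores.

```python
# Illustrative sketch (not the paper's code): Turkish abstractive summarization
# with a multilingual pretrained Seq2Seq model, plus ROUGE with preprocessing.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

MODEL_NAME = "google/mt5-base"  # placeholder; assumes fine-tuning on Turkish news first
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def summarize(article: str, max_input_len: int = 512, max_summary_len: int = 128) -> str:
    """Generate an abstractive summary (or a title) with beam search."""
    inputs = tokenizer(article, truncation=True, max_length=max_input_len,
                       return_tensors="pt")
    output_ids = model.generate(**inputs,
                                num_beams=4,
                                max_length=max_summary_len,
                                no_repeat_ngram_size=3,
                                early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def normalize_tr(text: str) -> str:
    """Toy Turkish-aware lowercasing (maps dotted/dotless I correctly) used here
    only to illustrate how preprocessing can affect ROUGE for Turkish."""
    return text.replace("İ", "i").replace("I", "ı").lower()

def rouge_l(reference: str, prediction: str, preprocess: bool = True) -> float:
    """ROUGE-L F1 with optional normalization of both strings."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    if preprocess:
        reference, prediction = normalize_tr(reference), normalize_tr(prediction)
    return scorer.score(reference, prediction)["rougeL"].fmeasure

if __name__ == "__main__":
    article = "..."            # a Turkish news article body
    reference_summary = "..."  # the gold summary (or title, for title generation)
    system_summary = summarize(article)
    print(system_summary)
    print("ROUGE-L:", rouge_l(reference_summary, system_summary))
```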
Pages: 1275-1304
Number of pages: 30