English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT

Cited by: 0
Authors
Alammar, Mai [1 ,2 ]
El Hindi, Khalil [1 ]
Al-Khalifa, Hend [1 ]
Affiliations
[1] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11451, Saudi Arabia
[2] Imam Mohammad Ibn Saud Islamic Univ IMSIU, Coll Comp & Informat Sci, Comp Sci Dept, Riyadh 11564, Saudi Arabia
Keywords
text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity; transfer learning; SEGMENTATION;
DOI
10.3390/computation13060151
CLC Classification
O1 [Mathematics];
Discipline Codes
0701; 070101;
Abstract
Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic chunking, based on similarities between sentence embeddings, with pre-trained language models (PLMs), specifically BERT fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, inspired by an existing English dataset, to evaluate our proposed method. Our experiments show that exploiting BERT fine-tuned on STS improves over unsupervised semantic chunking by an average of 7.4 points in the Pk metric and 11.19 points in the WindowDiff metric on four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset.
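The unsupervised half of the approach and the two evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sentence embeddings are toy stand-ins for (fine-tuned) BERT outputs, the similarity threshold is an assumed hyperparameter, and Pk and WindowDiff follow the common boundary-string formulation from the segmentation literature (Beeferman et al.; Pevzner and Hearst).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_boundaries(embeddings, threshold=0.5):
    """Unsupervised chunking: place a boundary after sentence i whenever the
    similarity between consecutive sentence embeddings drops below threshold."""
    return [i for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < threshold]

def pk(ref, hyp, k):
    """Pk metric. `ref`/`hyp` are boundary strings ('1' = boundary after that
    sentence); slide a window of size k and count disagreements on whether
    the window contains a boundary. Lower is better."""
    n = len(ref)
    errors = sum(('1' in ref[i:i + k]) != ('1' in hyp[i:i + k])
                 for i in range(n - k))
    return errors / (n - k)

def window_diff(ref, hyp, k):
    """WindowDiff: like Pk, but compares the *number* of boundaries inside
    each window, penalizing near-misses more evenly. Lower is better."""
    n = len(ref)
    errors = sum(ref[i:i + k].count('1') != hyp[i:i + k].count('1')
                 for i in range(n - k))
    return errors / (n - k)

# Toy example: four "sentences" whose embeddings fall into two topics,
# so a single boundary should be detected between sentences 1 and 2.
emb = [np.array([1.0, 0.0]), np.array([1.0, 0.0]),
       np.array([0.0, 1.0]), np.array([0.0, 1.0])]
print(chunk_boundaries(emb))   # -> [1]
print(pk("0100", "0100", 2))   # perfect hypothesis -> 0.0
```

In the paper's setting the embeddings would come from BERT, and the contribution is that fine-tuning BERT on STS makes the consecutive-sentence similarities sharper, which is what drives the reported Pk/WindowDiff gains.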
Pages: 37