COMPARATIVE STUDY OF TOPIC SEGMENTATION ALGORITHMS BASED ON LEXICAL COHESION: EXPERIMENTAL RESULTS ON ARABIC LANGUAGE

Cited by: 0
Authors
Harrag, Fouzi [1]
Hamdi-Cherif, Aboubekeur [2]
Al-Salman, Abdulmalik Salman [3]
Affiliations
[1] Farhat Abbas Univ, Dept Comp Sci, Setif 19000, Algeria
[2] Qassim Univ, Dept Comp Sci, Buraydah 51452, Saudi Arabia
[3] King Saud Univ, Coll Comp & Informat Sci, Riyadh 11543, Saudi Arabia
Keywords
natural language processing; Arabic language processing; information retrieval; topic segmentation; text tiling algorithm; C99; algorithm; TEXT;
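DOI
Not available
Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract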
Topic segmentation is essential for many Natural Language Processing (NLP) applications, such as text summarization and information extraction. The objective of this research is to evaluate the effectiveness of topic segmentation algorithms in identifying thematic breaks in Arabic texts. To this end, a group of seven readers was asked to identify the changes of theme they discerned in five Arabic texts from different domains. The resulting judgments are used to evaluate the relative performance of two of the main segmentation algorithms proposed in the literature, C99 and Text Tiling, using the classical Recall/Precision evaluation metrics and the recently introduced Reader Judgment method. The experimental results show that, with only a few improvements, existing algorithms for segmenting English texts are also efficient for segmenting Arabic texts.
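The Recall/Precision evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact-match criterion for boundaries, and the majority-vote rule for merging the readers' annotations are all assumptions made for illustration, and the sketch does not reproduce the Reader Judgment method.

```python
from collections import Counter


def majority_boundaries(reader_annotations, min_votes):
    """Merge several readers' boundaries into one reference set.

    Assumption: a boundary is kept when at least `min_votes` readers
    marked it. This aggregation rule is hypothetical.
    """
    votes = Counter(b for boundaries in reader_annotations for b in set(boundaries))
    return {b for b, n in votes.items() if n >= min_votes}


def boundary_precision_recall(predicted, reference):
    """Precision and recall of predicted topic boundaries.

    Boundaries are sentence indices after which a topic break is placed.
    Assumption: only exact index matches count as hits.
    """
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Made-up annotations from seven readers for a single text.
readers = [[3, 7, 12], [3, 8, 12], [3, 7], [4, 7, 12], [3, 7, 12], [7, 12], [3, 12]]
reference = majority_boundaries(readers, min_votes=4)

# Made-up output of a segmentation algorithm on the same text.
precision, recall = boundary_precision_recall([3, 7, 10, 12], reference)
print(precision, recall)
```

Exact matching is strict: a boundary placed one sentence away from the reference counts as a full miss, which is one reason segmentation work often reports tolerance-based measures alongside precision and recall.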
Pages: 183-202
Page count: 20