Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

Cited by: 0
Authors
Rasooli, Mohammad Sadegh [1]
Kashefi, Omid [1]
Minaei-Bidgoli, Behrouz [1]
Affiliations
[1] Iran Univ Sci & Technol, Dept Comp Engn, Tehran, Iran
Source
INFORMATION RETRIEVAL TECHNOLOGY | 2011 / Vol. 7097
Keywords
Sentence Alignment; Paragraph Alignment; Parallel Corpus; Bilingual Corpus; Persian; English; Machine Translation; ALIGNMENT;
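DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract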
The task of sentence and paragraph alignment is essential for preparing parallel texts that are needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian is a challenging issue. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between sentences and paragraphs of the source and target languages. We apply a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
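The abstract describes a linear combination of three features (length similarity, dictionary matches, punctuation) with weights learned by a genetic algorithm. The following is a minimal Python sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the feature definitions, the three-entry toy dictionary `TOY_DICT`, the fitness function (fraction of source sentences whose top-scoring target matches the gold alignment), and the genetic operators (truncation selection, uniform crossover, Gaussian mutation).

```python
import random

# Hypothetical toy English -> Persian noun dictionary (illustrative only).
TOY_DICT = {"book": "کتاب", "house": "خانه", "city": "شهر"}
PUNCT = ".,;:!?؟،؛"

def length_similarity(src, tgt):
    """Ratio of the shorter to the longer character length, in [0, 1]."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def dictionary_overlap(src, tgt):
    """Fraction of dictionary nouns in src whose translation occurs in tgt."""
    nouns = [w.strip(PUNCT) for w in src.lower().split()]
    nouns = [w for w in nouns if w in TOY_DICT]
    if not nouns:
        return 0.0
    return sum(TOY_DICT[w] in tgt for w in nouns) / len(nouns)

def punctuation_similarity(src, tgt):
    """Ratio of the smaller to the larger punctuation count, in [0, 1]."""
    a = sum(ch in PUNCT for ch in src)
    b = sum(ch in PUNCT for ch in tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def score(weights, src, tgt):
    """Linear model: weighted sum of the three assumed features."""
    feats = (length_similarity(src, tgt),
             dictionary_overlap(src, tgt),
             punctuation_similarity(src, tgt))
    return sum(w * f for w, f in zip(weights, feats))

def fitness(weights, sources, targets, gold):
    """Fraction of sources whose best-scoring target is the gold one."""
    correct = 0
    for i, src in enumerate(sources):
        best = max(range(len(targets)),
                   key=lambda j: score(weights, src, targets[j]))
        correct += best == gold[i]
    return correct / len(sources)

def evolve(sources, targets, gold, pop_size=20, generations=30, seed=0):
    """Tiny genetic algorithm over non-negative weight vectors."""
    rng = random.Random(seed)
    fit = lambda w: fitness(w, sources, targets, gold)
    population = [[rng.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fit, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p, q)]          # uniform crossover
            child = [max(0.0, g + rng.gauss(0, 0.1)) for g in child]  # Gaussian mutation
            children.append(child)
        population = parents + children
    return max(population, key=fit)

# Toy data (made up for this sketch): GOLD[i] is the index of the target
# sentence aligned with SOURCES[i].
SOURCES = ["The book is in the house.", "City!"]
TARGETS = ["شهر!", "کتاب در خانه است."]
GOLD = [1, 0]
```

Calling `evolve(SOURCES, TARGETS, GOLD)` returns a three-element weight vector. A purely length-based baseline, as mentioned in the abstract, would correspond in this sketch to fixing the weights at `[1.0, 0.0, 0.0]`.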
Pages: 574-583
Page count: 10