Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

Cited by: 0

Authors:

Rasooli, Mohammad Sadegh [1]

Kashefi, Omid [1]

Minaei-Bidgoli, Behrouz [1]

Affiliations:

[1] Iran Univ Sci & Technol, Dept Comp Engn, Tehran, Iran

Source:

INFORMATION RETRIEVAL TECHNOLOGY | 2011 / Vol. 7097

Keywords:

Sentence Alignment; Paragraph Alignment; Parallel Corpus; Bilingual Corpus; Persian; English; Machine Translation; ALIGNMENT;

DOI:

Not available

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between the sentences and paragraphs of the source and target languages. We use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is based on length alone.
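The abstract does not give the feature formulas, the fitness function, or the genetic operators, so the following Python sketch is only an illustration of the general idea: three similarity features (length, punctuation, dictionary overlap) are combined in a weighted sum, and a small genetic algorithm searches for the weights. Every function name, the 0.5 decision threshold, the accuracy-based fitness, and the crossover and mutation settings are assumptions made for this example, not details taken from the paper.

```python
import random

# Assumed punctuation set, including common Persian marks.
PUNCT = ".,;:!?()،؛؟"


def ratio(a, b):
    """Similarity of two non-negative counts, in [0, 1]."""
    return min(a, b) / max(a, b) if max(a, b) else 1.0


def length_similarity(src, tgt):
    return ratio(len(src), len(tgt))


def punctuation_similarity(src, tgt):
    return ratio(sum(ch in PUNCT for ch in src), sum(ch in PUNCT for ch in tgt))


def dictionary_overlap(src, tgt, dictionary):
    """Share of source words found in the dictionary whose translation occurs in tgt."""
    hits = [w for w in src.lower().split() if w in dictionary]
    if not hits:
        return 0.0
    tgt_words = set(tgt.split())
    return sum(dictionary[w] in tgt_words for w in hits) / len(hits)


def features(src, tgt, dictionary):
    return [
        length_similarity(src, tgt),
        punctuation_similarity(src, tgt),
        dictionary_overlap(src, tgt, dictionary),
    ]


def score(weights, feats):
    """Linear model: weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, feats))


def fitness(weights, examples):
    """Accuracy on labeled pairs; a pair is predicted aligned if score >= 0.5."""
    correct = sum((score(weights, f) >= 0.5) == bool(y) for f, y in examples)
    return correct / len(examples)


def evolve(examples, n_weights=3, pop_size=30, generations=40, seed=0):
    """Toy genetic algorithm: keep the best half, refill by crossover and mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, examples), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_weights)      # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_weights)           # mutate one weight, clipped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, examples))
```

In this sketch, `examples` is a list of `(feature_vector, label)` pairs, where the label is 1 for a correctly aligned pair and 0 otherwise. Setting the punctuation and dictionary weights to zero reduces the model to a length-only scorer, which corresponds to the kind of baseline the abstract compares against.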

Citation

Pages: 574-583

Page count: 10
