Automatic construction of English/Chinese parallel corpora

Cited by: 30

Authors:

Yang, CC [1]

Li, KW [1]

Affiliations:

[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Sha Tin, Hong Kong, Peoples R China

Source:

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2003 / Vol. 54 / No. 08

DOI:

10.1002/asi.10261

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries that exist between different languages. However, a general-purpose dictionary is less sensitive to both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method based on dynamic programming is presented to identify one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level, and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. As one word in a language may translate into two or more words repetitively in another language, the edit operation of deletion is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles of the Hong Kong SAR government as the test bed. The precision of the result is 0.998, while the recall is 0.806. The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited are also used to test the method; there, the precision is 1.00 and the recall is 0.948.
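
The abstract names the longest common subsequence (LCS) and dynamic programming as building blocks of the alignment method. The following is only a generic, textbook sketch of character-level LCS, not the authors' implementation: the function name `lcs` and the sample strings are hypothetical, and the paper's score function and deletion step are not reproduced here.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of strings a and b.

    Generic dynamic-programming LCS; an illustration only, not the
    alignment procedure described in the paper.
    """
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right cell to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    # Hypothetical strings: a full phrase and a shortened form of it.
    print(lcs("香港特別行政區", "香港特區"))
```

Because Python strings are sequences of Unicode code points, the same function works character by character on Chinese text, which is the granularity the abstract's character-level alignment refers to.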

Pages: 730-742

Page count: 13

