Automatic construction of English/Chinese parallel corpora

Cited by: 30
Authors
Yang, CC [1]
Li, KW [1]
Affiliation
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Sha Tin, Hong Kong, Peoples R China
Source
Journal of the American Society for Information Science and Technology | 2003 / Vol. 54 / No. 8
DOI
10.1002/asi.10261
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries between languages. However, a general-purpose dictionary is less sensitive to both genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method based on dynamic programming is presented to identify one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in one language may be translated repeatedly as two or more words in the other language, the deletion edit operation is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method, using the daily press release articles of the Hong Kong SAR government as the test bed. The precision of the result is 0.998 and the recall is 0.806. The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited were also used to test the method, giving a precision of 1.00 and a recall of 0.948.
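The abstract does not spell out how the LCS step is computed, so the following is only a minimal sketch of the textbook dynamic-programming LCS, not the authors' implementation. The helper `best_candidate` and the sample strings are hypothetical: the sketch assumes that each English word comes with a list of candidate Chinese translations (for example from a dictionary) and that the candidate sharing the longest subsequence with the Chinese title is taken as the most reliable one.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def best_candidate(candidates: list[str], chinese_title: str) -> str:
    """Hypothetical helper: pick the candidate translation whose LCS
    with the Chinese title is longest (an assumed selection rule)."""
    return max(candidates, key=lambda c: len(lcs(c, chinese_title)))


# Illustrative data only; not taken from the paper's test bed.
title = "香港特區政府新聞公報"
print(best_candidate(["管理", "政府"], title))
```

Matching at the character level in this way tolerates a candidate translation that appears in the title with extra characters interleaved, which an exact substring match would miss.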
Pages: 730-742
Number of pages: 13