Automatic construction of English/Chinese parallel corpora

Cited by: 30
Authors
Yang, CC [1 ]
Li, KW [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Sha Tin, Hong Kong, Peoples R China
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2003 / Vol. 54 / No. 08
Keywords
DOI
10.1002/asi.10261
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries that exist between different languages. However, a general-purpose dictionary is less sensitive to both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method based on dynamic programming is presented to identify one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level, and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in one language may translate repeatedly into two or more words in another language, the deletion edit operation is used to resolve the redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method, using the daily press release articles of the Hong Kong SAR government as the test bed. The precision of the result is 0.998, while the recall is 0.806. The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited are also used to test the method; on this set, the precision is 1.00 and the recall is 0.948.
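The abstract names the longest common subsequence (LCS), computed by dynamic programming, as the tool for matching a candidate Chinese translation against a Chinese title. The sketch below is a generic textbook LCS at the character level, not the authors' implementation: the function names `lcs` and `coverage`, the sample strings, and the coverage ratio are all assumptions made for illustration, and the paper's actual score function is not reproduced here.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def coverage(candidate: str, title: str) -> float:
    """Hypothetical helper: share of the candidate's characters found,
    in order, inside the title. Not the score function from the paper."""
    if not candidate:
        return 0.0
    return len(lcs(candidate, title)) / len(candidate)


# Illustrative strings only; they are not taken from the paper's data.
candidate = "香港政府"
title = "香港特區政府新聞"
print(lcs(candidate, title), coverage(candidate, title))
```

A candidate whose characters all appear in order in the title gets a coverage of 1.0 under this assumed ratio, while characters that have no counterpart in the title are simply skipped, which loosely mirrors the role the abstract gives to the deletion operation.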
Pages: 730 - 742
Number of pages: 13
Related papers
50 in total
  • [21] Automatic Construction of Discourse Corpora for Dialogue Translation
    Wang, Longyue
    Zhang, Xiaojun
    Tu, Zhaopeng
    Way, Andy
    Liu, Qun
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 2748 - 2754
  • [22] Learner construction of corpora for general English in Taiwan
    Smith, Simon
    COMPUTER ASSISTED LANGUAGE LEARNING, 2011, 24 (04) : 291 - 316
  • [23] Building English - Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora
    Kaur, Dilshad
    Singh, Satwinder
    APPLIED COMPUTER SYSTEMS, 2023, 28 (02) : 245 - 251
  • [24] Creating Chinese-English Comparable Corpora
    Huang, Degen
    Wang, Shanshan
    Ren, Fuji
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2013, E96D (08) : 1853 - 1861
  • [25] Parallel Phrase Extraction from English-Vietnamese Parallel Corpora
    Le, Quang-Hung
    Le, Anh-Cuong
    Huynh, Van-Nam
    PROCEEDINGS OF 2013 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES: RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2013, : 175 - 179
  • [26] Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites
    Almeida, Jose Joao
    Simoes, Alberto
    LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : 50 - 55
  • [27] The Application of Parallel Corpora in the Translation Teaching of College English
    Wu, Jiaping
    Peng, Dejing
    2016 5TH EEM INTERNATIONAL CONFERENCE ON PUBLIC ADMINISTRATION & MANAGEMENT (EEM-PAM 2016), 2016, 91 : 106 - 111
  • [28] Extracting Chinese-English Bilingual Core Terminology from Parallel Classified Corpora in Special Domain
    Zhang, Chengzhi
    2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 3, 2009, : 271 - 274
  • [29] The Corpora of China English: Implications for an EFL Dictionary for Chinese Learners of English
    Xia, Lixin
    Xia, Yun
    Zhang, Yihua
    Nesi, Hilary
    LEXIKOS, 2016, 26 : 416 - 435
  • [30] English article acquisition by Chinese learners of English: An analysis of two corpora
    Leroux, William
    Kendall, Tyler
    SYSTEM, 2018, 76 : 13 - 24