Automatic construction of English/Chinese parallel corpora

Cited by: 30
Authors
Yang, CC [1 ]
Li, KW [1 ]
Affiliation
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Sha Tin, Hong Kong, Peoples R China
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2003 / Vol. 54 / No. 8
DOI
10.1002/asi.10261
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
As the demand for global information increases, multilingual corpora have become a valuable linguistic resource for cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries between languages. However, general-purpose dictionaries are less sensitive to both genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these pair Indo-European languages, such as English/French and English/Spanish; Asian/Indo-European corpora, especially English/Chinese corpora, are relatively scarce. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. This paper presents an alignment method, based on dynamic programming, that identifies one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level, and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in one language may translate into two or more words repetitively in another language, the edit operation of deletion is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments were conducted to evaluate the proposed method using the daily press release articles of the Hong Kong SAR government as the test bed; the precision is 0.998 and the recall is 0.806. The method was also tested on the release articles and speech articles published by the Hongkong & Shanghai Banking Corporation Limited, where the precision is 1.00 and the recall is 0.948.
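The abstract names the longest common subsequence (LCS) and dynamic programming but does not give the procedure. The following is a minimal sketch of the textbook character-level LCS computation, not the authors' implementation; the function name `lcs` and the sample strings are hypothetical and chosen only for illustration. The paper's score function and deletion step are not reproduced here.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right cell to recover one subsequence.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


# Hypothetical example: a candidate dictionary translation compared,
# character by character, against a Chinese title.
candidate = "香港政府"
title = "香港特别行政区政府发表声明"
print(lcs(candidate, title))
```

In this sketch a longer LCS relative to the candidate's length suggests a closer match between the candidate translation and the title; how the paper turns such matches into a score is not specified in the abstract.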
Pages: 730-742
Page count: 13