Automatic Acquisition of Large-scale Academic Bilingual Parallel Corpus from the Web

Cited by: 0
Authors
Han Yong [1]
Li Yu [1]
He Xiaoning [1]
Yang Muyun
Lei Guohua [1]
Affiliations
[1] Heilongjiang Inst Technol, Comp Sci & Technol Dept, Harbin, Peoples R China
Source
2009 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING | 2009
Funding
National Natural Science Foundation of China
Keywords
data mining; bilingual parallel corpora acquisition; bilingual term acquisition
DOI
10.1109/IALP.2009.75
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we describe a system that automatically acquires a large-scale Chinese-English bilingual parallel corpus from the China Journals Full-text Database (CJFD), a component of the China National Knowledge Infrastructure (CNKI). The system extracts a large amount of parallel text with domain information from the structured bilingual texts already present in CJFD, such as the Chinese and English abstracts and titles of academic articles. The acquired Chinese-English parallel corpus is several orders of magnitude larger than comparable corpora known to us. In addition, the system collects a large number of bilingual terms that can be applied directly to lexical acquisition.
Pages: 318-321
Page count: 4