Constructing a Chinese-Vietnamese bilingual corpus from subtitle websites

Cited by: 0
Authors
Nguyen, Phuc-Nghi [1]
Tran, Phuoc [1]
Affiliations
[1] Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City
Keywords
Chinese-Vietnamese bilingual corpus; machine translation; Netflix
DOI
10.1504/IJIIDS.2024.141748
Abstract
In this work, we introduce a method for constructing a Chinese-Vietnamese bilingual corpus from subtitle resources. The construction process involves careful curation and preprocessing of the chosen subtitle data to ensure its suitability for training and evaluating machine translation models. We apply rigorous quality control measures to enhance the reliability and relevance of the collected corpus, systematically eliminating entries that do not meet a predetermined level of correctness. We then experiment with two robust neural machine translation models on the collected corpus. The experimental results show that the highest BLEU score obtained on the collected corpus is 22.0, much higher than that obtained on the OpenSubtitles 2016 corpus, one of the most popular subtitle corpora today. By curating a specialised corpus, we aim to contribute a valuable resource to the field of machine translation, fostering advances in the understanding and improvement of translation quality between Chinese and Vietnamese. Copyright © 2024 Inderscience Enterprises Ltd.
Pages: 385-408
Number of pages: 23
Related papers
6 in total
  • [1] A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation
    Tran, Phuoc
    Nguyen, Thien
    Vu, Dinh-Hong
    Tran, Huu-Anh
    Vo, Bay
    IEEE ACCESS, 2022, 10 : 78928 - 78938
  • [2] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Huu-anh Tran
    Yuhang Guo
    Ping Jian
    Shumin Shi
    Heyan Huang
    Journal of Beijing Institute of Technology, 2018, 27 (01) : 127 - 136
  • [3] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Tran H.-A.
    Guo Y.
    Jian P.
    Shi S.
    Huang H.
    Journal of Beijing Institute of Technology (English Edition), 2018, 27 (01) : 127 - 136
  • [4] On the Design of Web Crawlers for Constructing an Efficient Chinese-Portuguese Bilingual Corpus System
    Cheong, Sio Tai
    Xu, Jiabo
    Liu, Yue
    2018 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2018 : 9 - 12
  • [5] Constructing High Quality Bilingual Corpus using Parallel Data from the Web
    Cheok, Sai Man
    Hoi, Lap Man
    Tang, Su-Kit
    Tse, Rita
    PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON INTERNET OF THINGS, BIG DATA AND SECURITY (IOTBDS), 2022 : 127 - 132
  • [6] The extraction method used for English-Chinese machine translation corpus based on bilingual sentence pair coverage
    Dang, Penghua
    OPEN COMPUTER SCIENCE, 2024, 14 (01)