Sliding Window and Parallel LSTM with Attention and CNN for Sentence Alignment on Low-Resource Languages

Cited by: 2
Authors
Tan, Tien-Ping [1 ]
Lim, Chai Kim [1 ]
Rahman, Wan Rose Eliza Abdul [2 ]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Gelugor 11800, Penang, Malaysia
[2] Univ Sains Malaysia, Sch Humanities, Gelugor 11800, Penang, Malaysia
Source
Keywords
Attention; CNN; LSTM; parallel text; sentence alignment; TEXT;
DOI
10.47836/pjst.30.1.06
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
A parallel text corpus is an important resource for building a machine translation (MT) system. Existing resources such as translated documents, bilingual dictionaries, and translated subtitles are excellent sources for constructing a parallel text corpus. A sentence alignment algorithm automatically aligns source and target sentences, since manual sentence alignment is resource-intensive. Over the years, sentence alignment approaches have progressed from sentence-length heuristics to statistical lexical models to deep neural networks. Framing sentence alignment as a classification problem is attractive because classification is a core task in machine learning. This paper proposes a parallel long short-term memory network with attention and a convolutional neural network (parallel LSTM+Attention+CNN) to classify two sentences as parallel or non-parallel. A sliding window approach is also proposed to apply the classifier when aligning sentences in the source and target languages. The proposed approach was compared with three classifiers, namely a feedforward neural network, a CNN, and a bi-directional LSTM, as well as with the BleuAlign sentence alignment system. The classification accuracy of these models was evaluated on a Malay-English parallel text corpus and the UN French-English parallel text corpus. Malay-English sentence alignment performance was then evaluated on research documents and a very challenging Classical Malay-English document. The proposed classifier obtained more than 80% accuracy in categorizing parallel/non-parallel sentences with a model trained on only five thousand parallel sentences, and it achieved higher sentence alignment accuracy than the baseline systems.
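The abstract describes two ideas: a two-branch ("parallel") LSTM encoder with attention whose pair representation is passed to a CNN for parallel/non-parallel classification, and a sliding-window search that applies this classifier to align source and target sentences. The sketch below is a minimal illustration of that idea only, not the authors' published architecture: the layer sizes, the additive attention, the Conv1d over the stacked sentence encodings, and the sliding_window_align helper (window size, acceptance threshold, greedy advance of the target pointer) are all assumptions made for demonstration.

```python
# Illustrative sketch only; layer sizes, attention form, and the alignment
# heuristic are assumptions, not the architecture published in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveLSTMEncoder(nn.Module):
    """Encode a sentence with an LSTM and summarise it via additive attention."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                      # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, hidden)
        weights = F.softmax(self.attn(states), dim=1)  # attention over time steps
        return (weights * states).sum(dim=1)           # (batch, hidden)


class ParallelSentenceClassifier(nn.Module):
    """Two LSTM+attention branches; a 1-D CNN over the stacked sentence
    encodings; a sigmoid output giving P(pair is parallel)."""
    def __init__(self, src_vocab, tgt_vocab, hidden_dim=128):
        super().__init__()
        self.src_enc = AttentiveLSTMEncoder(src_vocab, hidden_dim=hidden_dim)
        self.tgt_enc = AttentiveLSTMEncoder(tgt_vocab, hidden_dim=hidden_dim)
        self.conv = nn.Conv1d(in_channels=2, out_channels=16, kernel_size=3, padding=1)
        self.out = nn.Linear(16 * hidden_dim, 1)

    def forward(self, src_ids, tgt_ids):
        pair = torch.stack([self.src_enc(src_ids),
                            self.tgt_enc(tgt_ids)], dim=1)   # (batch, 2, hidden)
        features = F.relu(self.conv(pair)).flatten(1)
        return torch.sigmoid(self.out(features))


def sliding_window_align(model, src_sents, tgt_sents, window=5, threshold=0.5):
    """For each source sentence, score target candidates inside a window around
    the running target position; keep the best pair that clears the threshold."""
    alignments, tgt_pos = [], 0
    model.eval()
    with torch.no_grad():
        for i, src in enumerate(src_sents):            # src/tgt: 1-D LongTensors of token ids
            lo = max(0, tgt_pos - window)
            hi = min(len(tgt_sents), tgt_pos + window + 1)
            scores = [(model(src.unsqueeze(0), tgt_sents[j].unsqueeze(0)).item(), j)
                      for j in range(lo, hi)]
            best_score, best_j = max(scores)
            if best_score >= threshold:
                alignments.append((i, best_j))
                tgt_pos = best_j + 1                   # greedily advance past the match
    return alignments
```

In this sketch a pair scoring above the threshold inside the window is accepted as an alignment and the window then advances past the matched target sentence, a simple greedy left-to-right reading of the sliding-window idea in the abstract.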
Pages: 97+
Number of pages: 26