POS Tagging without a Tagger: Using Aligned Corpora for Transferring Knowledge to Under-Resourced Languages

Cited by: 2
Authors
Khemakhem, Ines Turki [1 ]
Jamoussi, Salma [1 ]
Ben Hamadou, Abdelmajid [1 ]
Affiliation
[1] Univ Sfax, MIRACL Lab, Sfax, Tunisia
Source
COMPUTACION Y SISTEMAS | 2016 / Vol. 20 / No. 04
Keywords
POS tagging; alignment; parallel corpus; under-resourced languages;
DOI
10.13053/CyS-20-4-2430
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Almost all languages lack sufficient resources and tools for developing Human Language Technologies (HLT). These technologies are mostly developed for languages for which large resources and tools are available. In this paper, we deal with under-resourced languages, which can benefit from the resources and tools available for other languages to develop their own HLT. We consider as an example the POS tagging task, which is among the most fundamental Natural Language Processing tasks. The task is important because it assigns to words tags that highlight their morphological features, taking the corresponding contexts into account. The solution we propose in this work is based on using an aligned parallel corpus as a bridge between a rich-resourced language and an under-resourced language. This kind of corpus is usually available. The rich-resourced language side of the corpus is annotated first. These POS annotations are then exploited to predict the annotation on the under-resourced language side by using alignment training. After this training step, we obtain a matching table between the two languages, which is exploited to annotate an input text. The proposed approach is tested on one pair of languages: English as the rich-resourced language and Arabic as the under-resourced language. We used the IWSLT10 training corpus and the English TreeTagger [15]. The approach was evaluated on a test corpus extracted from IWSLT08 and obtained an F-score of 89%. It can be extended to other NLP tasks.
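The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the alignment is trained or how the matching table is built, so the sketch assumes that word alignments are already given as (source index, target index) pairs and that the table maps each target word to its most frequently projected tag. The function names `build_matching_table` and `annotate`, the `UNK` fallback tag, and the transliterated toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_matching_table(tagged_source, target_sentences, alignments):
    """Project source-side POS tags onto target words through alignment links.

    tagged_source:    list of sentences, each a list of (word, tag) pairs
    target_sentences: list of sentences, each a list of target words
    alignments:       list of link lists, each link a (src_idx, tgt_idx) pair
    Returns a dict mapping each aligned target word to its most frequent tag.
    """
    counts = defaultdict(Counter)
    for src, tgt, links in zip(tagged_source, target_sentences, alignments):
        for src_idx, tgt_idx in links:
            _, tag = src[src_idx]
            counts[tgt[tgt_idx]][tag] += 1
    # Assumption: keep only the majority tag for each target word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def annotate(tokens, table, unknown_tag="UNK"):
    """Tag an input sentence by table lookup; unseen words get unknown_tag."""
    return [(tok, table.get(tok, unknown_tag)) for tok in tokens]


# Toy parallel data (hypothetical, transliterated Arabic on the target side).
tagged_source = [
    [("the", "DT"), ("book", "NN")],
    [("a", "DT"), ("new", "JJ"), ("book", "NN")],
    [("the", "DT"), ("new", "JJ"), ("house", "NN")],
]
target_sentences = [
    ["kitab"],
    ["kitab", "jadid"],
    ["bayt", "jadid"],
]
alignments = [
    [(1, 0)],
    [(2, 0), (1, 1)],
    [(2, 0), (1, 1)],
]

table = build_matching_table(tagged_source, target_sentences, alignments)
print(annotate(["kitab", "jadid", "kabir"], table))
```

A lookup table of this kind ignores context, so an ambiguous target word always receives its majority tag; the published approach may resolve such cases differently.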
Pages: 667 - 679
Page count: 13
Related papers
15 in total
  • [1] A Collection of Comparable Corpora for Under-resourced Languages
    Skadina, Inguna
    Aker, Ahmet
    Giouli, Voula
    Tufis, Dan
    Gaizauskas, Robert
    Mierina, Madara
    Mastropavlos, Nikos
    HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE, 2010, 219 : 161 - 168
  • [2] Using machine learning to build POS tagger for under-resourced language: the case of Somali
    Mohammed S.
    International Journal of Information Technology, 2020, 12 (3) : 717 - 729
  • [3] Collecting and annotating corpora for three under-resourced languages of France: Methodological issues
    Bernhard, Delphine
    Ligozat, Anne-Laure
    Bras, Myriam
    Martin, Fanny
    Vergez-Couret, Marianne
    Erhart, Pascale
    Sibille, Jean
    Todirascu, Amalia
    de Mareuil, Philippe Boula
    Huck, Dominique
    LANGUAGE DOCUMENTATION & CONSERVATION, 2021, 15 : 316 - 357
  • [4] ADAPTING ASR FOR UNDER-RESOURCED LANGUAGES USING MISMATCHED TRANSCRIPTIONS
    Liu, Chunxi
    Jyothi, Preethi
    Tang, Hao
    Manohar, Vimal
    Sloan, Rose
    Kekona, Tyler
    Hasegawa-Johnson, Mark
    Khudanpur, Sanjeev
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5840 - 5844
  • [5] WordNet construction for under-resourced languages using personalized PageRank
    Berangi, Parisa
    Mousavi, Zahra
    Faili, Heshaam
    Shakery, Azadeh
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2021, 36 (03) : 565 - 580
  • [6] Speech recognition of under-resourced languages using mismatched transcriptions
    Do, Van Hai
    Chen, Nancy F.
    Lim, Boon Pang
    Hasegawa-Johnson, Mark
    PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2016, : 112 - 115
  • [7] Using Resource-Rich Languages to Improve Morphological Analysis of Under-Resourced Languages
    Baumann, Peter
    Pierrehumbert, Janet
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3355 - 3359
  • [8] Toward a Lightweight Solution for Less-resourced Languages: Creating a POS Tagger for Alsatian Using Voluntary Crowdsourcing
    Millour, Alice
    Fort, Karen
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 455 - 460
  • [9] Building synthetic voices for under-resourced languages: the feasibility of using audiobook data
    de Wet, Febe
    Van der Walt, Willem
    Dlamini, Nkosikhona
    Govender, Avashna
    2017 PATTERN RECOGNITION ASSOCIATION OF SOUTH AFRICA AND ROBOTICS AND MECHATRONICS (PRASA-ROBMECH), 2017, : 225 - 229
  • [10] IMPROVING HMM/DNN IN ASR OF UNDER-RESOURCED LANGUAGES USING PROBABILISTIC SAMPLING
    Song, Meixu
    Zhang, Qingqing
    Pan, Jielin
    Yan, Yonghong
    2015 IEEE CHINA SUMMIT & INTERNATIONAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING, 2015, : 20 - 24