IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus

Cited by: 0
Authors
Larasati, Septina Dian [1]
Affiliations
[1] Charles Univ Prague, Fac Math & Phys, Inst Formal & Appl Linguist, Prague, Czech Republic
Source
LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012
Keywords
Indonesian; Corpus; Morphology;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification codes
030303 ; 0501 ; 050102 ;
Abstract
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to ensure its quality. We also apply language-specific text processing, such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: 'plain', stored in text format, and 'morphologically enriched', stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
Pages: 902-906
Page count: 5