Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures

Cited by: 1
Authors
Wongso, Wilson [1]
Setiawan, David Samuel [1]
Suhartono, Derwin [1]
Affiliations
[1] Bina Nusantara Univ, Sch Comp Sci, Comp Sci Dept, Jakarta, Indonesia
Source
13TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS 2021) | 2021
Keywords
Javanese Language Modeling; Low-resource Languages; Natural Language Understanding; Transformers; Deep Learning
DOI
10.1109/ICACSIS53237.2021.9631331
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Most natural language understanding breakthroughs occur in widely spoken languages, while low-resource languages are rarely examined. We pre-trained and compared different Transformer-based architectures on the Javanese language. The models were trained on causal and masked language modeling tasks, with Javanese Wikipedia documents as the corpus, and could then be fine-tuned to downstream natural language understanding tasks. To speed up pre-training, we transferred English word embeddings, utilized gradual unfreezing of layers, and applied discriminative fine-tuning. We further fine-tuned our models to classify binary movie reviews and found that they were on par with multilingual/cross-lingual Transformers. We release our pre-trained models for others to use, in hopes of encouraging other researchers to work on low-resource languages like Javanese.
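The two pre-training objectives named in the abstract differ in how training targets are built from a token sequence. The following is a minimal sketch of that difference, not the authors' implementation: the function names, the `[MASK]` placeholder string, the 15% masking rate, and the placeholder tokens are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed placeholder; real tokenizers define their own mask token


def causal_lm_pairs(tokens):
    """Causal LM: every left-to-right prefix is paired with the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide a random subset of tokens and keep them as labels.

    mask_prob=0.15 is an assumed default, not a value taken from the abstract.
    Positions that are not masked get a label of None (no loss there).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


if __name__ == "__main__":
    sentence = ["w1", "w2", "w3", "w4"]  # hypothetical token sequence
    for context, target in causal_lm_pairs(sentence):
        print(context, "->", target)
    print(masked_lm_example(sentence, mask_prob=0.5))
```

In the causal case the model only ever sees tokens to the left of the target, whereas in the masked case it sees context on both sides of each hidden position.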
Pages: 29-35
Page count: 7