FinEst BERT and CroSloEngual BERT: Less Is More in Multilingual Models

Cited by: 19
Authors
Ulcar, Matej [1 ]
Robnik-Sikonja, Marko [1 ]
Affiliation
[1] Univ Ljubljana, Fac Comp & Informat Sci, Vecna Pot 113, Ljubljana, Slovenia
Source
TEXT, SPEECH, AND DIALOGUE (TSD 2020) | 2020 / Vol. 12284
Keywords
Contextual embeddings; BERT model; Less-resourced languages; NLP
DOI
10.1007/978-3-030-58323-1_11
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Research has mostly focused on the English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS tagging, and dependency parsing), using multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.
Pages: 104-111
Page count: 8