Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

Cited by: 0
Authors
Moussaoui O. [1]
El Younoussi Y. [1]
Affiliations
[1] Information System and Software Engineering, National School of Applied Sciences, Abdelmalek Essaadi University
Keywords
BERT; Machine Learning; Moroccan Dialect; Natural Language Processing; Pre-trained; RoBERTa
DOI
10.13164/mendel.2023.1.055
Abstract
This research article presents a comprehensive study on the pre-training of two language models, MorRoBERTa and MorrBERT, for the Moroccan Dialect, using the Masked Language Modeling (MLM) pre-training approach. The study details the various data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 billion tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. The pre-training process was carried out using the HuggingFace Transformers API, and the paper elaborates on the configurations and training methodologies of the models. The study concludes by demonstrating the high accuracy rates achieved by both MorRoBERTa and MorrBERT in multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect. © 2023, Brno University of Technology. All rights reserved.
Pages: 55-61
Number of pages: 6