Text Classification by CEFR Levels Using Machine Learning Methods and the BERT Language Model

Cited by: 3

Authors:

Lagutina, N. S. [1]

Lagutina, K. V. [1]

Brederman, A. M. [1]

Kasatkina, N. N. [1]

Affiliations:

[1] Demidov Yaroslavl State Univ, Yaroslavl 150003, Russia

Source:

AUTOMATIC CONTROL AND COMPUTER SCIENCES | 2024 / Vol. 58 / No. 7

Keywords:

automatic text processing; text classification; CEFR; BERT

DOI:

10.3103/S0146411624700329

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing a student's knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, vector text models are built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors are classified with standard machine learning classifiers; the article reports results for the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and the F-measure serve as the quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, are used in the experiments. The best classification result for six CEFR levels and sublevels from A1 to C2 is achieved by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach is compared with the BERT language model (six different variants). The best model, bert-base-cased, reaches an F-score of 69%. An analysis of the classification errors shows that most of them occur between neighboring levels, which is to be expected given the domain. In addition, classification quality depends strongly on the text corpus: the same text models yield significantly different F-scores on different corpora. Overall, the results show that automatic text level determination is effective and can be applied in practice.
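As a rough illustration of the ideas named in the abstract, the sketch below computes a few stylometric features at the character, word, and sentence levels and then scores predicted CEFR labels with precision, recall, and the F-measure. It is only a sketch under stated assumptions: the four features, the function names `stylometric_features` and `macro_f1`, and the choice of macro averaging are hypothetical and are not taken from the paper, whose actual feature set and averaging scheme are not given in this record.

```python
import re
from statistics import mean


def stylometric_features(text):
    """Illustrative (assumed) features, one or two per level of analysis."""
    # Sentence level: split on terminal punctuation, drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Word level: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = len(text)
    return {
        # Character level: share of mid-sentence punctuation marks.
        "punct_ratio": sum(ch in ",;:" for ch in text) / n_chars if n_chars else 0.0,
        # Word level: average word length and vocabulary diversity.
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / len(words) if words else 0.0,
        # Sentence level: average number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }


def macro_f1(y_true, y_pred):
    """Per-class F-measure averaged over classes (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return mean(scores)


if __name__ == "__main__":
    print(stylometric_features("I like cats. I like dogs."))
    # One A2 essay is mislabeled as the neighboring-range level B1.
    print(macro_f1(["A1", "A2", "B1", "B1"], ["A1", "B1", "B1", "B1"]))
```

In a full pipeline, feature vectors of this kind would be passed to a classifier such as a support vector machine; that step is omitted here to keep the sketch dependency-free.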

Pages: 869-878

Page count: 10
