String Kernel-Based Techniques for Native Language Identification

Cited by: 0

Authors:

Vamshi Kumar Gurram

J. Sanil

V. S. Anoop

S. Asharaf

Affiliations:

[1] Kerala University of Digital Sciences, School of Computer Science and Engineering

[2] Innovation and Technology, School of Digital Sciences

[3] Kerala University of Digital Sciences

[4] Innovation and Technology

Source:

Human-Centric Intelligent Systems | 2023 / Vol. 3 / Issue 3

Keywords:

Kernel methods; Native language identification; String kernel; Text feature extraction

DOI:

10.1007/s44230-023-00029-z

Abstract:

In recent years, Native Language Identification (NLI) has attracted significant interest in computational linguistics. NLI infers an author’s native language from their speech or writing in a second language. It has applications in forensic linguistics, language teaching, second language acquisition, authorship attribution, and the identification of spam emails or phishing websites. Conventional pairwise string comparison techniques are computationally expensive and time-consuming. This paper presents fast NLI techniques based on string kernels (spectrum, presence bits, and intersection string kernels) combined with different learners: a Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost, XGB). Feature sets for the proposed techniques are generated from different combinations of features such as word n-grams and noun phrases. Experimental analyses are carried out on 8235 English-as-a-second-language articles from 10 different linguistic backgrounds, drawn from a typical NLP benchmark dataset. The experimental results show that the proposed NLI technique combining a spectrum string kernel with an RF classifier outperformed existing character n-gram string kernels combined with SVM, RF, and XGB classifiers. Comparable results were observed among different combinations of string kernels. Interestingly, the RF classifier outperformed the SVM and XGB classifiers across different feature sets. All the proposed NLI techniques demonstrated promising results with a significant improvement in training time, with the best result attaining more than a 95 percent decrease in training time. The reduced training time of the proposed techniques makes them well suited to scaling NLI applications for production.
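As an illustration of the three kernels named in the abstract, the sketch below computes them over word n-gram counts using their commonly cited definitions: the spectrum kernel sums products of shared n-gram counts, the presence bits kernel counts shared n-gram types, and the intersection kernel sums the minimum of the two counts. This is not the authors' implementation; the function names, the whitespace tokenizer, and the toy sentences are assumptions made for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def spectrum_kernel(a, b):
    """Sum over shared n-grams of count_a * count_b."""
    return sum(count * b[gram] for gram, count in a.items())


def presence_bits_kernel(a, b):
    """Number of distinct n-grams that occur in both texts."""
    return sum(1 for gram in a if gram in b)


def intersection_kernel(a, b):
    """Sum over shared n-grams of min(count_a, count_b)."""
    return sum(min(count, b[gram]) for gram, count in a.items())


# Toy inputs (hypothetical, for illustration only).
essay_a = "the cat sat on the mat the cat"
essay_b = "the cat ran to the cat on the mat"
grams_a = word_ngrams(essay_a, 2)
grams_b = word_ngrams(essay_b, 2)

print("spectrum:", spectrum_kernel(grams_a, grams_b))
print("presence bits:", presence_bits_kernel(grams_a, grams_b))
print("intersection:", intersection_kernel(grams_a, grams_b))
```

Evaluating one of these functions for every pair of documents yields a kernel matrix that could be passed to a kernel-based learner; how the paper couples the kernels with RF and XGB classifiers is not specified in the abstract.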

Pages: 402-415

Page count: 13
