Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4

Authors:

Das, Sourya Dipta [1]

Mandal, Soumil [2]

Das, Dipankar [1]

Affiliations:

[1] Jadavpur Univ, Kolkata, India

[2] SRM Univ, Chennai, Tamil Nadu, India

Source:

PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019

Keywords:

code-mixing; code-switching; phonetic encoding; character encoding; language identification;

DOI:

10.1145/3368567.3368578

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Language identification of social media text remains a challenging task due to properties such as code-mixing and inconsistent phonetic transliteration. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one using stacking and one using a threshold technique, which gave accuracies of 91.78% and 92.35% respectively on our testing data.

Pages: 60-64

Page count: 5

18 records in total

[11] Automatic Language Identification system for code-mixed English-Kannada Social Media Text
Lakshmi, Sowmya B. S.
Shambhavi, B. R.
2017 2ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS AND INFORMATION TECHNOLOGY FOR SUSTAINABLE SOLUTION (CSITSS-2017), 2017, : 214 - 218
[12] An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
Shekhar, Shashi
Sharma, Dilip Kumar
Beg, M. M. Sufyan
COMPUTACION Y SISTEMAS, 2020, 24 (04): : 1415 - 1427
[13] A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development
Hidayatullah, Ahmad Fathan
Qazi, Atika
Lai, Daphne Teck Ching
Apong, Rosyzie Anna
IEEE ACCESS, 2022, 10 : 122812 - 122831
[14] CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts
Lakshmaiah, Shashirekha Hosahalli
Balouchzahi, Fazlourrahman
Anusha, Mudoor Devadas
Sidorov, Grigori
ACTA POLYTECHNICA HUNGARICA, 2022, 19 (10) : 123 - 141
[15] Pre-trained language model for code-mixed text in Indonesian, Javanese, and English using transformer
Hidayatullah, Ahmad Fathan
Apong, Rosyzie Anna
Lai, Daphne Teck Ching
Qazi, Atika
SOCIAL NETWORK ANALYSIS AND MINING, 2025, 15 (01)
[16] Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification
Himashi Rathnayake
Janani Sumanapala
Raveesha Rukshani
Surangika Ranathunga
Knowledge and Information Systems, 2022, 64 : 1937 - 1966
[17] Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification
Rathnayake, Himashi
Sumanapala, Janani
Rukshani, Raveesha
Ranathunga, Surangika
KNOWLEDGE AND INFORMATION SYSTEMS, 2022, 64 (07) : 1937 - 1966
[18] Sentiment Analysis for Egyptian Arabic-English Code-Switched Data Using Traditional Neural Models and Advanced Language Models
Sherif, Ahmed
Sabty, Caroline
SPEECH AND COMPUTER, SPECOM 2024, PT II, 2025, 15300 : 54 - 69
