Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4

Authors:

Das, Sourya Dipta ^{[1]}

Mandal, Soumil ^{[2]}

Das, Dipankar ^{[1]}

Affiliations:

[1] Jadavpur Univ, Kolkata, India

[2] SRM Univ, Chennai, Tamil Nadu, India

Source:

PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019

Keywords:

code-mixing; code-switching; phonetic encoding; character encoding; language identification;

DOI:

10.1145/3368567.3368578

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Language identification of social media text remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models using stacking and threshold techniques, which gave 91.78% and 92.35% accuracy, respectively, on our testing data.
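
The abstract does not spell out the encoding or the threshold rule, so the following is only a minimal sketch under stated assumptions. It shows one plausible way to turn a word into a fixed-length sequence of character indices (the kind of input an LSTM could consume) and one plausible confidence-threshold rule for combining the outputs of two models. The names `build_char_vocab`, `encode_word`, and `threshold_ensemble`, the `max_len` and `threshold` values, and the labels `"bn"` and `"en"` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List

PAD = 0  # index reserved for padding (assumption, not from the paper)
UNK = 1  # index reserved for characters unseen at vocabulary-building time


def build_char_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Map each lowercase character seen in `words` to an integer index >= 2."""
    chars = sorted({ch for w in words for ch in w.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_word(word: str, vocab: Dict[str, int], max_len: int = 12) -> List[int]:
    """Encode a word as character indices, truncated or right-padded to max_len."""
    ids = [vocab.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def threshold_ensemble(p_char: float, p_phone: float, threshold: float = 0.8) -> str:
    """Combine two P(word is Bengali) scores with a hypothetical threshold rule.

    If the character model is confident (in either direction), use its score;
    otherwise fall back to the average of the two models' scores.
    """
    confidence = max(p_char, 1.0 - p_char)
    p = p_char if confidence >= threshold else (p_char + p_phone) / 2.0
    return "bn" if p >= 0.5 else "en"


if __name__ == "__main__":
    vocab = build_char_vocab(["bhalo", "good"])
    print(encode_word("good", vocab, max_len=6))
    print(threshold_ensemble(0.6, 0.1))
```

A stacking ensemble, the other technique named in the abstract, would instead train a second-level classifier on the two models' outputs; that is not shown here.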


Pages: 60 - 64

Number of pages: 5

18 items in total

[1] Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora
Jamatia, Anupam
Das, Amitava
Gambäck, Björn
JOURNAL OF INTELLIGENT SYSTEMS, 2019, 28 (03) : 399 - 408
[2] Abusive Comment Detection from Bengali-English Code-Mixed Social Media Texts Using Ensemble of Deep Learning
Fahim, Iftekhar
Ahsan, Shawly
Hoque, Mohammed Moshiul
ARTIFICIAL INTELLIGENCE AND KNOWLEDGE PROCESSING, AIKP 2024, 2025, 2228 : 252 - 267
[3] Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text
Veena, P. V.
Kumar, M. Anand
Soman, K. P.
COMPUTACION Y SISTEMAS, 2018, 22 (01) : 65 - 74
[4] Word Level Language Identification in Assamese-Bengali-Hindi-English Code-Mixed Social Media Text
Sarma, Neelakshi
Singh, Sanasam Ranbir
Goswami, Diganta
2018 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2018 : 261 - 266
[5] Transformer Based Language Identification for Malayalam-English Code-Mixed Text
Thara, S.
Poornachandran, Prabaharan
IEEE ACCESS, 2021, 9 : 118837 - 118850
[6] Language Detection in Sinhala-English Code-mixed Data
Smith, Ian
Thayasivam, Uthayasanker
PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019 : 228 - 233
[7] Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets
Hidayatullah A.F.
Apong R.A.
Lai D.T.C.
Qazi A.
PeerJ Computer Science, 2023, 9
[8] Language identification framework in code-mixed social media text based on quantum LSTM - the word belongs to which language?
Shekhar, Shashi
Sharma, Dilip Kumar
Beg, M. M. Sufyan
MODERN PHYSICS LETTERS B, 2020, 34 (06)
[9] Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets
Hidayatullah, Ahmad Fathan
Apong, Rosyzie Anna
Lai, Daphne T. C.
Qazi, Atika
PEERJ COMPUTER SCIENCE, 2023, 9
[10] Deep Learning Based Sentiment Analysis in a Code-Mixed English-Hindi and English-Bengali Social Media Corpus
Jamatia, Anupam
Swamy, Steve Durairaj
Gambäck, Björn
Das, Amitava
Debbarma, Swapan
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2020, 29 (05)
