Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4
Authors
Das, Sourya Dipta [1]
Mandal, Soumil [2]
Das, Dipankar [1]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] SRM Univ, Chennai, Tamil Nadu, India
Source
PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019
Keywords
code-mixing; code-switching; phonetic encoding; character encoding; language identification
DOI
10.1145/3368567.3368578
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Language identification of social media text remains challenging because of properties such as code-mixing and inconsistent phonetic transliteration. In this paper, we present a supervised learning approach to word-level language identification for low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one using stacking and one using a threshold technique, which achieved accuracies of 91.78% and 92.35%, respectively, on our testing data.
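The abstract does not say how the threshold ensemble combines the two models, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each model outputs, for every word, a probability that the word is Bengali; the function names, the `weight` parameter, the averaging rule, and the `"bn"`/`"en"` labels are all hypothetical.

```python
from typing import List


def threshold_ensemble(p_char: float, p_phon: float,
                       threshold: float = 0.5, weight: float = 0.5) -> str:
    """Combine two per-word Bengali probabilities into one label.

    ASSUMPTION: a weighted average compared against a threshold. The
    source does not specify the actual combination rule.
    """
    combined = weight * p_char + (1.0 - weight) * p_phon
    return "bn" if combined >= threshold else "en"


def tag_words(char_probs: List[float], phon_probs: List[float],
              threshold: float = 0.5) -> List[str]:
    """Label each word from the paired outputs of the two models."""
    return [threshold_ensemble(c, p, threshold)
            for c, p in zip(char_probs, phon_probs)]


# Hypothetical probabilities for a three-word sentence.
labels = tag_words([0.9, 0.2, 0.6], [0.8, 0.1, 0.3])
```

In a setup like this, the threshold would normally be tuned on held-out data rather than fixed at 0.5.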
Pages: 60-64
Page count: 5
Related papers
18 in total
  • [11] Automatic Language Identification system for code-mixed English-Kannada Social Media Text
    Lakshmi, Sowmya B. S.
    Shambhavi, B. R.
    2017 2ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS AND INFORMATION TECHNOLOGY FOR SUSTAINABLE SOLUTION (CSITSS-2017), 2017, : 214 - 218
  • [12] An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
    Shekhar, Shashi
    Sharma, Dilip Kumar
    Beg, M. M. Sufyan
    COMPUTACION Y SISTEMAS, 2020, 24 (04): : 1415 - 1427
  • [13] A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development
    Hidayatullah, Ahmad Fathan
    Qazi, Atika
    Lai, Daphne Teck Ching
    Apong, Rosyzie Anna
    IEEE ACCESS, 2022, 10 : 122812 - 122831
  • [14] CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts
    Lakshmaiah, Shashirekha Hosahalli
    Balouchzahi, Fazlourrahman
    Anusha, Mudoor Devadas
    Sidorov, Grigori
    ACTA POLYTECHNICA HUNGARICA, 2022, 19 (10) : 123 - 141
  • [15] Pre-trained language model for code-mixed text in Indonesian, Javanese, and English using transformer
    Hidayatullah, Ahmad Fathan
    Apong, Rosyzie Anna
    Lai, Daphne Teck Ching
    Qazi, Atika
    SOCIAL NETWORK ANALYSIS AND MINING, 2025, 15 (01)
  • [17] Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification
    Rathnayake, Himashi
    Sumanapala, Janani
    Rukshani, Raveesha
    Ranathunga, Surangika
    KNOWLEDGE AND INFORMATION SYSTEMS, 2022, 64 (07) : 1937 - 1966
  • [18] Sentiment Analysis for Egyptian Arabic-English Code-Switched Data Using Traditional Neural Models and Advanced Language Models
    Sherif, Ahmed
    Sabty, Caroline
    SPEECH AND COMPUTER, SPECOM 2024, PT II, 2025, 15300 : 54 - 69