Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4
Authors
Das, Sourya Dipta [1]
Mandal, Soumil [2]
Das, Dipankar [1]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] SRM Univ, Chennai, Tamil Nadu, India
Source
PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019
Keywords
code-mixing; code-switching; phonetic encoding; character encoding; language identification
DOI
10.1145/3368567.3368578
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Language identification of social media text remains a challenging task because of properties such as code-mixing and inconsistent phonetic transliteration. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one based on stacking and one on a threshold technique, which achieved accuracies of 91.78% and 92.35%, respectively, on our test data.
Pages: 60-64
Number of pages: 5