An effective cybernated word embedding system for analysis and language identification in code-mixed social media text

Cited by: 4
Authors
Shekhar, Shashi [1]
Sharma, Dilip Kumar [1]
Beg, M. M. Sufyan [2]
Affiliations
[1] GLA Univ, Dept Comp Engn & Applicat, Mathura 281406, India
[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh 202002, Uttar Pradesh, India
Keywords
Language identification; transliteration; character embedding; word embedding; Natural Language Processing; cBoW; skip-gram
DOI
10.3233/KES-190409
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Social media users today frequently write code-mixed text, i.e., text that mixes two or more languages. This paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a bidirectional Long Short-Term Memory (BiLSTM) model. Social media platforms are now widely used by people to express their opinions and interests. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and skip-gram models that predicts the language of origin of each word in a sequence from the words that precede it. The context capture module of the system yields better accuracy with the word embedding model than with character embedding.
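The two embedding objectives named in the abstract differ in what they predict: cBoW predicts a target word from its surrounding context, while skip-gram predicts each context word from the target. The following is a minimal sketch of how training pairs might be generated for each objective. It is not the authors' implementation; the function names, the window size, and the romanized Hindi-English example sentence are assumptions made for illustration only.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position i (hypothetical helper)."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """cBoW-style pairs: (context words, target word), one pair per token."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram-style pairs: (target word, one context word), one pair per neighbor."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Hypothetical code-mixed sentence (romanized Hindi mixed with English).
tokens = ["yaar", "this", "movie", "bahut", "acchi", "thi"]

for context, target in cbow_pairs(tokens, window=1):
    print("cBoW     ", context, "->", target)

for target, ctx in skipgram_pairs(tokens, window=1):
    print("skip-gram", target, "->", ctx)
```

In practice these pairs would feed an embedding model, and the resulting vectors would be passed to a sequence tagger such as the BiLSTM mentioned in the abstract; that stage is omitted here.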
Pages: 167-179
Number of pages: 13