Are Word Embedding Methods Stable and Should We Care About It?

Cited by: 6
Authors
Borah, Angana [1 ]
Barman, Manash Pratim [2 ]
Awekar, Amit [3 ]
Affiliations
[1] Natl Inst Technol Silchar, Silchar, Assam, India
[2] GAN Studio Inc, New Delhi, India
[3] Indian Inst Technol Guwahati, Gauhati, Assam, India
Source
PROCEEDINGS OF THE 32ND ACM CONFERENCE ON HYPERTEXT AND SOCIAL MEDIA (HT '21) | 2021
Keywords
NLP; word embedding; stability evaluation;
DOI
10.1145/3465336.3475098
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
A representation learning method is considered stable if it consistently generates similar representations of the given data across multiple runs. Word Embedding Methods (WEMs) are a class of representation learning methods that generate a dense vector representation for each word in the given text data. The central idea of this paper is to explore the stability measurement of WEMs using intrinsic evaluation based on word similarity. We experiment with three popular WEMs: Word2Vec, GloVe, and fastText. For stability measurement, we investigate the effect of five parameters involved in training these models. We perform experiments using four real-world datasets from different domains: Wikipedia, news, song lyrics, and European Parliament proceedings. We also observe the effect of WEM stability on two downstream tasks: clustering and fairness evaluation. Our experiments indicate that, among the three WEMs, fastText is the most stable, followed by GloVe and Word2Vec.
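
Illustrative sketch (not from the paper): one common intrinsic, word-similarity-based way to quantify the stability described above is to train the same model twice with different random seeds and compare each word's top-k nearest neighbours across the two runs. The sketch below assumes gensim's Word2Vec; the toy corpus, the Jaccard overlap metric, and all hyperparameter values are illustrative assumptions, not the paper's exact experimental setup.

    # Minimal stability sketch: train Word2Vec twice with different seeds
    # and measure how much a word's nearest-neighbour set changes.
    # Assumes gensim >= 4.0; corpus and hyperparameters are illustrative.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "lay", "on", "the", "rug"],
        # ... a real corpus would be far larger
    ]

    def train(seed):
        # workers=1 keeps each run reproducible for its seed
        return Word2Vec(corpus, vector_size=50, window=2, min_count=1,
                        seed=seed, workers=1, epochs=20)

    def neighbour_overlap(m1, m2, word, k=10):
        # Jaccard overlap of the top-k nearest neighbours of `word`
        # in two independently trained models: 1.0 = identical sets.
        n1 = {w for w, _ in m1.wv.most_similar(word, topn=k)}
        n2 = {w for w, _ in m2.wv.most_similar(word, topn=k)}
        return len(n1 & n2) / len(n1 | n2)

    m_a, m_b = train(seed=1), train(seed=2)
    print(neighbour_overlap(m_a, m_b, "cat", k=3))

Averaging this overlap over the vocabulary would give a single stability score per training configuration, the kind of quantity that can then be compared across Word2Vec, GloVe, and fastText and across parameter settings.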
Pages: 45-55
Page count: 11