Word sense induction with agglomerative clustering and mutual information maximization

Cited by: 0
Authors
Abdine, Hadi [1]
Eddine, Moussa Kamal [1]
Buscaldi, Davide [2]
Vazirgiannis, Michalis [1]
Affiliations
[1] Ecole Polytech, LIX, Palaiseau, France
[2] Univ Sorbonne Paris Nord, LIPN, Paris, France
Source
AI Open | 2023 / Vol. 4
Keywords
Word sense induction; Unsupervised machine learning; Natural language processing; Transformer; BERT; Mutual information; Clustering
DOI
10.1016/j.aiopen.2023.12.001
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Word sense induction (WSI) is a challenging problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e., meanings). Recent work achieves significant results on the WSI task by pre-training a language model that can exclusively disambiguate word senses. In contrast, others employ off-the-shelf pre-trained language models with additional strategies to induce senses. This paper proposes a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). The IIC loss is used to train a small model to maximize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher-quality vector representation for use in the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically show that our approach is at least on par with the state-of-the-art baselines, outperforming them in several configurations. The code and data to reproduce this work are publicly available.
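To make the mutual information objective mentioned in the abstract more concrete, the following is a minimal sketch of the quantity that an IIC-style loss maximizes. It is not the authors' implementation. It assumes that each of the two representations of the target word has already been mapped to a soft cluster assignment (a probability vector); the function name `iic_mutual_information` and its parameters are hypothetical and chosen for illustration only.

```python
import math


def iic_mutual_information(z, z_prime, eps=1e-12):
    """Mutual information between paired soft cluster assignments.

    z, z_prime: lists of probability vectors of equal length k, where
    z[n] and z_prime[n] are assumed to describe the same target word
    in two paraphrases (an assumption made for this sketch).
    """
    n = len(z)
    k = len(z[0])

    # Joint distribution over cluster pairs, averaged over all pairs.
    joint = [[0.0] * k for _ in range(k)]
    for a, b in zip(z, z_prime):
        for i in range(k):
            for j in range(k):
                joint[i][j] += a[i] * b[j] / n

    # Symmetrize, since the order of the two views is arbitrary.
    joint = [[(joint[i][j] + joint[j][i]) / 2 for j in range(k)]
             for i in range(k)]

    # Marginal distributions of each view.
    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    # I(z; z') = sum_ij P_ij * log(P_ij / (P_i * P_j)); skip empty cells.
    mi = 0.0
    for i in range(k):
        for j in range(k):
            p = joint[i][j]
            if p > eps:
                mi += p * math.log(p / (row[i] * col[j]))
    return mi
```

Under this sketch, a training loss would be the negative of the returned value, so that minimizing the loss maximizes the mutual information. Two views that always agree on one of two equally frequent clusters give a value of log 2, while two uninformative (uniform) views give 0.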
Pages: 193-201
Page count: 9