Compressing and interpreting word embeddings with latent space regularization and interactive semantics probing

Cited by: 34
Authors
Li, Haoyu [1 ,2 ]
Wang, Junpeng [2 ]
Zheng, Yan [2 ]
Wang, Liang [2 ]
Zhang, Wei [2 ]
Shen, Han-Wei [1 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, GRAV Res Grp, 786 Dreese Lab,2015 Neil Ave, Columbus, OH 43210 USA
[2] Visa Res, Palo Alto, CA USA
Funding
U.S. National Science Foundation (NSF);
Keywords
High-dimensional data visualization; visual analytics; neural networks; word embedding; exploration;
DOI
10.1177/14738716221130338
Chinese Library Classification (CLC) code
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
Word embedding, a high-dimensional (HD) numerical representation of words generated by machine learning models, has been used for various natural language processing tasks, for example, translation between two languages. Recently, there has been an increasing trend of transforming the HD embeddings into a latent space (e.g. via autoencoders) for further tasks, exploiting the various merits that latent representations can bring. To preserve the embeddings' quality, these works often map the embeddings into an even higher-dimensional latent space, making the already complicated embeddings even less interpretable and consuming more storage space. In this work, we borrow the idea of beta-VAE to regularize the HD latent space. Our regularization implicitly condenses information from the HD latent space into a much lower-dimensional space, thus compressing the embeddings. We also show that each dimension of our regularized latent space is more semantically salient, and validate this assertion by interactively probing the encoding level of user-proposed semantics in those dimensions. To this end, we design a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret the latent dimensions' semantics. We validate the effectiveness of our embedding regularization and interpretation approach through both quantitative and qualitative evaluations.
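For readers unfamiliar with the beta-VAE idea the abstract refers to, the sketch below illustrates the standard form of such a regularized autoencoder loss in PyTorch. This is a minimal illustration only: the layer sizes, latent dimensionality, beta value, and class/function names are assumptions for exposition, not the authors' actual architecture or configuration.

```python
# Minimal sketch of a beta-VAE-style regularized autoencoder for word
# embeddings. All dimensions, layer sizes, and the beta weight below are
# illustrative assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedEmbeddingAE(nn.Module):
    def __init__(self, embed_dim=300, latent_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(embed_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)      # latent means
        self.logvar = nn.Linear(512, latent_dim)  # latent log-variances
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, embed_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample the latent code from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar, beta=4.0):
    # Reconstruction term keeps the compressed code faithful to the embedding.
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # KL term, scaled by beta, pushes uninformative latent dimensions toward
    # the prior, implicitly condensing information into a few salient ones.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

In this standard formulation, increasing beta above 1 trades reconstruction fidelity for a more disentangled, lower-effective-dimensional latent code, which is the general mechanism the abstract describes for compression and per-dimension semantic salience.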
Pages: 52-68
Page count: 17