Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Cited by: 0
Authors
Zhang, Baoquan [1]
Wang, Huaibin [1]
Luo, Chuyao [1]
Li, Xutao [1, 3]
Liang, Guotao [1, 3]
Ye, Yunming [1, 3]
Qi, Xiaochen [2]
He, Yao [2]
Affiliations
[1] Harbin Inst Technol, Shenzhen, Peoples R China
[2] ShenZhen SiFar Co Ltd, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024
DOI
10.1109/CVPR52733.2024.00741
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis that aims to represent an image as a sequence of discrete tokens. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner, and then using it to quantize continuous representations into discrete tokens. However, learning a codebook this way is highly challenging and may be a key cause of codebook collapse: because the relationships between codes and good codebook priors are ignored, some code vectors are rarely optimized and eventually die off. In this paper, inspired by pretrained language models, we observe that these models have in effect already pretrained a superior codebook on large text corpora, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which transfers a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook from these priors to achieve codebook transfer. Finally, a novel codebook transfer network is designed to exploit the abundant semantic relationships between codes contained in the pretrained codebook for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
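The quantization step and the codebook-collapse symptom described in the abstract can be illustrated with a minimal sketch. This is not the VQCT method; it only shows generic nearest-code quantization, assuming Euclidean distance as the matching rule. The function names (`quantize`, `unused_codes`) and the toy data are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Set

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest code."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def unused_codes(tokens: Sequence[int], codebook_size: int) -> Set[int]:
    """Indices of codes that no feature selected (candidates for dead codes)."""
    return set(range(codebook_size)) - set(tokens)


# Toy data (hypothetical): four 2-D encoder outputs and a three-entry codebook.
features = [(0.1, 0.0), (0.9, 1.1), (0.2, -0.1), (1.0, 0.8)]
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

tokens = quantize(features, codebook)
print(tokens)                               # [0, 1, 0, 1]
print(unused_codes(tokens, len(codebook)))  # {2}
```

In this toy case the third code is never selected, so a training procedure that updates only selected codes would never move it. That is the kind of code the abstract describes as dying off when codes are learned independently and without priors.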
Pages: 7757-7766
Page count: 10