Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Cited by: 0

Authors:

Zhang, Baoquan [1]

Wang, Huaibin [1]

Luo, Chuyao [1]

Li, Xutao [1,3]

Liang, Guotao [1,3]

Ye, Yunming [1,3]

Qi, Xiaochen [2]

He, Yao [2]

Affiliations:

[1] Harbin Institute of Technology, Shenzhen, People's Republic of China

[2] ShenZhen SiFar Co Ltd, Shenzhen, People's Republic of China

[3] Peng Cheng Laboratory, Shenzhen, People's Republic of China

Source:

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) | 2024

DOI:

10.1109/CVPR52733.2024.00741

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis that aims to represent an image as a discrete token sequence. Existing studies address this problem by learning a discrete codebook from scratch, in a code-independent manner, to quantize continuous representations into discrete tokens. However, learning a codebook this way is highly challenging and may be a key cause of codebook collapse: because each code is optimized without regard to the relationships between codes or to good codebook priors, some code vectors are rarely updated and eventually die off. In this paper, inspired by pretrained language models, we find that these models have in effect already pretrained a superior codebook on large text corpora, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which transfers a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook from these priors to achieve codebook transfer. Finally, a novel codebook transfer network is designed to exploit the abundant semantic relationships between codes contained in pretrained codebooks for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
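
The quantization step the abstract refers to can be illustrated with a minimal sketch. It is not the VQCT method; it only shows generic nearest-neighbor quantization against a fixed codebook and a simple usage ratio that exposes unused ("dead") codes. The function names `quantize` and `codebook_usage`, the toy 2-D codebook, and the feature values are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest code.

    Distance is Euclidean; ties go to the lowest index.
    """
    tokens = []
    for f in features:
        dists = [math.dist(f, c) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def codebook_usage(tokens, codebook_size):
    """Fraction of codes that were selected at least once."""
    return len(Counter(tokens)) / codebook_size


# Toy data (assumed): four 2-D codes, two of which lie far from all features.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-4.0, 3.0)]
features = [(0.1, -0.2), (0.9, 1.2), (0.4, 0.3), (1.1, 0.8)]

tokens = quantize(features, codebook)
usage = codebook_usage(tokens, len(codebook))
print(tokens, usage)
```

In this toy setup only the first two codes are ever selected, so in a training loop that updates only selected codes, the other two would receive no updates. That is the collapse behavior the abstract describes for code-independent learning.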

Pages: 7757 - 7766

Page count: 10

39 items in total

[21] Modeling Part-Of-Speech and Semantic Significance Effects on Semantic Construction During Reading
Al Madi, Naser S.
Khan, Javed I.
2019 13TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2019: 162 - 165
[22] StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model
Xu, Zipeng
Sangineto, Enver
Sebe, Nicu
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023: 7567 - 7577
[23] Image Captioning Model Using Part-of-Speech Guidance Module for Description With Diverse Vocabulary
Bae, Ju-Won
Lee, Soo-Hwan
Kim, Won-Yeol
Seong, Ju-Hyeon
Seo, Dong-Hoan
IEEE ACCESS, 2022, 10 : 45219 - 45229
[24] PARAMETER REESTIMATION IN SEMICONTINUOUS HIDDEN MARKOV MODELING OF SPEECH WITH FEEDBACK TO VECTOR QUANTIZATION CODEBOOK
HUANG, XD
JACK, MA
ARIKI, Y
ELECTRONICS LETTERS, 1988, 24 (22) : 1375 - 1377
[25] Context Dependent Word Modeling for Statistical Machine Translation Using Part-of-Speech Tags
Sarikaya, Ruhi
Deng, Yonggang
Gao, Yuqing
INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007: 2201 - 2204
[26] Leveraging topic modeling and part-of-speech tagging to support combinational creativity in requirements engineering
Tanmay Bhowmik
Nan Niu
Juha Savolainen
Anas Mahmoud
Requirements Engineering, 2015, 20 : 253 - 280
[27] Language Modeling Using Part-of-speech and Long Short-Term Memory Networks
Norouzi, Sanaz Saki
Akbari, Ahmad
Nasersharif, Babak
2019 9TH INTERNATIONAL CONFERENCE ON COMPUTER AND KNOWLEDGE ENGINEERING (ICCKE 2019), 2019: 182 - 187
[28] Leveraging topic modeling and part-of-speech tagging to support combinational creativity in requirements engineering
Bhowmik, Tanmay
Niu, Nan
Savolainen, Juha
Mahmoud, Anas
REQUIREMENTS ENGINEERING, 2015, 20 (03) : 253 - 280
[29] A Chinese text-to-speech system based on part-of-speech analysis, prosodic modeling and non-uniform units
Chou, FC
Tseng, CY
Chen, KJ
Lee, LS
1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS, 1997: 923 - 926
[30] Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes
Bond-Taylor, Sam
Hessey, Peter
Sasaki, Hiroshi
Breckon, Toby P.
Willcocks, Chris G.
COMPUTER VISION, ECCV 2022, PT XXIII, 2022, 13683 : 170 - 188