Deep Variational and Structural Hashing

Cited by: 42
Authors
Liong, Venice Erin [1]
Lu, Jiwen [2, 3]
Duan, Ling-Yu [4]
Tan, Yap-Peng [5]
Affiliations
[1] Nanyang Technol Univ, Interdisciplinary Grad Sch, Rapid Rich Object Search ROSE Lab, Singapore 639798, Singapore
[2] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing Natl Res Ctr Informat Sci & Technol BNRis, Beijing 100084, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[4] Peking Univ, Inst Digital Media, Beijing 100080, Peoples R China
[5] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Binary codes; Training; Visualization; Semantics; Quantization (signal); Probabilistic logic; Convolution; Scalable image search; fast similarity search; hashing; deep learning; cross-modal retrieval; NEAREST-NEIGHBOR; QUANTIZATION; ALGORITHMS;
DOI
10.1109/TPAMI.2018.2882816
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose a deep variational and structural hashing (DVStH) method to learn compact binary codes for multimedia retrieval. Unlike most existing deep hashing methods, which use a series of convolutional and fully-connected layers to learn binary features, we develop a probabilistic framework to infer a latent feature representation inside the network. We then design a struct layer, rather than a bottleneck hash layer, to obtain binary codes through a simple encoding procedure. In this way, we are able to obtain binary codes both discriminatively and generatively. To make the method applicable to scalable cross-modal multimedia retrieval, we extend it to cross-modal deep variational and structural hashing (CM-DVStH). We design a deep fusion network with a struct layer to maximize the correlation between image-text input pairs during the training stage, so that a unified binary vector can be obtained. We then design modality-specific hashing networks to handle the out-of-sample extension scenario. Specifically, for each modality we train a network whose output latent representation is as close as possible to the binary codes inferred from the fusion network. Experimental results on five benchmark datasets are presented to show the efficacy of the proposed approach.
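The abstract does not spell out the struct layer's encoding procedure, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the layer's output is split into equal-sized blocks and that each block is encoded as a one-hot vector at its largest activation; it then ranks database codes by Hamming distance, the standard comparison for binary codes. The names `encode_blocks`, `hamming_distance`, and `rank_by_hamming` are hypothetical.

```python
from typing import List, Sequence


def encode_blocks(activations: Sequence[float], block_size: int) -> List[int]:
    """Assumed encoding (not confirmed by the abstract): split the activations
    into equal blocks and emit a one-hot vector per block, with the 1 at the
    position of that block's largest activation."""
    if block_size <= 0 or len(activations) % block_size != 0:
        raise ValueError("activations must split evenly into blocks")
    code: List[int] = []
    for start in range(0, len(activations), block_size):
        block = activations[start:start + block_size]
        # Ties resolve to the first maximal position.
        winner = max(range(block_size), key=lambda i: block[i])
        code.extend(1 if i == winner else 0 for i in range(block_size))
    return code


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query: Sequence[int],
                    database: Sequence[Sequence[int]]) -> List[int]:
    """Indices of database codes, sorted from nearest to farthest."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Hypothetical activations: two blocks of three units each.
query_code = encode_blocks([0.1, 0.7, 0.2, 0.9, 0.05, 0.05], block_size=3)
database_codes = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
]
ranking = rank_by_hamming(query_code, database_codes)
```

In a cross-modal setting of the kind the abstract describes, the query code would come from one modality's network and the database codes from the other; the ranking step itself is unchanged.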
Pages: 580-595
Page count: 16