Deep Learning-Based Image Retrieval System with Clustering on Attention-Based Representations

Cited by: 0
Authors
Rao S.S. [1 ]
Ikram S. [1 ]
Ramesh P. [1 ]
Affiliations
[1] Department of Computer Science, PES University, Bangalore, Karnataka
Keywords
FastText; Inception; Multimodal learning; Self-attention
DOI
10.1007/s42979-021-00563-2
Abstract
In the modern era of digital photography and smartphones, millions of images are generated every day, representing precious moments and events of our lives. As we keep adding images to our digital storehouse, managing and accessing them becomes a daunting task, and we lose track of them unless they are properly organized. We are in essential need of a tool that can fetch images based on a word or a description. In this paper, we build a solution that retrieves relevant images from a pool based on a description, by looking at the content of each image. The model is based on a deep neural network architecture and attends to the relevant parts of the image. The algorithm takes a sentence or word as input and returns the top images relevant to that caption. We obtain representations of the sentence and the image in a higher-dimensional space, which lets us compare the two and measure their similarity to decide on relevance. We have conducted various experiments to improve the representations of the image and the caption in the latent space for better correlation, e.g., using bidirectional sequence models for better textual representation and various baseline convolutional stacks for better image representation. We have also incorporated the self-attention mechanism to focus on only the relevant parts of the image and the sentence, thereby strengthening the correlation between the two spaces. © 2021, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
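The abstract describes a joint-embedding retrieval pipeline: encode the caption and each image into the same latent space, then rank images by similarity to the query. The sketch below illustrates that idea only; it is not the authors' implementation. The encoders are tiny, randomly initialised placeholders (the paper's keywords point to Inception-based image features, FastText word vectors, and a bidirectional sequence model), and every name and dimension here is an illustrative assumption.

# Minimal sketch of caption-to-image retrieval in a shared latent space.
# Placeholder encoders stand in for the Inception / FastText / BiGRU stacks.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64   # joint latent-space dimension (illustrative)
WORD_DIM = 32    # stand-in for the FastText word-vector dimension
VOCAB = 1000     # hypothetical vocabulary size

class CaptionEncoder(nn.Module):
    """Bidirectional GRU over word vectors, projected to the joint space."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, WORD_DIM)  # placeholder for FastText vectors
        self.rnn = nn.GRU(WORD_DIM, EMBED_DIM, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * EMBED_DIM, EMBED_DIM)

    def forward(self, token_ids):
        out, _ = self.rnn(self.embed(token_ids))
        pooled = out.mean(dim=1)                       # mean-pool over time steps
        return F.normalize(self.proj(pooled), dim=-1)  # unit-length embedding

class ImageEncoder(nn.Module):
    """Tiny CNN standing in for an Inception-style convolutional stack."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, EMBED_DIM)

    def forward(self, images):
        feats = self.conv(images).flatten(1)           # (batch, 32) pooled features
        return F.normalize(self.proj(feats), dim=-1)   # unit-length embedding

# Rank a pool of candidate images against one query caption.
torch.manual_seed(0)
caption_enc, image_enc = CaptionEncoder(), ImageEncoder()
query = caption_enc(torch.randint(0, VOCAB, (1, 8)))   # one 8-token caption
pool = image_enc(torch.randn(10, 3, 64, 64))           # ten candidate images
scores = pool @ query.squeeze(0)                       # cosine similarity (unit vectors)
print("top-3 image indices:", scores.topk(3).indices.tolist())

Because both embeddings are L2-normalised, the dot product equals cosine similarity, so ranking the whole pool reduces to a single matrix-vector product followed by a top-k selection.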