Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks

Cited by: 0

Authors:

Chemmanam, Ajai John [1]

Jose, Bijoy A. [1]

Moopan, Asif [2]

Affiliations:

[1] Cochin Univ Sci & Technol, CPS Lab, Dept Elect, Cochin, Kerala, India

[2] Vuelogix Technol Pvt Ltd, Kochi, Kerala, India

Source:

2024 IEEE 7TH INTERNATIONAL CONFERENCE ON INDUSTRIAL CYBER-PHYSICAL SYSTEMS, ICPS 2024 | 2024

Keywords:

Multi-modal learning; Cross-modal retrieval; Encoder-decoder architectures; Computer Vision; Natural Language Processing;

DOI:

10.1109/ICPS59941.2024.10639946

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Advances in machine learning and neural networks have transformed natural language processing (NLP) and computer vision (CV) applications, and recent research has begun to bridge the gap between the two domains. In this work, we propose a semi-supervised Multi-Modal Encoder Decoder Network (MMEDN) that captures the relationship between images and textual descriptions, allowing us to generate meaningful descriptions of images and to retrieve images from a database using cross-modality search. The semi-supervised training approach combines ground-truth text descriptions with pseudo-text generated by the model's own text decoder; it requires far fewer image-text pairs in the training data and allows new raw images to be added for training without manual text labelling. This approach is particularly useful in active learning environments, where labels are expensive and hard to obtain. Qualitative evaluations show that our model performs well. We applied the model to finding images of a person in large databases and to generating descriptions of the people involved in an event for inclusion in an automatically generated report. The model retrieved relevant images and generated accurate descriptions, demonstrating its applicability to practical use cases.
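
The abstract does not say how the cross-modality search is carried out. As a minimal sketch only, assuming the image and text encoders map into one shared vector space and that retrieval ranks database images by cosine similarity to the embedded text query (a common approach for joint embeddings, not confirmed here as the authors' method), the lookup could work as follows. The function names, the image identifiers, and the hand-written three-dimensional vectors are all hypothetical stand-ins for real encoder outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def retrieve_images(
    query_embedding: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: in a real system these vectors would come from
# the image encoder and the text encoder, not be written by hand.
IMAGE_DB = {
    "img_red_jacket": [0.9, 0.1, 0.0],
    "img_blue_shirt": [0.1, 0.9, 0.1],
    "img_backpack": [0.2, 0.2, 0.9],
}
TEXT_QUERY = [0.8, 0.2, 0.1]  # stand-in for an embedded text description

RESULTS = retrieve_images(TEXT_QUERY, IMAGE_DB, top_k=2)
for image_id, score in RESULTS:
    print(f"{image_id}: {score:.3f}")
```

With these toy vectors the query lies closest to `img_red_jacket`, so that image is ranked first. For a large database, the exhaustive scan would typically be replaced by an approximate nearest-neighbour index, which is outside the scope of this sketch.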


Pages: 6
