Architectural Synergies in Bi-Modal and Bi-Contrastive Learning

Cited by: 0

Authors:

Gu, Yujia [1]

Liu, Brian [2]

Zhang, Tianlong [3]

Sha, Xinye [4]

Chen, Shiyong [5]

Affiliations:

[1] Calif State Univ Long Beach, Long Beach, CA 90840 USA

[2] Stuyvesant High Sch, New York, NY 10282 USA

[3] Univ Pittsburgh, Pittsburgh, PA 15260 USA

[4] Columbia Univ, New York, NY 10027 USA

[5] Beihang Univ, Beijing 100191, Peoples R China

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Decoding; Visualization; Text to image; Image synthesis; Training; Image coding; Transformers; Linguistics; Multimodal; domain adaption; visual-linguistic model;

DOI:

10.1109/ACCESS.2024.3457586

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The integration of visual and linguistic elements within artificial intelligence research is increasingly emphasized, spurred by advancements in pre-trained model technologies. Traditionally, such models have been developed independently, using methods like contrastive learning and image-captioning to boost their analytical and creative outputs. This paper introduces an innovative architecture known as the Zero-shot Unified Image-Text (ZsU-IT) framework, which synthesizes pre-training objectives into a cohesive Unicode-decoder structure. The ZsU-IT is intricately designed with distinct components for image and text processing, coupled with a bi-modal decoder, which seamlessly manages both encoding and decoding tasks across various functions. This dual functionality promotes an effective knowledge transfer between the visual and linguistic modalities, thereby enhancing the system's adaptability and efficiency in tasks like image-to-text translation and vice versa. Rigorous empirical studies reveal that ZsU-IT outstrips prevailing models across multiple applications, including image and text retrieval, image captioning, Visual Question Answering (VQA), and Stanford Natural Language Inference - Visual Entailment (SNLI-VE). This is particularly notable in complex settings involving sophisticated datasets such as medical texts and CT images. In zero-shot environments, ZsU-IT excels, displaying exceptional generalization capabilities. This prowess is highlighted by its significant achievements. The ZsU-IT framework not only sets a new benchmark in the fusion of vision and language technologies but also fosters novel opportunities for both ongoing research and practical implementations. This advancement marks a crucial step forward in the application of integrated multimodal data for complex problem-solving within the artificial intelligence landscape, paving the way for future breakthroughs.

Pages: 187128 - 187140

Page count: 13
