Architectural Synergies in Bi-Modal and Bi-Contrastive Learning

Cited by: 0
Authors
Gu, Yujia [1]
Liu, Brian [2]
Zhang, Tianlong [3]
Sha, Xinye [4]
Chen, Shiyong [5]
Affiliations
[1] Calif State Univ Long Beach, Long Beach, CA 90840 USA
[2] Stuyvesant High Sch, New York, NY 10282 USA
[3] Univ Pittsburgh, Pittsburgh, PA 15260 USA
[4] Columbia Univ, New York, NY 10027 USA
[5] Beihang Univ, Beijing 100191, Peoples R China
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Decoding; Visualization; Text to image; Image synthesis; Training; Image coding; Transformers; Linguistics; Multimodal; domain adaptation; visual-linguistic model
DOI
10.1109/ACCESS.2024.3457586
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Research on integrating visual and linguistic information in artificial intelligence has intensified, spurred by advances in pre-trained models. Traditionally, such models have been developed independently, using methods such as contrastive learning and image captioning to improve their analytical and generative outputs. This paper introduces the Zero-shot Unified Image-Text (ZsU-IT) framework, which unifies these pre-training objectives in a single encoder-decoder structure. ZsU-IT has distinct components for image and text processing, coupled with a bi-modal decoder that handles both encoding and decoding across tasks. This dual functionality promotes knowledge transfer between the visual and linguistic modalities, improving the system's adaptability and efficiency in tasks such as image-to-text and text-to-image translation. Empirical studies show that ZsU-IT outperforms prevailing models across multiple applications, including image and text retrieval, image captioning, Visual Question Answering (VQA), and Stanford Natural Language Inference - Visual Entailment (SNLI-VE). The gains are particularly notable in complex settings involving specialized datasets such as medical texts and CT images. ZsU-IT also generalizes well in zero-shot settings. The framework sets a new benchmark in the fusion of vision and language technologies and opens opportunities for both ongoing research and practical implementations, marking a step forward in applying integrated multimodal data to complex problem-solving in artificial intelligence.
Pages: 187128-187140
Page count: 13