Inter-Intra Modal Representation Augmentation With DCT-Transformer Adversarial Network for Image-Text Matching

Cited by: 5
Authors
Chen, Chen [1 ]
Wang, Dan [1 ]
Song, Bin [1 ]
Tan, Hao [2 ]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[2] Guangdong OPPO Mobile Telecommun Corp, Software Engn Div, Dongguan 523860, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Transformers; Discrete cosine transforms; Background noise; Task analysis; Semantics; Visualization; Image-text matching; data augmentation; adversarial learning; transformer; DCT;
DOI
10.1109/TMM.2023.3243665
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Image-text matching has become a challenging task in the multimedia analysis field. Many advanced methods explore local and global cross-modal correspondence for matching. However, most methods ignore the importance of eliminating potentially irrelevant features from the original features of each modality and from the cross-modal common feature. Moreover, the features extracted from regions in images and words in sentences contain cluttered background noise and varying occlusion noise, which negatively affect alignment. Unlike these methods, we propose a novel DCT-Transformer Adversarial Network (DTAN) for image-text matching in this paper. This work obtains an effective metric based on two aspects: i) the DCT-Transformer applies the Discrete Cosine Transform (DCT) within a transformer mechanism to extract multi-domain common representations and eliminate irrelevant features across modalities (inter-modal); specifically, the DCT divides multi-modal content into chunks of different frequencies and quantizes them. ii) The adversarial network introduces an adversarial idea by combining the original features of each single modality with the multi-domain common representation, alleviating the background noise within each modality (intra-modal). The proposed adversarial feature augmentation method readily obtains a common representation that is useful only for alignment. Extensive experiments on the benchmark datasets Flickr30K and MS-COCO demonstrate the superiority of the DTAN model over state-of-the-art methods.
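The abstract's first component rests on the classic DCT idea of splitting a signal into frequency components so that unwanted ones can be suppressed. As a minimal illustrative sketch only (not the paper's DCT-Transformer; the chunk-wise filtering, function names, and `chunk`/`keep` parameters are assumptions for illustration), the following applies an unnormalized DCT-II to fixed-size chunks of a feature vector, zeroes the high-frequency coefficients, and inverts the transform:

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a real sequence (Ahmed, Natarajan, Rao, 1974)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def idct2(c):
    """Inverse of dct2 above (DCT-III with the matching scaling)."""
    n = len(c)
    return [c[0] / n + (2.0 / n) * sum(c[k] * math.cos(math.pi / n * (i + 0.5) * k)
                                       for k in range(1, n))
            for i in range(n)]

def frequency_filter(features, chunk, keep):
    """Split a feature vector into chunks, DCT each chunk, keep only the
    'keep' lowest-frequency coefficients, and transform back.  Discarding
    high-frequency coefficients is one simple way to suppress noisy,
    rapidly varying components of a feature vector."""
    out = []
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        coeffs = dct2(block)
        coeffs = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
        out.extend(idct2(coeffs))
    return out
```

For a constant block, all energy lands in the zero-frequency coefficient, so low-pass filtering leaves it unchanged; with `keep` equal to the chunk length the round-trip reproduces the input exactly (up to floating-point error).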
Pages: 8933-8945
Page count: 13