Cross-modal knowledge learning with scene text for fine-grained image classification

Cited: 0
Authors
Xiong, Li [1 ,2 ]
Mao, Yingchi [1 ,2 ,5 ]
Wang, Zicheng [1 ,3 ]
Nie, Bingbing [4 ]
Li, Chang [1 ,2 ]
Affiliations
[1] Hohai Univ, Sch Comp & Informat, Nanjing, Peoples R China
[2] Hohai Univ, Minist Water Resources, Key Lab Water Big Data Technol, Nanjing, Peoples R China
[3] Power China Kunming Engn Corp Ltd, Kunming, Yunnan, Peoples R China
[4] Huaneng Lancang River Hydropower Corp Ltd, Kunming, Yunnan, Peoples R China
[5] Hohai Univ, Sch Comp & Informat, Nanjing 210098, Peoples R China
Keywords
feature extraction; image classification
DOI
10.1049/ipr2.13039
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Code
081104; 0812; 0835; 1405
Abstract
Scene text in natural images carries additional semantic information that can aid image classification. Existing methods extract scene text based on simple rules or dictionaries; they lack a deep understanding of the text and of the visual-text relationship, which makes it difficult to judge the semantic accuracy of the text and its relevance to the visual content. This paper proposes an image classification method based on Cross-modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross-modal scene text recognition, text semantic enhancement, and visual-text feature alignment. In the first stage, multi-attention is used to extract features layer by layer, and a self-mask-based iterative correction strategy is applied to improve scene text recognition accuracy. In the second stage, knowledge features are extracted from external knowledge and fused with the text features to enhance their semantic information. In the third stage, CKLST aligns visual and text features through a cross-attention mechanism with a similarity matrix, so the correlation between images and text can be captured to improve classification accuracy. On the Con-Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine-grained image classification, improving mAP over the best baseline by 3.54%, 5.37%, 3.28%, and 2.81%, respectively.
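As a rough illustration of the third stage (visual-text feature alignment via a similarity matrix and cross-attention), the following PyTorch sketch aligns image region features with knowledge-enhanced scene-text token features before classification. All class names, dimensions, and the fusion strategy are assumptions made for illustration; they are not taken from the authors' implementation.

# Minimal sketch of a similarity-matrix / cross-attention alignment stage,
# assuming region features from a visual backbone and token features from a
# knowledge-enhanced text encoder. Illustrative only, not the CKLST code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAlignment(nn.Module):
    """Aligns image regions with scene-text tokens via a similarity matrix,
    applies cross-attention, and fuses both modalities for classification."""

    def __init__(self, dim: int = 512, num_classes: int = 28):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)   # project visual region features
        self.txt_proj = nn.Linear(dim, dim)   # project text token features
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; txt_feats: (B, T, D) token features
        q = self.img_proj(img_feats)                                   # (B, R, D)
        k = self.txt_proj(txt_feats)                                   # (B, T, D)

        # Similarity matrix between every image region and every text token.
        sim = torch.matmul(q, k.transpose(1, 2)) / q.size(-1) ** 0.5   # (B, R, T)

        # Cross-attention: each region attends to the text tokens it matches best.
        attn = F.softmax(sim, dim=-1)                                  # (B, R, T)
        txt_aligned = torch.matmul(attn, txt_feats)                    # (B, R, D)

        # Pool over regions and fuse the two modalities for classification.
        fused = torch.cat([img_feats.mean(dim=1), txt_aligned.mean(dim=1)], dim=-1)
        return self.classifier(fused)                                  # (B, num_classes)

if __name__ == "__main__":
    model = VisualTextAlignment(dim=512, num_classes=28)
    regions = torch.randn(2, 49, 512)   # e.g. a 7x7 CNN feature map, flattened
    tokens = torch.randn(2, 16, 512)    # knowledge-enhanced scene-text embeddings
    print(model(regions, tokens).shape) # torch.Size([2, 28])

The similarity matrix doubles as an interpretability signal: inspecting the attention weights shows which text tokens each image region relied on for the final prediction.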
Pages: 1447-1459
Page count: 13
Related Papers
50 records in total
  • [31] Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment
    Jin, Seungwan
    Choi, Hoyoung
    Noh, Taehyung
    Han, Kyungsik
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 53 - 70
  • [32] VIDEO-MUSIC RETRIEVAL WITH FINE-GRAINED CROSS-MODAL ALIGNMENT
    Era, Yuki
    Togo, Ren
    Maeda, Keisuke
    Ogawa, Takahiro
    Haseyama, Miki
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2005 - 2009
  • [33] Fine-grained and coarse-grained contrastive learning for text classification
    Zhang, Shaokang
    Ran, Ning
    NEUROCOMPUTING, 2024, 596
  • [34] Multispectral Scene Classification via Cross-Modal Knowledge Distillation
    Liu, Hao
    Qu, Ying
    Zhang, Liqiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [35] Social Image-Text Sentiment Classification With Cross-Modal Consistency and Knowledge Distillation
    Liu, Huan
    Li, Ke
    Fan, Jianping
    Yan, Caixia
    Qin, Tao
    Zheng, Qinghua
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (04) : 3332 - 3344
  • [36] Fine-Grained Image Generation Network With Radar Range Profiles Using Cross-Modal Visual Supervision
    Bao, Jiacheng
    Li, Da
    Li, Shiyong
    Zhao, Guoqiang
    Sun, Houjun
    Zhang, Yi
    IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES, 2024, 72 (02) : 1339 - 1352
  • [37] Learning Cascade Attention for fine-grained image classification
    Zhu, Youxiang
    Li, Ruochen
    Yang, Yin
    Ye, Ning
    NEURAL NETWORKS, 2020, 122 : 174 - 182
  • [38] DEEP DICTIONARY LEARNING FOR FINE-GRAINED IMAGE CLASSIFICATION
    Srinivas, M.
    Lin, Yen-Yu
    Liao, Hong-Yuan Mark
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 835 - 839
  • [39] Fine-grained similarity semantic preserving deep hashing for cross-modal retrieval
    Li, Guoyou
    Peng, Qingjun
    Zou, Dexu
    Yang, Jinyue
    Shu, Zhenqiu
    FRONTIERS IN PHYSICS, 2023, 11
  • [40] Deep Multiscale Fine-Grained Hashing for Remote Sensing Cross-Modal Retrieval
    Huang, Jiaxiang
    Feng, Yong
    Zhou, Mingliang
    Xiong, Xiancai
    Wang, Yongheng
    Qiang, Baohua
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5