Cross-modal knowledge learning with scene text for fine-grained image classification

Cited: 0
Authors
Xiong, Li [1 ,2 ]
Mao, Yingchi [1 ,2 ,5 ]
Wang, Zicheng [1 ,3 ]
Nie, Bingbing [4 ]
Li, Chang [1 ,2 ]
Affiliations
[1] Hohai Univ, Sch Comp & Informat, Nanjing, Peoples R China
[2] Hohai Univ, Minist Water Resources, Key Lab Water Big Data Technol, Nanjing, Peoples R China
[3] Power China Kunming Engn Corp Ltd, Kunming, Yunnan, Peoples R China
[4] Huaneng Lancang River Hydropower Corp Ltd, Kunming, Yunnan, Peoples R China
[5] Hohai Univ, Sch Comp & Informat, Nanjing 210098, Peoples R China
Keywords
feature extraction; image classification;
DOI
10.1049/ipr2.13039
CLC number
TP18 [theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Scene text in natural images carries additional semantic information that can aid image classification. Existing methods extract scene text with simple rules or dictionaries and lack a deep understanding of the text and of the visual-text relationship, which makes it difficult to judge the semantic accuracy of the recognized text and its relevance to the image, so they perform poorly on classification tasks. To address these problems, this paper proposes an image classification method based on Cross-modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross-modal scene text recognition, text semantic enhancement, and visual-text feature alignment. In the first stage, multi-attention is used to extract features layer by layer, and a self-mask-based iterative correction strategy is applied to improve scene text recognition accuracy. In the second stage, knowledge features are extracted from external knowledge and fused with the text features to enrich their semantics. In the third stage, CKLST aligns visual and text features across attention mechanisms with a similarity matrix, so that the correlation between images and text can be captured to improve classification accuracy. On the Con-Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine-grained image classification, improving mAP over the best baseline by 3.54%, 5.37%, 3.28%, and 2.81%, respectively.
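The abstract describes the third stage only at a high level. As a rough, hypothetical sketch of what similarity-matrix-based visual-text alignment can look like in practice (not the authors' implementation; the module name, dimensions, pooling, and classifier head below are all assumptions), a minimal PyTorch version might be:

# Illustrative sketch only: a minimal cross-attention alignment module built
# around a visual-text similarity matrix, loosely following the description of
# CKLST's third stage. All names and hyperparameters here are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class VisualTextAlignment(nn.Module):
    """Aligns visual region features with text token features through a
    similarity matrix and bidirectional cross-attention, then pools a
    fused vector for classification (hypothetical simplification)."""

    def __init__(self, visual_dim: int, text_dim: int, d_model: int = 256, num_classes: int = 28):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)  # project image region features
        self.text_proj = nn.Linear(text_dim, d_model)      # project text token features
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, R, visual_dim) image regions; text_feats: (B, T, text_dim) text tokens
        v = self.visual_proj(visual_feats)  # (B, R, d)
        t = self.text_proj(text_feats)      # (B, T, d)

        # Scaled similarity matrix between every region and every text token.
        sim = torch.einsum("brd,btd->brt", v, t) / v.size(-1) ** 0.5  # (B, R, T)

        # Cross-attention in both directions: text-aware visual features and
        # vision-aware text features.
        text_for_vision = torch.softmax(sim, dim=-1) @ t                  # (B, R, d)
        vision_for_text = torch.softmax(sim.transpose(1, 2), dim=-1) @ v  # (B, T, d)

        # Pool and fuse the two aligned streams, then classify.
        fused = torch.cat([text_for_vision.mean(dim=1), vision_for_text.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # (B, num_classes) logits


if __name__ == "__main__":
    # Toy usage with random tensors standing in for CNN region features and
    # OCR/text-encoder token features; num_classes=28 is arbitrary for the demo.
    model = VisualTextAlignment(visual_dim=2048, text_dim=768, num_classes=28)
    logits = model(torch.randn(2, 49, 2048), torch.randn(2, 16, 768))
    print(logits.shape)  # torch.Size([2, 28])

The bidirectional softmax over the similarity matrix is one common way to let each image region attend to relevant text tokens and vice versa; the actual CKLST alignment may differ in its attention design and fusion details.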
Pages: 1447-1459
Page count: 13
Related papers
50 records in total
  • [1] Fine-grained Feature Assisted Cross-modal Image-text Retrieval
    Bu, Chaofei
    Liu, Xueliang
    Huang, Zhen
    Su, Yuling
    Tu, Junfeng
    Hong, Richang
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 306 - 320
  • [2] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval
    Liu, Hui
    Lv, Gang
    Gu, Yanhong
    Nian, Fudong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 : 298 - 310
  • [3] CROSS-MODAL KNOWLEDGE DISTILLATION FOR FINE-GRAINED ONE-SHOT CLASSIFICATION
    Zhao, Jiabao
    Lin, Xin
    Yang, Yifan
    Yang, Jing
    He, Liang
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4295 - 4299
  • [4] Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
    Pan, Zhengxin
    Wu, Fangyu
    Zhang, Bailing
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19275 - 19284
  • [5] Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven
    Miao, Chunyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5517 - 5526
  • [6] Fine-Grained Cross-Modal Fusion Based Refinement for Text-to-Image Synthesis
    Sun, Haoran
    Wang, Yang
    Liu, Haipeng
    Qian, Biao
    CHINESE JOURNAL OF ELECTRONICS, 2023, 32 (06) : 1329 - 1340
  • [7] Cross-modal subspace learning for fine-grained sketch-based image retrieval
    Xu, Peng
    Yin, Qiyue
    Huang, Yongye
    Song, Yi-Zhe
    Ma, Zhanyu
    Wang, Liang
    Xiang, Tao
    Kleijn, W. Bastiaan
    Guo, Jun
    NEUROCOMPUTING, 2018, 278 : 75 - 86
  • [8] Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning
    Zhang, Bolin
    Kyutoku, Haruya
    Doman, Keisuke
    Komamizu, Takahiro
    Ide, Ichiro
    Qian, Jiangbo
    KNOWLEDGE-BASED SYSTEMS, 2024, 305
  • [9] Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
    Bai, Xiang
    Yang, Mingkun
    Lyu, Pengyuan
    Xu, Yongchao
    Luo, Jiebo
    IEEE ACCESS, 2018, 6 : 66322 - 66335