Cross-modal knowledge learning with scene text for fine-grained image classification

Cited: 0
Authors
Xiong, Li [1 ,2 ]
Mao, Yingchi [1 ,2 ,5 ]
Wang, Zicheng [1 ,3 ]
Nie, Bingbing [4 ]
Li, Chang [1 ,2 ]
Affiliations
[1] Hohai Univ, Sch Comp & Informat, Nanjing, Peoples R China
[2] Hohai Univ, Minist Water Resources, Key Lab Water Big Data Technol, Nanjing, Peoples R China
[3] Power China Kunming Engn Corp Ltd, Kunming, Yunnan, Peoples R China
[4] Huaneng Lancang River Hydropower Corp Ltd, Kunming, Yunnan, Peoples R China
[5] Hohai Univ, Sch Comp & Informat, Nanjing 210098, Peoples R China
Keywords
feature extraction; image classification;
DOI
10.1049/ipr2.13039
CLC classification code
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Scene text in natural images carries additional semantic information that can aid image classification. Existing methods lack a deep understanding of the text and of the visual-text relationship, which makes it difficult to judge the semantic accuracy of the recognized text and its relevance to the visual content. This paper proposes an image classification method based on Cross-modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross-modal scene text recognition, text semantic enhancement, and visual-text feature alignment. In the first stage, multi-attention is used to extract features layer by layer, and a self-mask-based iterative correction strategy improves scene text recognition accuracy. In the second stage, knowledge features are extracted from external knowledge and fused with text features to enhance the text's semantic information. In the third stage, CKLST aligns visual and text features through cross-attention mechanisms with a similarity matrix, so that the correlation between images and text can be captured to improve classification accuracy. On the Con-Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine-grained image classification, with improvements of 3.54%, 5.37%, 3.28%, and 2.81% in mAP over the best baseline, respectively.
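The third-stage visual-text alignment described in the abstract can be sketched as scaled dot-product cross-attention over a visual-text similarity matrix. This is a minimal illustrative reconstruction, not the paper's exact formulation: the function name, feature shapes, and scaling choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_text_alignment(V, T):
    """Hypothetical cross-attention alignment of visual and text features.

    V: (n_regions, d) visual region features
    T: (m_tokens, d)  knowledge-enhanced text token features
    Returns text-attended visual features and the similarity matrix.
    """
    d = V.shape[1]
    S = V @ T.T / np.sqrt(d)   # similarity matrix, shape (n_regions, m_tokens)
    A = softmax(S, axis=1)     # each region attends over all text tokens
    V_aligned = A @ T          # text-aware visual features, shape (n_regions, d)
    return V_aligned, S

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))    # 4 image regions, 8-dim features
T = rng.normal(size=(6, 8))    # 6 text tokens, 8-dim features
V_aligned, S = visual_text_alignment(V, T)
print(V_aligned.shape, S.shape)  # (4, 8) (4, 6)
```

The aligned features could then be concatenated or pooled with the original visual features before the classification head; the record gives no detail on that fusion step.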
Pages: 1447-1459
Page count: 13
Related Papers
50 records total
  • [21] Cross-Part Learning for Fine-Grained Image Classification
    Liu, Man
    Zhang, Chunjie
    Bai, Huihui
    Zhang, Riquan
    Zhao, Yao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 748 - 758
  • [22] Multi-modal Knowledge-Enhanced Fine-Grained Image Classification
    Cheng, Suyan
    Zhang, Feifei
    Zhou, Haoliang
    Xu, Changsheng
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 333 - 346
  • [23] Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
    Yuan, Zhiqiang
    Zhang, Wenkai
    Fu, Kun
    Li, Xuan
    Deng, Chubo
    Wang, Hongqi
    Sun, Xian
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [24] Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval
    Qingrong Cheng
    Xiaodong Gu
    Multimedia Tools and Applications, 2020, 79 : 31401 - 31428
  • [26] Fine-Grained Label Learning via Siamese Network for Cross-modal Information Retrieval
    Xu, Yiming
    Yu, Jing
    Guo, Jingjing
    Hu, Yue
    Tan, Jianlong
    COMPUTATIONAL SCIENCE - ICCS 2019, PT II, 2019, 11537 : 304 - 317
  • [27] Fine-grained bidirectional attentional generation and knowledge-assisted networks for cross-modal retrieval
    Zhu, Jianwei
    Li, Zhixin
    Wei, Jiahui
    Zeng, Yufei
    Ma, Huifang
    IMAGE AND VISION COMPUTING, 2022, 124
  • [29] Multi-label adversarial fine-grained cross-modal retrieval
    Sun, Chunpu
    Zhang, Huaxiang
    Liu, Li
    Liu, Dongmei
    Wang, Lin
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 117
  • [30] Cross-Modal Fine-Grained Interaction Fusion in Fake News Detection
    Che, Zhanbin
    Cui, GuangBo
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (05) : 945 - 956