Cross-modal knowledge learning with scene text for fine-grained image classification

Cited by: 0
Authors
Xiong, Li [1, 2]
Mao, Yingchi [1, 2, 5]
Wang, Zicheng [1, 3]
Nie, Bingbing [4]
Li, Chang [1, 2]
Affiliations
[1] Hohai Univ, Sch Comp & Informat, Nanjing, Peoples R China
[2] Hohai Univ, Minist Water Resources, Key Lab Water Big Data Technol, Nanjing, Peoples R China
[3] Power China Kunming Engn Corp Ltd, Kunming, Yunnan, Peoples R China
[4] Huaneng Lancang River Hydropower Corp Ltd, Kunming, Yunnan, Peoples R China
[5] Hohai Univ, Sch Comp & Informat, Nanjing 210098, Peoples R China
Keywords
feature extraction; image classification
DOI
10.1049/ipr2.13039
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Scene text in natural images carries additional semantic information that can aid image classification. Existing methods extract scene text based on simple rules or dictionaries. They do not fully consider a deep understanding of the text or of the visual-text relationship, which makes it difficult to judge the semantic accuracy of the text and its relevance to the visual content, so they perform poorly on image classification tasks. To address these problems, this paper proposes an image classification method based on Cross-modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross-modal scene text recognition, text semantic enhancement, and visual-text feature alignment. In the first stage, multi-attention is used to extract features layer by layer, and a self-mask-based iterative correction strategy is used to improve scene text recognition accuracy. In the second stage, knowledge features are extracted using external knowledge and fused with text features to enhance the text semantic information. In the third stage, CKLST aligns visual and text features across attention mechanisms with a similarity matrix, so that the correlation between images and text can be captured to improve image classification accuracy. On the Con-Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than the other baselines on fine-grained image classification, improving mAP over the best baseline by 3.54%, 5.37%, 3.28%, and 2.81%, respectively.
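As an illustration of the third stage only, the following is a minimal sketch of similarity-matrix-based alignment between visual and text features. It is not the CKLST implementation: the abstract does not specify the similarity measure or the attention form, so cosine similarity, softmax weighting, the function names, and the toy feature vectors are all assumptions made for this example.

```python
import math


def cosine_similarity_matrix(visual, text):
    """Return S where S[i][j] is the cosine similarity of visual[i] and text[j]."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [
        [sum(a * b for a, b in zip(v, t)) / (norm(v) * norm(t)) for t in text]
        for v in visual
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def align_text_to_visual(visual, text):
    """For each visual feature, build a similarity-weighted mix of text features."""
    sim = cosine_similarity_matrix(visual, text)
    weights = [softmax(row) for row in sim]  # attention over text tokens
    dim = len(text[0])
    aligned = [
        [sum(w * t[d] for w, t in zip(row, text)) for d in range(dim)]
        for row in weights
    ]
    return sim, weights, aligned


# Hypothetical toy features: 2 visual regions and 3 text tokens, both 2-D.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim, weights, aligned = align_text_to_visual(visual, text)
```

In this sketch each row of `weights` sums to 1, so every visual region receives a convex combination of the text features, weighted toward the tokens it is most similar to. A trained model would typically apply learned projections to both modalities before computing the similarity matrix.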
Pages: 1447-1459
Page count: 13
Related papers
50 in total
  • [41] Fine-grained sentiment Feature Extraction Method for Cross-modal Sentiment Analysis
    Sun, Ye
    Jin, Guozhe
    Zhao, Yahui
    Cui, Rongyi
    2024 16TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, ICMLC 2024, 2024: 602-608
  • [42] Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval
    Lu, Yuhang
    Yu, Jing
    Liu, Yanbing
    Tan, Jianlong
    Guo, Li
    Zhang, Weifeng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2018), PT I, 2018, 11061: 213-225
  • [43] Cross-modal distillation with audio-text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0
    Kim, Donghwa
    Kang, Pilsung
    NEUROCOMPUTING, 2022, 506: 168-183
  • [44] Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval
    He, Yi
    Liu, Xin
    Cheung, Yiu-ming
    Peng, Shu-Juan
    Yi, Jinhan
    Fan, Wentao
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021: 1865-1869
  • [45] Text-guided Attention Mechanism Fine-grained Image Classification
    Yang, Xinglin
    Pan, Heng
    2022 THE 6TH INTERNATIONAL CONFERENCE ON VIRTUAL AND AUGMENTED REALITY SIMULATIONS, ICVARS 2022, 2022: 45-49
  • [46] Fine-grained Pseudo Labels for Scene Text Recognition
    Li, Xiaoyu
    Chen, Xiaoxue
    Huang, Zuming
    Xie, Lele
    Chen, Jingdong
    Yang, Ming
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 5786-5795
  • [47] Fine-Grained Language Identification in Scene Text Images
    Li, Yongrui
    Wu, Shilian
    Yu, Jun
    Wang, Zengfu
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 4573-4581
  • [48] ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval
    Li M.
    Li Q.
    Jiang Z.
    Ma Y.
    Computer Systems Science and Engineering, 2023, 46(02): 1401-1414
  • [49] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019: 2309-2312
  • [50] A fine-grained approach to scene text script identification
    Gomez, Lluis
    Karatzas, Dimosthenis
    PROCEEDINGS OF 12TH IAPR WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS, (DAS 2016), 2016: 192-197