Multidomain feature fusion method for small object classification: MDFF

Cited by: 0

Authors:

Hu, Jing [1]

Shi, Zican [2]

Zhang, Zheng [2]

Lv, Siqi [2]

Chen, Yifan [2]

Ouyang, Yan [3]

He, Jia [4]

Affiliations:

[1] Natl Key Lab Sci & Technol Multi Spectral Informa, Wuhan, Peoples R China

[2] Huazhong Univ Sci & Technol, Wuhan, Peoples R China

[3] Air Force Early Warning Acad, Wuhan, Peoples R China

[4] Chinese Peoples Liberat Army, Troops 95841, Peoples R China

Source:

JOURNAL OF ELECTRONIC IMAGING | 2023 / Vol. 32 / No. 04

Keywords:

ConvMixer; vision transformer; small object classification; frequency domain

DOI:

10.1117/1.JEI.32.4.043009

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Discipline classification code:

0808; 0809

Abstract:

Classifying small objects remains challenging for current deep learning classification models such as convolutional neural networks (CNNs) and vision transformers (ViTs). We believe that these algorithms are not designed specifically for small targets, so their feature extraction abilities for small targets are insufficient. To improve the small object classification capabilities of CNN-based and ViT-based models, two multidomain feature fusion (MDFF) frameworks, MDFF-ConvMixer and MDFF-ViT, are proposed to increase the amount of feature information derived from images. Compared with the base models, the added design comprises frequency domain feature extraction and MDFF processes. In the frequency domain feature extraction part, the input image is first transformed into the frequency domain through the discrete cosine transform (DCT), and a three-dimensional matrix containing the frequency domain information is then obtained via channel splicing and reshaping. In the MDFF part, MDFF-ConvMixer splices (concatenates) the spatial and frequency domain features by channel, whereas MDFF-ViT uses a cross-attention mechanism to fuse the spatial and frequency domain features. On small target classification tasks, both frameworks improve on the classification algorithm they build on. On the DOTA dataset and the CIFAR10 dataset with two downsampling operations, the accuracies of MDFF-ConvMixer relative to ConvMixer increase from 87.82% and 62.14% to 90.14% and 66.00%, respectively, and the accuracies of MDFF-ViT relative to ViT increase from 79.22% and 36.2% to 88.15% and 59.23%, respectively. (c) The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
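The frequency domain branch and the channel-wise fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the DCT block size or the exact reshaping, so the 8 x 8 JPEG-style blocks and the function names `dct_matrix`, `block_dct_features`, and `fuse_by_channel` are all assumptions made for illustration. The cross-attention fusion used by MDFF-ViT is not shown.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row scaling makes the matrix orthonormal
    return m


def block_dct_features(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Map an (H, W, C) image to an (H/block, W/block, C*block*block) tensor.

    Assumption: non-overlapping block x block DCT, with the coefficients of
    each block stacked along the channel axis ("channel splicing and reshaping").
    """
    h, w, c = image.shape
    if h % block or w % block:
        raise ValueError("image height and width must be multiples of block")
    d = dct_matrix(block)
    # Split into blocks: axes are (block row, row in block, block col, col in block, channel).
    x = image.reshape(h // block, block, w // block, block, c)
    # 2-D DCT of every block, i.e. D @ X @ D.T applied blockwise.
    coeffs = np.einsum("ui,piqjc,vj->pqcuv", d, x, d)
    return coeffs.reshape(h // block, w // block, c * block * block)


def fuse_by_channel(spatial: np.ndarray, frequency: np.ndarray) -> np.ndarray:
    """Concatenate spatial and frequency features along the channel axis."""
    if spatial.shape[:2] != frequency.shape[:2]:
        raise ValueError("spatial and frequency grids must have the same size")
    return np.concatenate([spatial, frequency], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))         # stand-in for a small input image
    freq = block_dct_features(image)        # shape (4, 4, 192)
    spatial = rng.random((4, 4, 64))        # hypothetical spatial-branch features
    fused = fuse_by_channel(spatial, freq)  # shape (4, 4, 256)
    print(freq.shape, fused.shape)
```

In this sketch each spatial location of the frequency tensor holds all DCT coefficients of one image block, so it can be concatenated with a spatial feature map of the same grid size.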


Pages: 18