Cross-Modality Data Augmentation for Aerial Object Detection with Representation Learning

Cited: 0
Authors
Wei, Chiheng [1 ]
Bai, Lianfa [1 ]
Chen, Xiaoyu [1 ]
Han, Jing [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Elect & Opt Engn, Nanjing 210094, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
data augmentation; cross-modality; object detection; representation learning;
DOI
10.3390/rs16244649
Chinese Library Classification
X [Environmental Science, Safety Science];
Discipline Classification Codes
08; 0830;
Abstract
Data augmentation offers a cost-effective and efficient alternative to acquiring additional data, significantly enhancing data diversity and model generalization; it is therefore particularly favored in object detection tasks. However, existing data augmentation techniques focus primarily on the visible spectrum and are applied directly to RGB-T object detection, overlooking the inherent differences in image data between the two tasks. Visible images capture rich color and texture information during the daytime, while infrared images can image low-light, complex scenes at night. Integrating information from both modalities exploits their complementary characteristics and improves the overall effectiveness of data augmentation. To this end, we propose a cross-modality data augmentation method tailored for RGB-T object detection that leverages masked image modeling within representation learning. Specifically, we exploit the temporal consistency of infrared images and combine them with visible images captured under varying lighting conditions for joint data augmentation, thereby enhancing the realism of the augmented images. Using masked image modeling, we reconstruct images by integrating multimodal features, achieving cross-modality data augmentation in feature space. We also investigate the differences and complementarities between data augmentation in data space and in feature space and, building on existing theoretical foundations, propose an integrative framework that combines the two for improved augmentation effectiveness. Finally, we address the slow convergence of the existing Mosaic method on aerial imagery by introducing a multi-scale training strategy and proposing a full-scale Mosaic method as a complement, which significantly accelerates network convergence.
The experimental results validate the effectiveness of our proposed method and highlight its potential for further advancements in cross-modality object detection tasks.
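The feature-space augmentation described in the abstract reconstructs masked regions by integrating multimodal features. As a rough illustration of the masking idea only, the sketch below samples an MAE-style random patch mask and fills the masked RGB patches from an aligned infrared image; the function names and the pixel-level fill are our own assumptions for illustration, since the paper's method reconstructs in feature space rather than copying raw pixels.

```python
import numpy as np

def random_patch_mask(h_patches, w_patches, mask_ratio=0.75, rng=None):
    """Sample a random boolean mask over image patches (True = masked),
    in the style of MAE-based masked image modeling."""
    rng = np.random.default_rng() if rng is None else rng
    n = h_patches * w_patches
    n_mask = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    return mask.reshape(h_patches, w_patches)

def masked_cross_modal_input(rgb, ir, mask, patch=16):
    """Toy stand-in for cross-modal reconstruction: masked RGB patches
    are filled from the spatially aligned infrared image."""
    out = rgb.copy()
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j]:
                ys, xs = i * patch, j * patch
                out[ys:ys + patch, xs:xs + patch] = ir[ys:ys + patch, xs:xs + patch]
    return out
```

In the actual pipeline, the masked tokens would be reconstructed by a decoder conditioned on features from both modalities; this sketch only shows where the mask sits and how the infrared modality supplies the missing content.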
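The Mosaic method the abstract builds on stitches four images around a random centre point. A minimal NumPy sketch of the classic YOLOv4-style version is below; the "full-scale" variant proposed in the paper would additionally rescale each tile across the full training scale range, which is not shown here, and the function name is ours.

```python
import numpy as np

def mosaic4(images, out_size=640, rng=None):
    """Classic 4-image Mosaic: paste four images into the quadrants
    defined by a random centre inside an out_size x out_size canvas.
    Each tile is cropped to fit its quadrant (no rescaling here)."""
    rng = np.random.default_rng() if rng is None else rng
    assert len(images) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # Random mosaic centre, kept away from the canvas borders.
    cx = int(rng.uniform(0.25, 0.75) * out_size)
    cy = int(rng.uniform(0.25, 0.75) * out_size)
    regions = [  # (y0, y1, x0, x1) for top-left, top-right, bottom-left, bottom-right
        (0, cy, 0, cx), (0, cy, cx, out_size),
        (cy, out_size, 0, cx), (cy, out_size, cx, out_size),
    ]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        h, w = y1 - y0, x1 - x0
        canvas[y0:y1, x0:x1] = img[:h, :w]  # crop tile to its quadrant
    return canvas
```

Bounding boxes would be shifted and clipped alongside the tiles in a real detector pipeline; combining this with multi-scale training is what the abstract reports as the fix for slow convergence on aerial imagery.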
Pages: 23