MLP-based multimodal tomato detection in complex scenarios: Insights from task-specific analysis of feature fusion architectures

Cited by: 6

Authors:

Chen, Wenjun ^{[1,2,3]}

Rao, Yuan ^{[1,2,3]}

Wang, Fengyi ^{[1,2,3]}

Zhang, Yu ^{[1,2,3]}

Wang, Tan ^{[1,2,3]}

Jin, Xiu ^{[1,2]}

Hou, Wenhui ^{[2,3,4]}

Jiang, Zhaohui ^{[1,2,3]}

Zhang, Wu ^{[1,2,3]}

Affiliations:

[1] Anhui Agr Univ, Sch Informat & Artificial Intelligence, Hefei 230036, Anhui, Peoples R China

[2] Minist Agr & Rural Affairs, Key Lab Agr Sensors, Hefei 230036, Anhui, Peoples R China

[3] Anhui Prov Key Lab Smart Agr Technol & Equipment, Hefei 230036, Anhui, Peoples R China

[4] Anhui Agr Univ, Coll Engn, Hefei 230036, Anhui, Peoples R China

Source:

COMPUTERS AND ELECTRONICS IN AGRICULTURE | 2024 / Vol. 221

Funding:

National Natural Science Foundation of China;

Keywords:

Tomato detection; Multimodal; Feature fusion; YOLO; Complex scenarios; DEEP;

DOI:

10.1016/j.compag.2024.108951

Chinese Library Classification (CLC) number:

S [Agricultural Sciences];

Discipline classification code:

09;

Abstract:

Accurate and efficient tomato detection is essential for deploying robotic picking in agricultural applications, but detecting tomatoes from RGB images alone remains challenging in complex scenarios with fluctuating light, overlapping fruits, and occlusion by branches and leaves. The recent development of RGB-D sensors offers a promising opportunity to adopt multimodal fusion for high-quality fruit detection. However, whether existing multimodal fusion and feature extraction architectures are suitable for lightweight tomato detection, especially in complex agricultural scenarios, remains an open question. To address this, we proposed a multimodal fusion encoder that used depth and near-infrared modalities to assist RGB images, making full use of the multimodal data. The encoder had a plug-and-play structure that could be implemented as an MLP-based (Multi-Layer Perceptron), ViT-based (Vision Transformer), or CNN-based (Convolutional Neural Network) architecture. Furthermore, we developed a lightweight experimental detection framework based on YOLOv7-tiny by integrating the multimodal fusion encoder and, after a comprehensive analysis of the three architectures, put forward YOLO-DNA (Depth and Near-infrared Assisted), built on the MLP-based architecture. In addition, we established a multimodal tomato dataset containing visible, depth, and near-infrared images. Experimental results demonstrated that YOLO-DNA achieved an mAP_{0.5} of 98.13% and an mAP_{0.5:0.95} of 74.0%, an average increase of 5.01% in mAP_{0.5} and 14.55% in mAP_{0.5:0.95} over mainstream lightweight detection models, with a detection speed of 37.12 FPS, which meets the demand of real-time tomato detection. These findings have the potential to advance research on fruit detection for intelligent agricultural harvesting.
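As a rough illustration of the general idea in the abstract (depth and near-infrared features assisting RGB features through an MLP), the sketch below fuses three per-pixel feature maps with a two-layer channel MLP and a residual connection. It is not the YOLO-DNA encoder, whose internals the abstract does not describe: the function names `gelu` and `mlp_fuse`, the channel sizes, the random weights, and the residual design are all assumptions made for this example.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp_fuse(rgb, depth, nir, w1, b1, w2, b2):
    """Fuse (H, W, C) feature maps with a per-pixel two-layer MLP.

    Hypothetical illustration only; not the encoder from the paper.
    """
    # Stack the three modalities along the channel axis.
    x = np.concatenate([rgb, depth, nir], axis=-1)  # (H, W, C_rgb + C_d + C_nir)
    # The same MLP is applied independently at every pixel (channel mixing).
    hidden = gelu(x @ w1 + b1)                      # (H, W, C_hidden)
    correction = hidden @ w2 + b2                   # (H, W, C_rgb)
    # Residual connection: the auxiliary modalities adjust the RGB features.
    return rgb + correction


# Assumed toy shapes: 8x8 feature maps, 16 RGB channels, 4 depth, 4 NIR.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 8, 16))
depth = rng.standard_normal((8, 8, 4))
nir = rng.standard_normal((8, 8, 4))

w1 = rng.standard_normal((24, 32)) * 0.1  # 24 = 16 + 4 + 4 input channels
b1 = np.zeros(32)
w2 = rng.standard_normal((32, 16)) * 0.1
b2 = np.zeros(16)

fused = mlp_fuse(rgb, depth, nir, w1, b1, w2, b2)
```

The fused map keeps the shape of the RGB features, so under these assumptions it could stand in wherever a detector expects RGB-only features. With a zero second layer the block reduces to the identity on the RGB input, which is one reason a residual form suits a plug-and-play module.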


Pages: 20
