Identifying small, overlapping wheat ears in UAV images remains a difficult task. This paper proposes SCP-YOLO, a novel detection model that addresses this difficulty. First, a dataset is built from remote sensing images of wheat ears captured at two periods and three altitudes. Using the YOLOv8n network as a baseline, SCP-YOLO processes the network’s low-resolution feature layer with the space-to-depth (SPD) approach. At the feature fusion stage, a Context-Aggregation structure is applied to aggregate information across the feature map and let it interact on a global scale. A lightweight detection head is built with the PConv method. In addition, a new detection scale is added that integrates shallower feature information with location data. Experimental results show that the proposed method outperforms several established state-of-the-art detection models, achieving a detection speed of 90 frames per second and an AP@0.5 of 96.3%. Compared with the baseline network, AP@0.5 and AP@0.5:0.95 increased by 2.5% and 6.3%, respectively. The results also indicate that the method is highly robust across six scenario datasets. For counting, it is accurate and capable of quantifying wheat ears in images acquired through remote sensing. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
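
The space-to-depth (SPD) rearrangement named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `space_to_depth`, the block size of 2, and the channel ordering are assumptions chosen for illustration. The sketch only shows the general idea that each block of neighbouring pixels is moved into the channel dimension, so spatial resolution drops without discarding any values.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Rearrange a (C, H, W) array into (C * block**2, H // block, W // block).

    Hypothetical helper for illustration; the channel ordering is one of
    several valid conventions.
    """
    c, h, w = x.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    # Split each spatial axis into (coarse position, offset inside the block).
    x = x.reshape(c, h // block, block, w // block, block)
    # Bring the two in-block offsets to the front, ahead of the channel axis.
    x = x.transpose(2, 4, 0, 1, 3)
    # Merge offsets and channels into a single, deeper channel axis.
    return x.reshape(block * block * c, h // block, w // block)


if __name__ == "__main__":
    feature_map = np.arange(16).reshape(1, 4, 4)
    out = space_to_depth(feature_map)
    print(out.shape)
    print(out[0])
```

With a 1×4×4 input, the result has four channels of size 2×2, and each channel holds the pixels that share the same offset inside their 2×2 block. In SPD-style layers such a rearrangement is typically followed by a non-strided convolution, which is not shown here.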