Real-time monitoring and counting of maize germination at the seedling stage is of great significance for seed quality assessment, field management, and yield estimation. Traditional manual monitoring and counting are time-consuming, cumbersome, and error-prone. To identify and count maize seedlings quickly and accurately in complex field environments, this study proposes H-RT-DETR (Hierarchical-Real-Time DEtection TRansformer), an end-to-end maize seedling detection model based on hierarchical feature extraction and RT-DETR (Real-Time DEtection TRansformer). H-RT-DETR adopts a backbone built on Hierarchical Feature Representation and Efficient Self-Attention, improving the network's ability to extract seedling-stage maize features from UAV remote sensing images. In experiments on a UAV remote sensing dataset of seedling-stage maize, the improved H-RT-DETR model achieved mean Average Precision values of 51.2% (mAP0.5–0.95), 94.7% (mAP0.5), and 48.1% (mAP0.75), and an Average Recall (AR) of 68.5%. To verify the efficiency of the proposed method, H-RT-DETR was compared with widely used, state-of-the-art object detection methods, and the results show that it outperforms the comparison methods in detection accuracy. In terms of detection speed, H-RT-DETR requires no Non-Maximum Suppression (NMS) post-processing; it reaches 84 Frames Per Second (FPS) on the test dataset, which is 19, 12, 11, and 21 FPS higher than YOLOv5, YOLOv7, YOLOv8, and YOLOX, respectively, under the same hardware environment. In terms of both detection accuracy and speed, this model can provide technical support for real-time detection of maize seedlings in UAV remote sensing imagery (see https://github.com/wylSUGAR/H-RT-DETR for the model implementation and results).
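
To illustrate the backbone idea mentioned above, the sketch below shows the general form of Efficient Self-Attention used in hierarchical transformer backbones, where keys and values are computed on a spatially reduced feature map so that attention remains affordable on high-resolution UAV imagery. This is a minimal illustrative sketch, not the authors' implementation: the class name, the reduction ratio sr_ratio, and all layer sizes are assumptions chosen for the example.

    # Minimal sketch of spatially reduced ("efficient") self-attention.
    # Not the H-RT-DETR code; names and hyperparameters are illustrative.
    import torch
    import torch.nn as nn

    class EfficientSelfAttention(nn.Module):
        def __init__(self, dim, num_heads=8, sr_ratio=4):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, dim * 2)
            self.proj = nn.Linear(dim, dim)
            # Spatial reduction: shrink the key/value token grid before attention,
            # cutting the cost from O(N^2) to roughly O(N^2 / sr_ratio^2).
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x, H, W):
            # x: (B, N, C) tokens flattened from an H x W feature map, N = H * W.
            B, N, C = x.shape
            q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

            # Reduce the spatial resolution of the keys/values.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
            kv = kv.permute(2, 0, 3, 1, 4)
            k, v = kv[0], kv[1]

            # Standard scaled dot-product attention on the reduced key/value set.
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        # Toy check on a 64 x 64 feature map with 96 channels.
        B, C, H, W = 2, 96, 64, 64
        tokens = torch.randn(B, H * W, C)
        attn = EfficientSelfAttention(dim=C, num_heads=8, sr_ratio=4)
        print(attn(tokens, H, W).shape)  # torch.Size([2, 4096, 96])

Stacking such blocks at several strides yields the multi-scale (hierarchical) feature pyramid that the RT-DETR encoder and decoder then consume; the exact stage configuration used in H-RT-DETR is described in the paper and repository rather than in this sketch.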