As the application fields of computer vision expand, the demand for object detection in aerial images captured by unmanned aerial vehicles (UAVs) continues to grow. Traditional methods struggle with scale variation, motion blur, and complex backgrounds. We propose an approach that combines You Only Look Once version 5 (YOLOv5), a detector based on a convolutional neural network (CNN), with the Transformer, a sequence modeling architecture, to better capture long-range dependencies and contextual information and thereby improve detection performance. Experimental results on the VisDrone dataset show that the proposed method performs comparably to existing methods, demonstrating its effectiveness for UAV object detection.