Object recognition in remote sensing images presents unique challenges due to the diverse scales, shapes, and distributions of objects, particularly small and complex ones. Existing frameworks such as RT-DETR struggle to detect small objects accurately because of their limited ability to extract fine-grained details and integrate multi-scale information. To overcome these challenges, we propose an enhanced object recognition model based on a hybrid convolution-and-transformer structure. The model improves two critical components of the original RT-DETR by introducing the Multi-Scale Adaptive Attention Module (MSAAM) and the Hybrid Feature Fusion Module (HFFM), both designed to enhance feature extraction and integration. The MSAAM strengthens the ResNet backbone by adaptively combining local and global information, ensuring effective extraction of fine-grained details while emphasizing features critical for small-object detection. The HFFM, integrated into the final stages of the neck, employs a dual-branch design that balances fine-grained local detail extraction with large-scale contextual understanding. Through group convolution, depthwise separable convolution, and attention mechanisms, the HFFM mitigates the loss of fine detail caused by downsampling while exploiting the expanded receptive field for broader contextual understanding. Experimental results demonstrate that the proposed model achieves superior object recognition performance, particularly for small objects, making it well suited to remote sensing applications.
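The efficiency argument behind the HFFM's use of group and depthwise separable convolutions can be illustrated with a short parameter-count calculation. This is a sketch only: the channel widths and kernel size below are assumed for illustration and are not the model's actual configuration.

```python
# Weight-only parameter counts (biases ignored) for the convolution
# variants the HFFM relies on. The dimensions used in the example at
# the bottom are illustrative assumptions.

def standard_conv_params(c_in, c_out, k):
    # A standard k x k convolution learns one k x k filter per
    # (input channel, output channel) pair.
    return c_in * c_out * k * k

def group_conv_params(c_in, c_out, k, groups):
    # Group convolution splits channels into independent groups,
    # dividing the parameter count by the number of groups.
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise stage: one k x k filter per input channel
    # (i.e. groups == c_in), followed by a 1x1 pointwise
    # convolution that mixes information across channels.
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 256, 256, 3  # assumed example dimensions
std = standard_conv_params(c_in, c_out, k)          # 589,824
grp = group_conv_params(c_in, c_out, k, groups=4)   # 147,456
dws = depthwise_separable_params(c_in, c_out, k)    # 67,840
print(f"standard: {std}, group(4): {grp}, separable: {dws}")
```

For these assumed dimensions the depthwise separable form uses roughly 8.7x fewer parameters than a standard convolution, which is why such layers can preserve fine-grained branches without a prohibitive compute budget.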