Remote sensing object detection remains challenging under complex conditions such as low light, adverse weather, modality attacks, or missing modalities. Previous approaches typically alleviate this problem either by enhancing visible images or by leveraging multi-modal fusion techniques. The authors propose a unified framework based on YOLO-World that combines the advantages of both schemes, achieving more adaptable and robust remote sensing object detection in complex real-world scenarios. The framework introduces a unified modality modelling strategy that allows the model to learn rich object features from multiple remote sensing datasets. In addition, a U-fusion neck based on the diffusion method is designed to remove modality-specific noise and generate missing complementary features. Extensive experiments were conducted on four remote sensing image datasets: the multimodal VEDAI and DroneVehicle, and the unimodal VisDrone and UAVDT. The approach achieves average precision scores of 50.5%, 55.3%, 25.1%, and 20.7%, respectively, outperforming advanced multimodal remote sensing object detection methods and low-light image enhancement techniques.
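To make the idea of a diffusion-based fusion neck concrete, the sketch below shows a minimal encoder-decoder module that denoises stacked visible/infrared feature maps and substitutes noise for a missing modality. All names, shapes, and the single-step residual denoising are illustrative assumptions; this is not the paper's actual U-fusion neck.

```python
# Illustrative sketch only: a toy diffusion-style fusion neck. Module names,
# channel sizes, and the one-step denoising are assumptions for exposition,
# not the authors' implementation.
import torch
import torch.nn as nn


class SimpleFusionNeck(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # U-shaped encoder-decoder over the two stacked modalities.
        self.encode = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, vis_feat: torch.Tensor, ir_feat=None) -> torch.Tensor:
        # If one modality is missing, substitute Gaussian noise so the network
        # can "generate" the complementary features while denoising.
        if ir_feat is None:
            ir_feat = torch.randn_like(vis_feat)
        x = torch.cat([vis_feat, ir_feat], dim=1)
        # Predict a residual (noise estimate) and subtract it, mimicking a
        # single reverse-diffusion step on the fused feature map.
        noise_est = self.decode(self.encode(x))
        return vis_feat + ir_feat - noise_est


# Usage: fuse two 256-channel feature maps, or only the visible one.
neck = SimpleFusionNeck(256)
fused_both = neck(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
fused_vis_only = neck(torch.randn(1, 256, 64, 64))
```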