Timely crack detection is crucial for maintaining the health of wind turbine blades (WTBs). However, the inner surface of a WTB is an intricate composite structure, and its cracks have low saliency, so current object detection networks lack both efficiency and accuracy when detecting them. Moreover, the low dynamic range of a single visible image of the WTB inner surface degrades the accuracy of crack detection models and cannot be neglected. To address these problems, a new dataset consisting of paired infrared (IR) and visible (VIS) images is constructed, and a new network structure is proposed to exploit this dataset by fusing crack information across the two modalities (IR-VIS). In addition, a rotated bounding box method is introduced to quantify crack width, height, and rotation angle accurately. Finally, a two-stage training method is designed to improve training results, mitigating the slow training caused by the complexity of the fusion structure. Experiments show that the IR-VIS dataset is effective for crack detection and that the modified model improves detection performance by 31.59%. In ablation experiments on crack detection of the WTB inner surface, the modified model outperforms the base object detection model, reaching up to 82.02% accuracy by fusing IR-VIS images. Furthermore, the model achieves 75.94% accuracy at 38 FPS on the Jetson Nano, a computationally limited embedded device.