In today's highly competitive industrial environment, digital transformation and smart manufacturing have become crucial strategies for staying competitive. Companies undergoing digital transformation often face high initial investment, difficulty integrating hardware and software, and debugging problems caused by the low interpretability of deep learning models. This study therefore integrates explainable AI models with depth cameras in the footwear industry to achieve model explainability and production line automation cost-effectively. Combining YOLOv7 and Mask R-CNN yields a real-time object detection system that provides accurate object coordinates and tilt angles, and integrating it with depth cameras enables a robotic arm to grasp objects accurately in a cluttered environment. The proposed model achieves an accuracy of 97% in a simulated scenario of stacking insole pads. The approach reduces hardware equipment investment by 20%, streamlines production processes, lowers labor costs, and improves overall productivity. Moreover, the model's explainability aids system troubleshooting and reduces errors that arise during digital transformation. With this integrated approach, businesses in the footwear industry can upgrade their production processes, reduce costs, and improve their competitiveness in the market.