Visual localization plays an essential role in a wide range of fields. Indirect learning-based methods achieve excellent performance, but they require training in the target scene before localization can be performed. To achieve deep scene-independent localization, we first propose a representation, the residual coordinate map, defined between a pair of images. Based on this representation, we put forward a network, SILocNet, that outputs the proposed residual coordinate map. The network consists of feature extraction, multi-level feature fusion, and a transformer-based coordinate decoder. Moreover, to handle dynamic scenes, we introduce an additional segmentation branch that distinguishes static from dynamic regions to improve the network's perception. With SILocNet in place, we present a cascaded localizer design that reduces accumulated error, together with a simple mathematical analysis of the cascaded localizers. To evaluate our algorithm, we conduct experiments on the static 7 Scenes and ScanNet datasets and the dynamic TUM RGB-D dataset. In particular, we train the network on ScanNet and test it on 7 Scenes and TUM RGB-D to demonstrate its generalization ability. In all experiments, our method outperforms existing methods. Additionally, we discuss the effects of the cascaded localizer design, feature fusion, the transformer-based coordinate decoder, and the segmentation loss.
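To make the overall pipeline concrete, the following is a minimal PyTorch sketch of the two ideas named above: a pair-input network that regresses a residual coordinate map (with a segmentation head for dynamic regions), wrapped in a cascaded refinement loop. All names here (SILocNetSketch, cascaded_localize, warp_fn) are hypothetical illustrations under assumed shapes and layer sizes, not the authors' actual architecture or released code.

```python
# Hypothetical sketch only; layer sizes, heads, and the warp step are assumptions.
import torch
import torch.nn as nn


class SILocNetSketch(nn.Module):
    """Stand-in for SILocNet: shared CNN features over an image pair, a
    transformer layer as a lightweight coordinate decoder, and two heads."""

    def __init__(self, dim=64):
        super().__init__()
        # Joint feature extraction over the concatenated pair (6 = 2 x RGB),
        # downsampling by a factor of 4.
        self.backbone = nn.Sequential(
            nn.Conv2d(6, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.coord_head = nn.Conv2d(dim, 3, 1)  # 3-channel residual coordinate map
        self.seg_head = nn.Conv2d(dim, 1, 1)    # static/dynamic mask logits

    def forward(self, img_a, img_b):
        f = self.backbone(torch.cat([img_a, img_b], dim=1))  # (B, C, H', W')
        b, c, h, w = f.shape
        tokens = self.decoder(f.flatten(2).transpose(1, 2))  # (B, H'*W', C)
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.coord_head(f), torch.sigmoid(self.seg_head(f))


def cascaded_localize(model, query, reference, warp_fn, stages=3):
    """Cascaded localizer idea: run the model several times, warping the
    query by the current estimate so each later stage only has to correct
    a smaller residual, which limits the accumulated error."""
    total = None
    for _ in range(stages):
        residual, _mask = model(query, reference)
        total = residual if total is None else total + residual
        query = warp_fn(query, residual)  # hypothetical warping step
    return total


# Usage with an identity placeholder for the warp; a real system would
# reproject the query using the estimated coordinates and camera model.
model = SILocNetSketch()
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
coords = cascaded_localize(model, a, b, warp_fn=lambda q, r: q)
print(coords.shape)  # torch.Size([1, 3, 16, 16])
```

The cascade mirrors the intuition behind the paper's analysis: if each stage removes a fixed fraction of the remaining error, the residual shrinks geometrically with the number of stages.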