Image captioning, in which a machine automatically generates natural language that matches the semantic content of a given image, is a challenging task that has attracted much attention in recent years. However, most existing models are designed manually, and their performance depends heavily on the designer's expertise. In addition, the computational flow of these models is predefined, so hard and easy samples share the same encoding path and can interfere with each other, which confuses the learning of the model. In this paper, we propose a Dynamic Transformer that changes the encoding procedure from a fixed sequential pipeline to adaptive, data-dependent computation paths. Specifically, we design three different types of visual feature extraction blocks and deploy them in parallel at each layer to construct a fully connected multi-layer routing space. Each block contains a computation unit that performs the corresponding operation and a routing gate that learns to adaptively select the direction in which to pass the signal based on the input image. Thus, our model obtains robust visual representations by exploring potential visual feature extraction paths. We evaluate our method quantitatively and qualitatively on the benchmark MSCOCO image captioning dataset and perform extensive ablation studies to investigate the reasons for its effectiveness. The experimental results show that our method significantly outperforms previous state-of-the-art methods.
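
To make the routing idea concrete, the following is a minimal sketch of one dynamic encoder layer of the kind the abstract describes: three parallel visual feature extraction blocks whose outputs are combined by a learned routing gate conditioned on the input features. This is not the authors' implementation; the specific block types, the soft (weighted-sum) routing, and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DynamicEncoderLayer(nn.Module):
    """Hypothetical layer: three parallel blocks plus an input-dependent routing gate."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Three candidate feature-extraction blocks deployed in parallel
        # (assumed choices: self-attention, a feed-forward block, and a skip path).
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True),
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)),
            nn.Identity(),
        ])
        # Routing gate: predicts per-image weights over the blocks
        # from the mean-pooled visual features.
        self.gate = nn.Linear(dim, len(self.blocks))

    def forward(self, x):  # x: (batch, regions, dim) region-level visual features
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, 3)
        outputs = []
        for block in self.blocks:
            if isinstance(block, nn.MultiheadAttention):
                out, _ = block(x, x, x)  # self-attention over image regions
            else:
                out = block(x)
            outputs.append(out)
        stacked = torch.stack(outputs, dim=1)            # (batch, 3, regions, dim)
        return (weights[:, :, None, None] * stacked).sum(dim=1)  # routed mixture
```

Stacking several such layers yields a multi-layer routing space in which different images can, in effect, follow different feature extraction paths through the encoder.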