Although neural text-to-speech (TTS) models show remarkable performance, they still require a large amount of paired <speech, text> data, which is expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then train the TTS model in a supervised manner with a small amount of paired data. The pretrained linguistic representations, extracted from the large-scale speech-only dataset, facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with limited labeled data. We evaluate our proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
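To make the two-stage recipe concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (`UnitToMelDecoder`, `PhonemeToUnitEncoder`), the feature dimensions, the L1 losses, and the assumption of frame-aligned phonemes are all illustrative placeholders for whatever self-supervised features and architecture the actual system uses.

```python
# Hypothetical sketch of the two-stage transfer learning recipe described
# above. All module names, shapes, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class UnitToMelDecoder(nn.Module):
    """Stage 1: maps self-supervised (SSL) speech features to mel frames.
    Trainable on speech-only data, since no text labels are needed."""
    def __init__(self, ssl_dim=768, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ssl_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim)
        )

    def forward(self, ssl_feats):           # (batch, frames, ssl_dim)
        return self.net(ssl_feats)          # (batch, frames, mel_dim)

class PhonemeToUnitEncoder(nn.Module):
    """Stage 2: maps phonemes into the same SSL feature space, trained on
    a small paired <speech, text> set (phoneme-to-feature matching)."""
    def __init__(self, n_phonemes=100, ssl_dim=768):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 256)
        self.proj = nn.Linear(256, ssl_dim)

    def forward(self, phoneme_ids):         # (batch, frames), pre-aligned
        return self.proj(self.embed(phoneme_ids))

# Toy stand-ins for the real corpora (shapes only, random data).
ssl_feats = torch.randn(4, 120, 768)            # SSL features, speech-only corpus
mel_target = torch.randn(4, 120, 80)            # mel frames, same utterances
phoneme_ids = torch.randint(0, 100, (4, 120))   # frame-aligned phonemes, paired set

# Stage 1: unsupervised pretraining on the speech-only dataset.
decoder = UnitToMelDecoder()
opt1 = torch.optim.Adam(decoder.parameters(), lr=1e-4)
loss1 = nn.functional.l1_loss(decoder(ssl_feats), mel_target)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: supervised learning on the small paired dataset, matching
# phoneme encodings to the pretrained linguistic features.
encoder = PhonemeToUnitEncoder()
opt2 = torch.optim.Adam(encoder.parameters(), lr=1e-4)
loss2 = nn.functional.l1_loss(encoder(phoneme_ids), ssl_feats)
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Under these assumptions, inference would chain the two parts: text is converted to phonemes, the stage-2 encoder predicts linguistic features, and the stage-1 decoder renders them as mel-spectrograms, so only the small phoneme encoder depends on labeled data.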