A computational model that can reproduce the process by which human children acquire language, including word discovery and generation, is crucial to understanding human development. Such a model should not depend on transcribed data, which researchers typically prepare manually when training automatic speech recognition and speech synthesis systems. One of the main differences between speech recognition and production by human infants and by conventional computer systems is access to transcribed data, i.e., supervised learning with transcriptions versus unsupervised learning without them. This study proposes an unsupervised machine learning method for spoken word production that uses no transcribed data; the entire system is trained purely on speech signals that the system (the robot) can obtain from its auditory sensor, e.g., a microphone. The method combines the nonparametric Bayesian double articulation analyzer (NPB-DAA), an unsupervised machine learning method that enables a robot to identify word-like and phoneme-like linguistic units from speech signals alone, with a hidden Markov model-based (HMM-based) statistical speech synthesis method, which has been widely used to develop text-to-speech (TTS) systems. The latent letters, i.e., phoneme-like units, and latent words, i.e., word-like units, discovered by the NPB-DAA are used to train the HMM-based TTS system. We present two experiments using Japanese vowel sequences and an English spoken digit corpus, respectively. Both experiments showed that the proposed method can produce many spoken words that are recognizable as the original words provided by the human speakers. Furthermore, we discuss future challenges in creating a robot that can autonomously learn phoneme systems and vocabulary from sensorimotor information alone.
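To make the two-stage pipeline described above concrete, the following is a minimal, hypothetical Python sketch. It is not the authors' implementation: the classes `DoubleArticulationAnalyzer` and `HMMSynthesizer` are illustrative placeholders standing in for the NPB-DAA and an HMM-based TTS back end, and all names and signatures are assumptions introduced only for illustration.

```python
"""Hypothetical sketch: unsupervised word-production pipeline (no transcribed data).
All classes and functions are placeholders, not the actual NPB-DAA or HMM-TTS code."""

from dataclasses import dataclass

import numpy as np


@dataclass
class LatentWord:
    """A word-like unit: a sequence of phoneme-like latent letters (integer indices)."""
    letters: tuple


class DoubleArticulationAnalyzer:
    """Placeholder for the NPB-DAA: unsupervised discovery of latent letters
    and latent words from acoustic feature sequences, without transcriptions."""

    def fit(self, feature_sequences):
        # The real NPB-DAA infers a hierarchical Bayesian model here.
        self.fitted = True

    def transcribe(self, features):
        # Returns a pseudo-transcription as a sequence of latent-word labels.
        return [LatentWord(letters=(0, 1))]


class HMMSynthesizer:
    """Placeholder for an HMM-based statistical TTS back end, trained on the
    discovered latent units instead of text transcriptions."""

    def train(self, feature_sequences, pseudo_transcriptions):
        self.trained = True

    def synthesize(self, word: LatentWord) -> np.ndarray:
        # Would generate speech (here, a silent dummy waveform) for a latent word.
        return np.zeros(16000)


def learn_to_speak(speech_signals, extract_features):
    """Train the whole system from raw speech only (no transcribed data)."""
    features = [extract_features(x) for x in speech_signals]

    daa = DoubleArticulationAnalyzer()
    daa.fit(features)                                   # stage 1: unit discovery
    pseudo_labels = [daa.transcribe(f) for f in features]

    tts = HMMSynthesizer()
    tts.train(features, pseudo_labels)                  # stage 2: TTS training
    return daa, tts
```

The point the sketch tries to capture is that the TTS stage is trained on the NPB-DAA's pseudo-transcriptions rather than on ground-truth text, so no step in the loop requires manually transcribed data.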