Unsupervised Learning for Spoken Word Production based on Simultaneous Word and Phoneme Discovery without Transcribed Data

Cited: 0
Authors
Miyuki, Yuusuke [1 ]
Hagiwara, Yoshinobu [1 ]
Taniguchi, Tadahiro [1 ]
Affiliations
[1] Ritsumeikan Univ, Coll Informat Sci & Engn, Kusatsu, Shiga, Japan
Source
2017 THE SEVENTH JOINT IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING AND EPIGENETIC ROBOTICS (ICDL-EPIROB) | 2017
Keywords
LANGUAGE-ACQUISITION; HMM;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A computational model that can reproduce the process of language acquisition by human children, including word discovery and generation, is crucially important for understanding the human developmental process. Such a model should not depend on transcribed data, which are usually provided manually when researchers train automatic speech recognition and speech synthesis systems. One of the main differences between speech recognition and production by human infants and by conventional computer systems concerns access to transcribed data, i.e., supervised learning with transcribed data versus unsupervised learning without it. This study proposes an unsupervised machine learning method for spoken word production that does not use any transcribed data; the entire system is trained purely on speech signals that the system (the robot) can obtain from its auditory sensor, e.g., a microphone. The method combines the nonparametric Bayesian double articulation analyzer (NPB-DAA), an unsupervised machine learning method that enables a robot to identify word-like and phoneme-like linguistic units from speech signals alone, with a hidden Markov model-based (HMM-based) statistical speech synthesis method, which has been widely used to develop text-to-speech (TTS) systems. Latent letters, i.e., phoneme-like units, and latent words, i.e., word-like units, discovered by the NPB-DAA are used to train the HMM-based TTS system. We present two experiments using Japanese vowel sequences and an English spoken digit corpus, respectively. Both experiments showed that the proposed method can produce many spoken words that are recognizable as the original words provided by the human speakers. Furthermore, we discuss future challenges in creating a robot that can autonomously learn phoneme systems and vocabulary solely from sensory-motor information.
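The abstract describes a two-stage pipeline: first discover phoneme-like units from untranscribed speech, then train a per-unit generative model that can reproduce them. A minimal toy sketch of that idea follows; it is not the paper's implementation. Here simple k-means clustering of synthetic acoustic feature frames stands in for NPB-DAA's unit discovery, and emitting frames from each unit's fitted mean stands in for sampling from a per-unit HMM output distribution in an HMM-based TTS system. All names and the synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "speech": feature frames drawn from three hidden phoneme-like units.
true_means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
labels_true = rng.integers(0, 3, size=300)
frames = true_means[labels_true] + rng.normal(scale=0.5, size=(300, 2))

def kmeans(X, k, iters=20):
    """Cluster feature frames into k phoneme-like units (a crude stand-in
    for the unsupervised unit discovery performed by NPB-DAA)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

units, assign = kmeans(frames, k=3)

# "Synthesis" stage: emit the discovered unit's mean frame plus noise,
# analogous to sampling output frames from a per-unit generative model.
def synthesize(unit_id, n_frames=5):
    return units[unit_id] + rng.normal(scale=0.1, size=(n_frames, 2))

spoken = synthesize(0)
print(spoken.shape)  # (5, 2)
```

The key property the sketch shares with the paper's method is that the synthesis model is trained only on units discovered without any transcriptions; real systems operate on spectral features (e.g., MFCCs) and use HMMs rather than cluster means.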
Pages: 156-163
Number of pages: 8