Deep Neural Network Based Acoustic-to-articulatory Inversion Using Phone Sequence Information

Cited by: 13
Authors
Xie, Xurong [1 ,3 ]
Liu, Xunying [1 ,2 ]
Wang, Lan [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Key Lab Human Machine Intelligence Synergy Syst, Shenzhen, Peoples R China
[2] Univ Cambridge, Engn Dept, Trumpington St, Cambridge CB2 1PZ, England
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
National Natural Science Foundation of China
Keywords
acoustic-to-articulatory inversion; deep neural network; bottleneck feature; phone sequence; MOVEMENTS; FEATURES;
DOI
10.21437/Interspeech.2016-659
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In recent years, neural network based acoustic-to-articulatory inversion approaches have achieved state-of-the-art performance. One major issue associated with these approaches is the lack of phone sequence information during inversion. To address this issue, this paper proposes an improved architecture that hierarchically concatenates phone classification and articulatory inversion component DNNs to improve articulatory movement generation. On a Mandarin Chinese speech inversion task, the proposed technique consistently outperformed a range of baseline DNN and RNN inversion systems constructed using no phone sequence information, a mixture density parameter output layer, additional phone features at the input layer, or multi-task learning with additional monophone output layer target labels, measured in terms of electromagnetic articulography (EMA) root mean square error (RMSE) and correlation. Further improvements were obtained by using the bottleneck features extracted from the proposed hierarchical articulatory inversion systems as auxiliary features in generalized variable parameter HMM (GVP-HMM) based inversion systems.
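The hierarchical architecture the abstract describes, a phone-classification DNN whose frame-level phone posteriors are concatenated with the acoustic features feeding a second, articulatory-inversion DNN, can be sketched roughly as follows. This is a minimal numpy forward-pass sketch with untrained random weights; all layer sizes, feature dimensions, and helper names here are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes, rng):
    """Random (W, b) layer stack for an MLP with the given layer sizes."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, layers):
    """Forward pass: tanh hidden layers, linear output layer."""
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    return x @ W + b

# Hypothetical dimensions: frames, acoustic features, phones, EMA channels.
N_FRAMES, N_ACOUSTIC, N_PHONES, N_EMA = 200, 39, 40, 12

# Component 1: phone-classification DNN producing frame-level posteriors.
phone_net = make_mlp([N_ACOUSTIC, 64, N_PHONES], rng)
# Component 2: inversion DNN fed with acoustics plus phone posteriors.
inv_net = make_mlp([N_ACOUSTIC + N_PHONES, 64, N_EMA], rng)

acoustics = rng.standard_normal((N_FRAMES, N_ACOUSTIC))
logits = mlp_forward(acoustics, phone_net)
posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
posteriors /= posteriors.sum(axis=1, keepdims=True)  # softmax over phones

# Hierarchical concatenation: phone evidence augments the acoustic input.
ema_pred = mlp_forward(np.concatenate([acoustics, posteriors], axis=1),
                       inv_net)

# The paper's evaluation metrics: RMSE and correlation vs. reference EMA
# (random reference here, purely to show the metric computation).
ema_ref = rng.standard_normal((N_FRAMES, N_EMA))
rmse = np.sqrt(np.mean((ema_pred - ema_ref) ** 2))
corr = np.corrcoef(ema_pred.ravel(), ema_ref.ravel())[0, 1]
```

In a trained system the phone network would be optimized for frame-level phone classification and the inversion network for EMA regression; the sketch only shows how the two components are wired together.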
Pages: 1497-1501
Number of pages: 5