Direct Speech Reconstruction From Articulatory Sensor Data by Machine Learning

Cited by: 53
Authors
Gonzalez, Jose A. [1 ]
Cheah, Lam A. [2 ]
Gomez, Angel M. [3 ]
Green, Phil D. [1 ]
Gilbert, James M. [2 ]
Ell, Stephen R. [4 ]
Moore, Roger K. [1 ]
Holdsworth, Ed [5 ]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England
[2] Univ Hull, Sch Engn, Kingston Upon Hull HU6 7RX, N Humberside, England
[3] Univ Granada, Dept Signal Theory Telemat & Commun, Granada 18010, Spain
[4] Hull & East Yorkshire Hosp Trust, Castle Hill Hosp, Cottingham HU16 5JQ, England
[5] Pract Control Ltd, Sheffield S9 2RS, S Yorkshire, England
Funding
U.S. National Institutes of Health
Keywords
Silent speech interfaces; articulatory-to-acoustic mapping; speech rehabilitation; permanent magnet articulography; speech synthesis; DEEP NEURAL-NETWORKS; VOCAL-TRACT; CONVERSION; RECOGNITION; REPRESENTATIONS; TRANSFORMATION; SYSTEM;
DOI
10.1109/TASLP.2017.2757263
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper describes a technique that generates speech acoustics from articulator movements. Our motivation is to help people who can no longer speak following laryngectomy, a procedure that is carried out tens of thousands of times per year in the Western world. Our method for sensing articulator movement, permanent magnetic articulography, relies on small, unobtrusive magnets attached to the lips and tongue. Changes in magnetic field caused by magnet movements are sensed and form the input to a process that is trained to estimate speech acoustics. In the experiments reported here this "Direct Synthesis" technique is developed for normal speakers, with glued-on magnets, allowing us to train with parallel sensor and acoustic data. We describe three machine learning techniques for this task, based on Gaussian mixture models, deep neural networks, and recurrent neural networks (RNNs). We evaluate our techniques with objective acoustic distortion measures and subjective listening tests over spoken sentences read from novels (the CMU Arctic corpus). Our results show that the best performing technique is a bidirectional RNN (BiRNN), which employs both past and future contexts to predict the acoustics from the sensor data. BiRNNs are not suitable for synthesis in real time but fixed-lag RNNs give similar results and, because they only look a little way into the future, overcome this problem. Listening tests show that the speech produced by this method has a natural quality that preserves the identity of the speaker. Furthermore, we obtain up to 92% intelligibility on the challenging CMU Arctic material. To our knowledge, these are the best results obtained for a silent-speech system without a restricted vocabulary and with an unobtrusive device that delivers audio in close to real time. This work promises to lead to a technology that truly will give people whose larynx has been removed their voices back.
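To make the articulatory-to-acoustic mapping concrete, the following is a minimal sketch (not the authors' implementation) of the best-performing model family from the abstract: a bidirectional recurrent network that regresses from frames of magnetic-sensor features to frames of acoustic (vocoder) parameters, so that each output frame is conditioned on both past and future sensor context. All dimensions, names, and hyperparameters below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of a bidirectional RNN for articulatory-to-acoustic mapping.
# sensor_dim / acoustic_dim / hidden_dim are hypothetical placeholders.
import torch
import torch.nn as nn

class ArticulatoryToAcoustic(nn.Module):
    def __init__(self, sensor_dim=9, acoustic_dim=25, hidden_dim=128):
        super().__init__()
        # Bidirectional LSTM: every output frame sees both past and
        # future sensor context, as in the BiRNN described in the abstract.
        self.rnn = nn.LSTM(sensor_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        # Linear regression onto acoustic (e.g. vocoder) parameters.
        self.out = nn.Linear(2 * hidden_dim, acoustic_dim)

    def forward(self, sensors):
        # sensors: (batch, frames, sensor_dim)
        h, _ = self.rnn(sensors)
        return self.out(h)  # (batch, frames, acoustic_dim)

model = ArticulatoryToAcoustic()
x = torch.randn(4, 200, 9)                  # 4 utterances, 200 frames each
y_hat = model(x)                            # predicted acoustic trajectories
loss = nn.functional.mse_loss(y_hat, torch.randn_like(y_hat))  # training loss
```

A fixed-lag variant of the kind the abstract credits with near-real-time operation could, under the same assumptions, replace the bidirectional pass with a unidirectional RNN whose targets are delayed by a few frames, trading a small fixed latency for access to a limited window of future context.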
Pages: 2362-2374 (13 pages)