Speech synthesis from ECoG using densely connected 3D convolutional neural networks

Cited by: 142
Authors
Angrick, Miguel [1]
Herff, Christian [1,2]
Mugler, Emily [3]
Tate, Matthew C. [4]
Slutzky, Marc W. [3,5,6,7]
Krusienski, Dean J. [8]
Schultz, Tanja [1]
Affiliations
[1] Univ Bremen, Cognit Syst Lab, Bremen, Germany
[2] Maastricht Univ, Sch Mental Hlth & Neurosci, Maastricht, Netherlands
[3] Northwestern Univ, Dept Neurol, Chicago, IL 60611 USA
[4] Northwestern Univ, Dept Neurol Surg, Chicago, IL 60611 USA
[5] Northwestern Univ, Dept Physiol, Chicago, IL 60611 USA
[6] Northwestern Univ, Dept Biomed Engn, Chicago, IL 60611 USA
[7] Northwestern Univ, Dept Phys Med & Rehabil, Chicago, IL 60611 USA
[8] Virginia Commonwealth Univ, Dept Biomed Engn, Richmond, VA USA
Funding
US National Science Foundation
Keywords
speech synthesis; neural networks; WaveNet; electrocorticography; brain-computer interfaces; BCI; delayed auditory feedback; intelligibility; communication; restoration; dynamics
DOI
10.1088/1741-2552/ab0c59
Chinese Library Classification (CLC)
R318 [Biomedical Engineering]
Subject Classification Code
0831
Abstract
Objective. Direct synthesis of speech from neural signals could provide a fast and natural means of communication for people with neurological diseases. Invasively measured brain activity (electrocorticography; ECoG) supplies the temporal and spatial resolution necessary to decode fast and complex processes such as speech production. Impressive advances in speech decoding from neural signals have been achieved in recent years, but the underlying dynamics are still not fully understood, and it is unlikely that simple linear models can capture the relation between neural activity and continuous spoken speech. Approach. Here we show that deep neural networks can be used to map ECoG from speech production areas onto an intermediate representation of speech (the logMel spectrogram). The proposed method uses a densely connected convolutional neural network topology, which is well suited to the small amount of data available from each participant. Main results. In a study with six participants, we achieved correlations of up to r = 0.69 between the reconstructed and original logMel spectrograms. We transferred our predictions back into an audible waveform by applying a WaveNet vocoder conditioned on the logMel features, which harnesses a much larger, pre-existing speech corpus to provide natural acoustic output. Significance. To the best of our knowledge, this is the first time that high-quality speech has been reconstructed from neural recordings made during speech production using deep neural networks.
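The abstract describes a two-stage pipeline: a densely connected 3D CNN regresses logMel spectrogram frames from spatio-temporal ECoG windows, and a WaveNet vocoder then renders the predicted spectrogram as audio. Below is a minimal PyTorch sketch of the first stage only. The input grid shape (an 8 × 8 electrode grid over 9 time steps), the growth rate, the layer count, and the 40-bin logMel target are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a densely connected 3D CNN regressing one logMel frame
# from an ECoG window. All layer sizes here are assumed for illustration.
import torch
import torch.nn as nn

class DenseBlock3d(nn.Module):
    """Densely connected 3D conv block: each layer receives the
    concatenated feature maps of all preceding layers."""
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
                nn.Conv3d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity
        return x

class EcogToLogMel(nn.Module):
    """Regress a single logMel frame from a spatio-temporal ECoG window."""
    def __init__(self, n_mels: int = 40):  # 40 mel bins assumed
        super().__init__()
        self.dense = DenseBlock3d(in_channels=1, growth_rate=8, num_layers=4)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),  # pool over space and time
            nn.Flatten(),
            nn.Linear(self.dense.out_channels, n_mels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, grid_width, grid_height, time)
        return self.head(self.dense(x))

model = EcogToLogMel()
window = torch.randn(2, 1, 8, 8, 9)  # two dummy 8x8-electrode, 9-step windows
print(model(window).shape)           # torch.Size([2, 40])
```

Training such a regressor would typically minimize the mean-squared error between predicted and recorded logMel frames; the r = 0.69 reported in the abstract is the Pearson correlation between the reconstructed and original spectrograms, which serves as the natural evaluation metric for this stage.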
Pages: 10