The Representation of Speech in Deep Neural Networks

Cited by: 3
Authors
Scharenborg, Odette [1, 2]
van der Gouw, Nikki [2]
Larson, Martha [1, 2]
Marchiori, Elena [2]
Affiliations
[1] Delft Univ Technol, Multimedia Comp Grp, Delft, Netherlands
[2] Radboud Univ Nijmegen, Nijmegen, Netherlands
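Source
MULTIMEDIA MODELING, MMM 2019, PT II | 2019 / Vol. 11296
Keywords
Deep neural networks; Speech representations; Visualizations
DOI
10.1007/978-3-030-05716-9_16
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract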
In this paper, we investigate the connection between how people understand speech and how speech is understood by a deep neural network. A naive, general feed-forward deep neural network was trained for the task of vowel/consonant classification. Subsequently, the representations of the speech signal in the different hidden layers of the DNN were visualized. The visualizations allow us to study the distance between the representations of different types of input frames and observe the clustering structures formed by these representations. In the different visualizations, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. We investigate whether the DNN clusters speech representations in a way that corresponds to these linguistic categories and observe evidence that the DNN does indeed appear to learn structures that humans use to understand speech without being explicitly trained to do so.
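The pipeline described in the abstract (train a plain feed-forward network on vowel/consonant labels, then inspect the hidden-layer representations of the input frames) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the synthetic 39-dimensional "frames", the layer sizes, the learning rate, the use of a PCA projection as the 2-D visualization step, and all function names.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random (He-style) weights for a plain feed-forward net. Sizes are hypothetical."""
    return [(rng.normal(0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the activations of every hidden layer and the softmax output."""
    hidden, h = [], x
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)  # ReLU hidden layer
        hidden.append(h)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return hidden, probs

def train(params, x, y, lr=0.1, epochs=200):
    """Full-batch gradient descent on the cross-entropy loss."""
    n = x.shape[0]
    onehot = np.eye(2)[y]
    for _ in range(epochs):
        hidden, probs = forward(params, x)
        inputs = [x] + hidden            # input fed to each layer
        grad = (probs - onehot) / n      # gradient w.r.t. the output logits
        for i in reversed(range(len(params))):
            w, b = params[i]
            gw, gb = inputs[i].T @ grad, grad.sum(axis=0)
            if i > 0:
                grad = (grad @ w.T) * (inputs[i] > 0)  # backprop through ReLU
            params[i] = (w - lr * gw, b - lr * gb)
    return params

def pca_2d(acts):
    """Project activations onto their first two principal components."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def centroid_distance(points, labels):
    """Distance between the 2-D centroids of the two classes."""
    c0 = points[labels == 0].mean(axis=0)
    c1 = points[labels == 1].mean(axis=0)
    return float(np.linalg.norm(c0 - c1))

# Synthetic stand-in for acoustic frames: label 0 = consonant, 1 = vowel (assumed).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
frames = rng.normal(0, 1, (400, 39)) + labels[:, None] * 1.0

params = train(init_mlp([39, 32, 32, 2], rng), frames, labels)
hidden, probs = forward(params, frames)
accuracy = float((probs.argmax(axis=1) == labels).mean())

# One 2-D projection per hidden layer; these are the coordinates one would plot.
projections = [pca_2d(acts) for acts in hidden]
for depth, coords in enumerate(projections, start=1):
    print(depth, coords.shape, centroid_distance(coords, labels))
```

To mirror the analysis in the abstract, the same per-layer coordinates would be coloured by other labelings of the frames (phoneme class, manner of articulation, place of articulation) rather than only by the vowel/consonant training label.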
Pages: 194-205
Page count: 12