3-D CNN MODELS FOR FAR-FIELD MULTI-CHANNEL SPEECH RECOGNITION

Cited by: 0
Authors
Ganapathy, Sriram [1]
Peddinti, Vijayaditya [2]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Inc, Mountain View, CA USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018
Keywords
Far-field speech recognition; 3D CNN modeling; Multi-party conversational speech; NEURAL-NETWORKS; CORPUS;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Automatic speech recognition (ASR) in far-field reverberant environments, especially under natural conversational multi-party speech conditions, is challenging even with state-of-the-art recognition methodologies. The two main issues are artifacts in the signal due to reverberation and the presence of multiple speakers. In this paper, we propose a three-dimensional (3-D) convolutional neural network (CNN) architecture for multi-channel far-field ASR. This architecture processes the time, frequency, and channel dimensions of the input spectrogram to learn representations using convolutional layers. Experiments are performed on the REVERB challenge LVCSR task and the Augmented Multi-party Interaction (AMI) LVCSR task using the array microphone recordings. The proposed method improves over a baseline system that applies beamforming to the multi-channel audio followed by a conventional 2-D CNN framework (an absolute improvement of 1.1% over the beamformed baseline system on the AMI dataset).
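The core operation named in the abstract, a convolution that spans the microphone-channel, time, and frequency axes at once, can be illustrated with a minimal pure-Python sketch. This is not the authors' architecture: the function names (`conv3d_valid`, `relu`), the 4 x 5 x 6 input, and the 2 x 3 x 3 averaging kernel are all hypothetical choices made for illustration, and a real system would use learned kernels in a deep-learning framework.

```python
# Minimal sketch of one 3-D convolution (cross-correlation, "valid" padding)
# over a multi-channel spectrogram. All shapes and values are illustrative
# assumptions, not taken from the paper.


def conv3d_valid(x, k):
    """Slide kernel k over input x along all three axes.

    x: nested list indexed [mic][time][freq]
    k: nested list indexed [mic][time][freq], no larger than x on any axis
    Returns a nested list of shape (C-c+1, T-t+1, F-f+1).
    """
    C, T, F = len(x), len(x[0]), len(x[0][0])
    c, t, f = len(k), len(k[0]), len(k[0][0])
    out = []
    for i in range(C - c + 1):
        plane = []
        for j in range(T - t + 1):
            row = []
            for m in range(F - f + 1):
                acc = 0.0
                # Sum over the kernel's extent in mic, time and frequency.
                for a in range(c):
                    for b in range(t):
                        for d in range(f):
                            acc += x[i + a][j + b][m + d] * k[a][b][d]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def relu(v):
    """Element-wise rectifier on a 3-D nested list."""
    return [[[max(0.0, e) for e in row] for row in plane] for plane in v]


# Hypothetical input: 4 microphones, 5 frames, 6 frequency bins, all ones.
spec = [[[1.0] * 6 for _ in range(5)] for _ in range(4)]
# Hypothetical 2x3x3 averaging kernel spanning mics, frames and bins.
kernel = [[[1.0 / 18] * 3 for _ in range(3)] for _ in range(2)]

feat = relu(conv3d_valid(spec, kernel))
shape = (len(feat), len(feat[0]), len(feat[0][0]))
print(shape)  # prints (3, 3, 4)
```

In contrast to the beamformed 2-D baseline described above, which collapses the microphone axis before the network sees the signal, a kernel of this kind keeps that axis in the input, so the layer's filters can combine information across channels as well as across time and frequency.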
Pages: 5499-5503
Number of pages: 5