Feature Frame Stacking in RNN-based Tandem ASR Systems - Learned vs. Predefined Context

Cited by: 0
Authors
Woellmer, Martin [1 ]
Schuller, Bjoern [1 ]
Rigoll, Gerhard [1 ]
Affiliations
[1] Tech Univ Munich, Inst Human Machine Commun, D-80290 Munich, Germany
Source
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5 | 2011
Keywords
context modeling; long short-term memory; recurrent neural networks; automatic speech recognition; BIDIRECTIONAL LSTM; NETWORKS;
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As phoneme recognition is known to profit from techniques that consider contextual information, neural networks applied in Tandem automatic speech recognition (ASR) systems usually employ some form of context modeling. While approaches based on multi-layer perceptrons or recurrent neural networks (RNN) can model a predefined amount of context by simultaneously processing a stacked sequence of successive feature vectors, bidirectional Long Short-Term Memory (BLSTM) networks have been shown to be well suited for incorporating a self-learned amount of context for phoneme prediction. In this paper, we evaluate combinations of BLSTM modeling and frame stacking to determine the most efficient method for exploiting context in RNN-based Tandem systems. Using the COSINE corpus and our recently introduced multi-stream BLSTM-HMM decoder, we provide empirical evidence for the intuition that BLSTM networks render frame stacking redundant, while conventional RNNs profit from predefined feature-level context.
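The "frame stacking" the abstract contrasts with learned BLSTM context is the standard trick of concatenating each feature vector with a fixed window of its neighbours before feeding it to the network. A minimal sketch of that predefined-context scheme is below; the function name, the window size, and the edge-replication padding are illustrative assumptions, not details taken from the paper:

```python
def stack_frames(frames, context=4):
    """Concatenate each frame with its +/- `context` neighbours,
    replicating edge frames, so every time step becomes a single
    (2*context + 1) * dim super-vector of predefined context."""
    T = len(frames)
    stacked = []
    for t in range(T):
        window = []
        for offset in range(-context, context + 1):
            # clamp the index so boundary frames are padded by replication
            idx = min(max(t + offset, 0), T - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Toy example: 3 frames of 2-dim features, one frame of context on each
# side, giving 6-dim stacked vectors.
feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
out = stack_frames(feats, context=1)
```

A BLSTM, by contrast, receives the unstacked per-frame features and decides through its memory cells how much past and future context to use, which is why the paper asks whether stacking still helps once a BLSTM is in the loop.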
Pages: 1240-1243 (4 pages)