Temporal Alignment for Deep Neural Networks

Cited by: 0
Authors
Lin, Payton [1 ]
Lyu, Dau-Cheng [2 ]
Chang, Yun-Fan [1 ]
Tsao, Yu [1 ]
Affiliations
[1] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei, Taiwan
[2] ASUS Headquarters, Adv Technol Div, Kaohsiung, Taiwan
Source
2015 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP) | 2015
Keywords
alignment; temporal features; deep neural networks; hidden Markov models; speech recognition
DOI
(not available)
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Alternative features were derived from an extracted temporal envelope bank (TBANK). These simplified temporal representations were investigated in alignment procedures to generate frame-level training labels for deep neural networks (DNNs). TBANK features improved temporal alignments both for supervised training and for context-dependent tree building.
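The abstract does not specify the TBANK extraction pipeline. A minimal sketch of a temporal envelope bank under common assumptions (log-spaced bandpass filters, Hilbert envelopes, per-frame averaging; the band edges, filter order, and frame parameters below are illustrative, not from the paper):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def tbank(signal, fs=16000, n_bands=8, fmin=100.0, fmax=7000.0,
          frame_len=400, hop=160):
    """Temporal envelope bank: per-band envelopes averaged per frame.

    Band edges are spaced logarithmically between fmin and fmax
    (an assumption; the paper does not give the filterbank design).
    Returns an (n_frames, n_bands) feature matrix.
    """
    edges = np.logspace(np.log10(fmin), np.log10(fmax), n_bands + 1)
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.zeros((n_frames, n_bands))
    for b in range(n_bands):
        # Zero-phase bandpass filtering for band b.
        bb, ab = butter(4, [edges[b], edges[b + 1]], btype='band', fs=fs)
        band = filtfilt(bb, ab, signal)
        # Temporal envelope via the analytic-signal magnitude.
        env = np.abs(hilbert(band))
        # Average the envelope over each analysis frame.
        for t in range(n_frames):
            feats[t, b] = env[t * hop : t * hop + frame_len].mean()
    return feats
```

Frame-averaged envelopes like these could then be fed to a standard GMM-HMM forced-alignment pass to produce the frame-level DNN training labels the abstract describes.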
Pages: 108-112 (5 pages)