Keyword Spotting in Continuous Speech Using Spectral and Prosodic Information Fusion

Cited by: 3

Authors:

Pandey, Laxmi [1]

Hegde, Rajesh M. [1]

Affiliation:

[1] Indian Inst Technol Kanpur, Dept Elect Engn, Kanpur, Uttar Pradesh, India

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2019 / Vol. 38 / No. 06

Keywords:

Deep denoising autoencoder; Keyword spotting; Hidden Markov models; Deep neural network; Speech recognition;

DOI:

10.1007/s00034-018-0990-6

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Keyword spotting in continuous speech is a challenging problem and has relevance in applications like audio indexing and music retrieval. In this work, the problem of keyword spotting is addressed by utilizing the complementary information present in the spectral and prosodic features of the speech signal. A thorough analysis of the complementary information is performed on a large Hindi language database developed for this purpose. Phonetic and prosodic distribution analysis is performed toward this end, using canonical correlation and the Student's t-distance function. Motivated by these analyses, novel methods for spectral and prosodic information fusion that optimize a combined error function are proposed. The fusion methods are developed at both the feature level and the model level. Improved syllable sequence prediction and keyword spotting performance are obtained using these methods when compared to conventional methods of keyword spotting. Additionally, to enable comparison with state-of-the-art deep learning-based methods, a novel method for improved syllable sequence prediction using deep denoising autoencoders is proposed. The performance of the methods proposed in this work is evaluated for keyword spotting using a syllable sliding protocol over a large Hindi database. Reasonable performance improvements are noted from the experimental results on syllable sequence prediction, keyword spotting, and audio retrieval.


Pages: 2767-2791

Page count: 25
