Fusion of Spectral and Prosodic Information using Combined Error Optimization for Keyword Spotting

Cited by: 0

Authors:

Pandey, Laxmi [1]

Chaudhary, Kuldeep [1]

Hegde, Rajesh M. [1]

Affiliations:

[1] Indian Inst Technol, Dept Elect Engn, Kanpur, Uttar Pradesh, India

Source:

2017 TWENTY-THIRD NATIONAL CONFERENCE ON COMMUNICATIONS (NCC) | 2017

Keywords:

HIDDEN MARKOV-MODELS; SPEECH RECOGNITION;

DOI:

Not available

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Incorporating prosodic information with spectral information at the feature level is challenging. In this paper, a method for feature-level fusion of spectral and prosodic information is proposed. A pitch contour is first extracted from the frame-blocked segments of the speech signal. The resulting segments are labeled as high-pitch or low-pitch segments. Both spectral and prosodic features are extracted from each segment class. An integrated feature set is obtained by concatenating the spectral and prosodic features from each of these classes. In the next stage of fusion, the high-pitch and low-pitch labeled features are further combined using a joint error optimization approach. This approach assumes that the mean of the high-pitch segments can be obtained by an affine transformation of the mean of the low-pitch segments. The parameters of the affine transformation are obtained by gradient descent. The final integrated feature set is obtained after normalizing both resulting sets of features. This integrated feature set is used in a hidden Markov model (HMM) framework along with a novel sliding syllable protocol for keyword spotting. Keyword spotting experiments are conducted on a Hindi language database developed for this purpose. Experiments on keyword recognition and keyword spotting are conducted to evaluate the performance of the proposed fusion method. Experimental results in terms of word error rate (WER) and receiver operating characteristics indicate a reasonable improvement over the use of a single feature set such as MFCC.
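
The second fusion stage in the abstract (an affine map from low-pitch means to high-pitch means, fitted by gradient descent) can be illustrated with a minimal sketch. This is not the authors' implementation. The median pitch threshold, the diagonal form of the transform (one scale and one offset per feature dimension), the squared-error objective, the learning rate, and the function names `label_by_pitch` and `fit_affine` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from statistics import median


def label_by_pitch(pitch_hz):
    """Label each frame 'high' or 'low' relative to the median pitch.

    The median threshold is an assumption; the abstract does not state one.
    """
    threshold = median(pitch_hz)
    return ["high" if f0 > threshold else "low" for f0 in pitch_hz]


def fit_affine(low_means, high_means, lr=0.05, steps=5000):
    """Fit high ~= a * low + b per feature dimension by gradient descent.

    low_means, high_means: equal-length lists of mean vectors (lists of floats).
    A diagonal transform is assumed here for brevity; a full matrix would
    follow the same pattern with matrix-valued gradients.
    """
    n, dim = len(low_means), len(low_means[0])
    a = [1.0] * dim  # start from the identity transform
    b = [0.0] * dim
    for _ in range(steps):
        for d in range(dim):
            grad_a = grad_b = 0.0
            for x, y in zip(low_means, high_means):
                err = a[d] * x[d] + b[d] - y[d]  # residual of the affine map
                grad_a += 2.0 * err * x[d] / n   # d(mean squared error)/da
                grad_b += 2.0 * err / n          # d(mean squared error)/db
            a[d] -= lr * grad_a
            b[d] -= lr * grad_b
    return a, b


# Toy data: the high-pitch means are exactly 2 * low + 1 in each dimension.
low = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
high = [[2.0 * v + 1.0 for v in row] for row in low]
scale, offset = fit_affine(low, high)
labels = label_by_pitch([110.0, 240.0, 95.0, 260.0])
```

On the toy data the fitted `scale` and `offset` approach 2 and 1 in each dimension, recovering the transform used to generate the high-pitch means. In a real system the mean vectors would come from the concatenated spectral and prosodic features of each pitch class rather than from synthetic values.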


Pages: 6
