End-to-End ASR-Free Keyword Search From Speech

Cited by: 70

Authors:

Audhkhasi, Kartik [1]

Rosenberg, Andrew [1]

Sethy, Abhinav [1]

Ramabhadran, Bhuvana [1]

Kingsbury, Brian [1]

Affiliation:

[1] IBM Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA

Source:

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING | 2017 / Vol. 11 / No. 8

Keywords:

End-to-end systems; neural networks; keyword search; automatic speech recognition

DOI:

10.1109/JSTSP.2017.2759726

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic and Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

Conventional keyword search (KWS) systems for speech databases match the input text query to the set of word hypotheses generated by an automatic speech recognition (ASR) system from utterances in the database. Hence, such KWS systems attempt to solve the complex problem of ASR as a precursor. Training an ASR system itself is a time-consuming process requiring transcribed speech data. Our prior work presented an ASR-free end-to-end system that needed minimal supervision and trained significantly faster than an ASR-based KWS system. The ASR-free KWS system consisted of three subsystems. The first subsystem was a recurrent neural network based acoustic encoder that extracted a finite-dimensional embedding of the speech utterance. The second subsystem was a query encoder that produced an embedding of the input text query. The acoustic and query embeddings were input to a feedforward neural network that predicted whether the query occurred in the acoustic utterance or not. This paper extends our prior work in several ways. First, we significantly improve upon our previous ASR-free KWS results by nearly 20% relative through improvements to the acoustic encoder. Next, we show that it is possible to train the acoustic encoder on languages other than the language of interest with only a small drop in KWS performance. Finally, we attempt to predict the location of the detected keywords by training a location-sensitive KWS network.
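The three subsystems named in the abstract (a recurrent acoustic encoder, a query encoder, and a feedforward network that scores the pair) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the tiny dimensions, the use of the final RNN hidden state as the utterance embedding, and the mean-of-character-embeddings query encoder. The weights are random and untrained, so the score shows only the data flow, not a meaningful detection.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def acoustic_encoder(frames, w_in, w_rec):
    """Plain tanh RNN over feature frames.

    Assumption: the final hidden state serves as the fixed-size
    embedding of the whole utterance.
    """
    h = [0.0] * len(w_rec)
    for frame in frames:
        a = matvec(w_in, frame)
        b = matvec(w_rec, h)
        h = [math.tanh(x + y) for x, y in zip(a, b)]
    return h


def query_encoder(query, char_table):
    """Assumed stand-in encoder: mean of per-character embeddings."""
    dim = len(next(iter(char_table.values())))
    vecs = [char_table[c] for c in query if c in char_table]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def kws_score(acoustic_emb, query_emb, w_hidden, w_out):
    """Feedforward net on the concatenated embeddings.

    Returns a value in (0, 1), read as the probability that the
    query occurs in the utterance.
    """
    x = acoustic_emb + query_emb  # list concatenation
    hidden = [math.tanh(z) for z in matvec(w_hidden, x)]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def demo(seed=0):
    """Run one utterance/query pair through all three subsystems."""
    rng = random.Random(seed)
    feat_dim, emb_dim, hid_dim = 4, 3, 5  # hypothetical toy sizes
    w_in = make_matrix(emb_dim, feat_dim, rng)
    w_rec = make_matrix(emb_dim, emb_dim, rng)
    char_table = {
        c: [rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
        for c in "abcdefghijklmnopqrstuvwxyz"
    }
    w_hidden = make_matrix(hid_dim, 2 * emb_dim, rng)
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(hid_dim)]
    # Ten random frames stand in for real acoustic features.
    frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(10)]
    acoustic_emb = acoustic_encoder(frames, w_in, w_rec)
    query_emb = query_encoder("hello", char_table)
    return kws_score(acoustic_emb, query_emb, w_hidden, w_out)
```

Calling `demo()` returns a single score between 0 and 1. In a trained system of this shape, all three sets of weights would presumably be learned jointly from utterance/query pairs labeled with whether the query occurs, which is what removes the need for an ASR system.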

Citation

Pages: 1351-1359

Page count: 9

