Open Vocabulary Keyword Spotting through Transfer Learning from Speech Synthesis

Times Cited: 0
Authors
Kesavaraj, V [1 ]
Vuppala, Anil [1 ]
Affiliations
[1] Int Inst Informat Technol Hyderabad, Speech Proc Lab, LTRC, Hyderabad, India
Source
2024 INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS, SPCOM 2024 | 2024
Keywords
Transfer learning; Text-to-Speech; Keyword spotting; Tacotron 2
DOI
10.1109/SPCOM60851.2024.10631637
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology; Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open-vocabulary keyword spotting rely on a shared embedding space produced by separate audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system: this knowledge transfer incorporates awareness of audio projections into the text representations derived from the text encoder. The performance of the proposed approach is compared with various baseline methods across four datasets. The robustness of the proposed model is evaluated across different word lengths and in an out-of-vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results show that, on the challenging LibriPhrase Hard dataset, the proposed approach outperforms the cross-modality correspondence detector (CMCD) method by 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER).
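The abstract reports results in terms of AUC and EER, the two standard metrics for keyword-spotting detectors. As a minimal sketch of how these are computed from a detector's confidence scores, the following uses standard definitions (trapezoidal area under the ROC curve; EER as the operating point where the false-accept rate equals the false-reject rate). The scores and labels are toy values for illustration, not data from the paper.

```python
# Hedged sketch: computing AUC and EER for a keyword-spotting detector
# from per-utterance confidence scores. Toy data, standard definitions.

def roc_points(scores, labels):
    """Sweep a threshold over the scores; return (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def eer_point(points):
    """ROC point closest to FPR == 1 - TPR (false accept == false reject)."""
    return min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))

# Toy detector outputs: 1 = keyword present, 0 = absent.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
pts = roc_points(scores, labels)
print(auc(pts))        # area under the ROC curve
print(eer_point(pts))  # (FPR, TPR) at the equal-error operating point
```

A real evaluation on a dataset such as LibriPhrase would apply the same computation to the model's per-phrase detection scores; here the exhaustive threshold sweep is fine for illustration, though libraries typically interpolate the EER between adjacent ROC points.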
Pages: 5