Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Cited by: 1

Authors:

Liang, Xingwei [1,2]

Zhang, Zehua [3]

Xu, Ruifeng [2]

Affiliations:

[1] Konka Grp Co Ltd, Shenzhen, Peoples R China

[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen, Peoples R China

[3] Harbin Inst Technol, Sch Elect & Informat Engn, Shenzhen, Peoples R China

Source:

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2023 / Vol. 2023 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention; RECOGNITION;

DOI:

10.1186/s13636-023-00293-8

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the voice assistant. Personalized voice triggering involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches to this task include developing KWS and SV systems separately. This paper proposes a single system called the multi-task deep cross-attention network (MTCANet) that simultaneously performs KWS and SV, while effectively utilizing information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to enhance performance in challenging conditions such as noisy environments, short-duration speech, and model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module to integrate KWS and SV tasks, a multi-layer stacked shared encoder (SE) to reduce the impact of noise on the recognition rate, and soft attention (SA) modules to allow the model to focus on pertinent information in the middle layer while preventing gradient vanishing. Our proposed model demonstrates outstanding performance in the well-off test set, improving by 0.2%, 0.023, and 2.28% over the well-known SV model emphasized channel attention, propagation, and aggregation in time delay neural network (ECAPA-TDNN) and the advanced KWS model Convmixer in terms of equal error rate (EER), minimum detection cost function (minDCF), and accuracy (Acc), respectively.
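The equal error rate (EER) cited in the abstract is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from verification trial scores follows. The function name `equal_error_rate` and the example scores are hypothetical illustrations, not the authors' evaluation code, and the threshold sweep is a simple approximation rather than an interpolated estimate.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: scores of genuine (same-speaker) trials (hypothetical input).
    nontarget_scores: scores of impostor (different-speaker) trials.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Made-up scores: one genuine trial (0.4) falls below one impostor trial (0.6).
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%}")
```

With finite trial lists the two error rates rarely cross exactly, which is why this sketch reports the average of the two rates at the closest threshold.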


Pages: 16
