Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Cited by: 1
Authors
Liang, Xingwei [1, 2]
Zhang, Zehua [3]
Xu, Ruifeng [2]
Affiliations
[1] Konka Grp Co Ltd, Shenzhen, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen, Peoples R China
[3] Harbin Inst Technol, Sch Elect & Informat Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention; Recognition
DOI
10.1186/s13636-023-00293-8
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Personalized voice triggering is a key technology in voice assistants and is the first step a user takes to activate the assistant. It involves two tasks: keyword spotting (KWS) and speaker verification (SV). Conventional approaches develop the KWS and SV systems separately. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while exploiting information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance under challenging conditions such as noisy environments and short-duration speech, and to improve model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that integrates the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the middle layer while preventing gradient vanishing. On the well-off test set, the proposed model improves equal error rate (EER) by 0.2% and minimum detection cost function (minDCF) by 0.023 over the well-known SV model emphasized channel attention, propagation, and aggregation in time delay neural network (ECAPA-TDNN), and improves accuracy (Acc) by 2.28% over the advanced KWS model Convmixer.
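The abstract reports SV results as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of that metric only, and not the paper's evaluation code, the following approximates EER with a simple threshold sweep. The function name `equal_error_rate` and the toy score lists are hypothetical, and the record does not describe the authors' scoring protocol.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials (assumed input).
    nontarget_scores: similarity scores for different-speaker trials (assumed input).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores for illustration only; not data from the paper.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
```

With few trials the two error rates may never cross exactly, so the sketch returns their average at the closest threshold; production toolkits typically interpolate along the error curve instead.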
Pages: 16