Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Cited by: 1
Authors
Liang, Xingwei [1 ,2 ]
Zhang, Zehua [3 ]
Xu, Ruifeng [2 ]
Affiliations
[1] Konka Grp Co Ltd, Shenzhen, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen, Peoples R China
[3] Harbin Inst Technol, Sch Elect & Informat Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention; RECOGNITION;
DOI
10.1186/s13636-023-00293-8
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the assistant. It involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches develop KWS and SV systems separately. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while effectively sharing information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance under challenging conditions such as noisy environments, short-duration speech, and limited model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that integrates the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the intermediate layers while preventing vanishing gradients. On the well-off test set, the proposed model improves equal error rate (EER) by 0.2%, minimum detection cost function (minDCF) by 0.023, and accuracy (Acc) by 2.28% over the well-known SV model ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time delay neural network) and the advanced KWS model ConvMixer, respectively.
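The abstract does not give the exact formulation of the DCA module; the following minimal single-head cross-attention sketch (NumPy; all names, dimensions, and data are hypothetical, not taken from the paper) only illustrates the general idea of each task's branch attending to the other branch's features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: queries come from one task's
    frame-level features, keys/values from the other task's, so each
    branch can enrich its representation with the other's information."""
    d_k = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)   # (Tq, Tkv)
    weights = softmax(scores, axis=-1)             # rows sum to 1
    return weights @ kv_feats                      # (Tq, d)

# Toy frame-level features for the two branches
rng = np.random.default_rng(0)
kws = rng.standard_normal((5, 8))   # KWS branch: 5 frames, dim 8
sv  = rng.standard_normal((7, 8))   # SV branch: 7 frames, dim 8

kws_enriched = cross_attention(kws, sv)  # KWS queries attend to SV features
sv_enriched  = cross_attention(sv, kws)  # SV queries attend to KWS features
print(kws_enriched.shape, sv_enriched.shape)  # → (5, 8) (7, 8)
```

Note the output keeps each branch's own time resolution while mixing in the other branch's content; the actual DCA module presumably stacks learned projections and multiple layers around this core operation.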
Pages: 16