Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Times cited: 1
Authors
Liang, Xingwei [1 ,2 ]
Zhang, Zehua [3 ]
Xu, Ruifeng [2 ]
Affiliations
[1] Konka Grp Co Ltd, Shenzhen, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen, Peoples R China
[3] Harbin Inst Technol, Sch Elect & Informat Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention; RECOGNITION;
DOI
10.1186/s13636-023-00293-8
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the assistant. It involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches to this task develop separate KWS and SV systems. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while effectively exploiting information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance under challenging conditions such as noisy environments and short-duration speech, and to improve model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that couples the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the intermediate layers while preventing vanishing gradients. On the well-off test set, the proposed model delivers outstanding performance, improving over the well-known SV model ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time delay neural network) by 0.2% in equal error rate (EER) and 0.023 in minimum detection cost function (minDCF), and over the advanced KWS model Convmixer by 2.28% in accuracy (Acc).
Pages: 16
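
To make the architecture outlined in the abstract more concrete, below is a minimal, illustrative sketch in PyTorch of a multi-task model in which a shared encoder feeds a KWS branch and an SV branch, and the two branches exchange information through cross-attention before their task heads. This is not the authors' implementation: the layer sizes, the use of nn.MultiheadAttention and GRU encoders, the mean pooling, and the head designs are all assumptions made purely for illustration.

# Minimal sketch (not the authors' code): shared encoder + two task branches
# that cross-attend to each other's frame-level features. All dimensions,
# layer choices, and head designs are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """One branch attends to the other branch's frame-level features."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection + norm


class MultiTaskKwsSvSketch(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 256,
                 n_keywords: int = 10, spk_emb_dim: int = 192):
        super().__init__()
        # Shared frame-level encoder (stand-in for the stacked shared encoder).
        self.shared = nn.Sequential(
            nn.Linear(n_mels, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        # Task-specific sequence encoders.
        self.kws_enc = nn.GRU(dim, dim, batch_first=True)
        self.sv_enc = nn.GRU(dim, dim, batch_first=True)
        # Cross-attention in both directions (stand-in for the DCA module).
        self.kws_from_sv = CrossAttentionBlock(dim)
        self.sv_from_kws = CrossAttentionBlock(dim)
        # Task heads: keyword posteriors and a fixed-size speaker embedding.
        self.kws_head = nn.Linear(dim, n_keywords)
        self.sv_head = nn.Linear(dim, spk_emb_dim)

    def forward(self, feats):                     # feats: (batch, frames, n_mels)
        shared = self.shared(feats)
        kws_seq, _ = self.kws_enc(shared)
        sv_seq, _ = self.sv_enc(shared)
        kws_seq = self.kws_from_sv(kws_seq, sv_seq)
        sv_seq = self.sv_from_kws(sv_seq, kws_seq)
        kws_logits = self.kws_head(kws_seq.mean(dim=1))   # utterance-level pooling
        spk_embedding = self.sv_head(sv_seq.mean(dim=1))
        return kws_logits, spk_embedding


if __name__ == "__main__":
    model = MultiTaskKwsSvSketch()
    dummy = torch.randn(2, 120, 80)   # 2 utterances, 120 frames of 80-dim log-mels
    logits, emb = model(dummy)
    print(logits.shape, emb.shape)    # torch.Size([2, 10]) torch.Size([2, 192])

In a sketch like this, the KWS loss (e.g., cross-entropy over keyword classes) and an SV loss computed on the speaker embedding would be combined with task weights during joint training; the weighting and training details are omitted here because the record above does not specify them.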