LOW-LATENCY SPEECH ENHANCEMENT VIA SPEECH TOKEN GENERATION

Cited by: 2
Authors
Xue, Huaying [1]
Peng, Xiulian [1]
Lu, Yan [1]
Affiliations
[1] Microsoft Res Asia, Beijing, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
speech enhancement; speech generation; neural speech coding;
DOI
10.1109/ICASSP48485.2024.10447774
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Existing deep-learning-based speech enhancement mainly employs a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from the noisy signal. However, the high dependence on data limits its generalization to unseen complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by the acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach that aligns noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that use multiple stages to generate speech codes, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
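The generation loop described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the names `generate_tokens` and `toy_predictor`, the one-token-per-frame pairing (one possible reading of "explicit alignment"), and the codebook size are all assumptions made for illustration, and the predictor is a trivial stand-in for a neural model.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def generate_tokens(
    noisy_frames: Sequence[Frame],
    predict_next: Callable[[Sequence[Frame], List[int]], int],
    bos_token: int = 0,
) -> List[int]:
    """Emit one codec token per noisy frame, in order (hypothetical sketch).

    At step t the predictor sees only noisy frames 0..t and the tokens
    generated before step t, so no future input is needed (low latency).
    """
    tokens: List[int] = []
    for t in range(len(noisy_frames)):
        past_frames = noisy_frames[: t + 1]   # causal conditioning on noisy input
        past_tokens = [bos_token] + tokens    # auto-regressive history
        tokens.append(predict_next(past_frames, past_tokens))
    return tokens


def toy_predictor(
    past_frames: Sequence[Frame],
    past_tokens: List[int],
    codebook_size: int = 8,
) -> int:
    """Placeholder for a learned model: quantize the latest frame's mean magnitude."""
    latest = past_frames[-1]
    level = sum(abs(x) for x in latest) / len(latest)
    level = min(max(level, 0.0), 0.999)       # clip into [0, 1)
    return int(level * codebook_size)         # index into an assumed codebook


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
    print(generate_tokens(frames, toy_predictor))
```

Because each token is tied to exactly one input frame, the output length always matches the input length, and changing a later frame cannot alter tokens that were already emitted. In a real system the generated tokens would then be passed to the codec decoder to synthesize the clean waveform.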
Pages: 661-665
Number of pages: 5