LOW-LATENCY SPEECH ENHANCEMENT VIA SPEECH TOKEN GENERATION

Cited by: 1
Authors
Xue, Huaying [1 ]
Peng, Xiulian [1 ]
Lu, Yan [1 ]
Affiliations
[1] Microsoft Res Asia, Beijing, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024
Keywords
speech enhancement; speech generation; neural speech coding
DOI
10.1109/ICASSP48485.2024.10447774
Abstract
Existing deep-learning-based speech enhancement mainly employs a data-driven approach, leveraging large amounts of data covering a variety of noise types to remove noise from the noisy signal. However, this heavy dependence on data limits generalization to the unseen, complex noises of real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal: we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement that models clean speech with the acoustic codes of a neural speech codec and generates the speech codes auto-regressively, conditioned on past noisy frames. Moreover, we propose an explicit-alignment approach that aligns noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that generate speech codes in multiple stages, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality at low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
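The abstract's core idea, generating clean-speech codec tokens auto-regressively while each token is explicitly aligned to its noisy frame within a small latency budget, can be sketched as follows. This is an illustrative toy, not the paper's implementation: `predict_logits` stands in for the trained conditional model, and the constants (`CODEBOOK_SIZE`, `LOOKAHEAD`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 256   # size of the codec's code vocabulary (illustrative)
LOOKAHEAD = 1         # noisy-frame lookahead allowed by the latency budget

def predict_logits(noisy_ctx, past_codes):
    """Stand-in for the conditional generative model: returns a logit
    vector over the codebook given noisy context and past clean codes."""
    h = noisy_ctx.sum() + sum(past_codes)              # toy conditioning signal
    return rng.standard_normal(CODEBOOK_SIZE) + 0.01 * h

def enhance(noisy_frames):
    """Explicit alignment: token t is conditioned only on noisy frames
    up to t + LOOKAHEAD and on the previously generated tokens."""
    codes = []
    for t in range(len(noisy_frames)):
        ctx = np.concatenate(noisy_frames[: t + 1 + LOOKAHEAD])
        logits = predict_logits(ctx, codes)
        codes.append(int(np.argmax(logits)))           # greedy token choice
    return codes  # a real system decodes these with the codec decoder

frames = [rng.standard_normal(160) for _ in range(5)]  # five toy frames
tokens = enhance(frames)
print(len(tokens), all(0 <= c < CODEBOOK_SIZE for c in tokens))
```

The one-token-per-frame loop is what keeps the scheme single-stage and low-latency: no future frames beyond the fixed lookahead are ever consumed, in contrast to multi-stage code generators that revisit the whole utterance.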
Pages: 661 - 665 (5 pages)
Related papers
10 of 50 records shown
  • [1] EXPLORING TRADEOFFS IN MODELS FOR LOW-LATENCY SPEECH ENHANCEMENT
    Wilson, Kevin
    Chinen, Michael
    Thorpe, Jeremy
    Patton, Brian
    Hershey, John
    Saurous, Rif A.
    Skoglund, Jan
    Lyon, Richard F.
    2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), 2018, : 366 - 370
  • [2] A Survey on Low-Latency DNN-Based Speech Enhancement
    Drgas, Szymon
    SENSORS, 2023, 23 (03)
  • [3] Low-Latency Neural Speech Translation
    Niehues, Jan
    Ngoc-Quan Pham
    Thanh-Le Ha
    Sperber, Matthias
    Waibel, Alex
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1293 - 1297
  • [4] Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks
    Romaniuk, Michal
    Masztalski, Piotr
    Piaskowski, Karol
    Matuszewski, Mateusz
    INTERSPEECH 2020, 2020, : 3296 - 3300
  • [5] DPSNN: spiking neural network for low-latency streaming speech enhancement
    Sun, Tao
    Bohte, Sander
    NEUROMORPHIC COMPUTING AND ENGINEERING, 2024, 4 (04)
  • [6] LOW-LATENCY DEEP CLUSTERING FOR SPEECH SEPARATION
    Wang, Shanshan
    Naithani, Gaurav
    Virtanen, Tuomas
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 76 - 80
  • [7] Dynamic Transcription for Low-latency Speech Translation
    Niehues, Jan
    Nguyen, Thai Son
    Cho, Eunah
    Ha, Thanh-Le
    Kilgour, Kevin
    Mueller, Markus
    Sperber, Matthias
    Stueker, Sebastian
    Waibel, Alex
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2513 - 2517
  • [8] Low-latency monaural speech enhancement with deep filter-bank equalizer
    Zheng, Chengshi
    Liu, Wenzhe
    Li, Andong
    Ke, Yuxuan
    Li, Xiaodong
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2022, 151 (05): 3291 - 3304
  • [9] Implementation of low-latency electrolaryngeal speech enhancement based on multi-task CLDNN
    Kobayashi, Kazuhiro
    Toda, Tomoki
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 396 - 400
  • [10] Amortized Neural Networks for Low-Latency Speech Recognition
    Macoskey, Jonathan
    Strimel, Grant P.
    Su, Jinru
    Rastrow, Ariya
    INTERSPEECH 2021, 2021, : 4558 - 4562