LOW-LATENCY SPEECH ENHANCEMENT VIA SPEECH TOKEN GENERATION

Cited by: 1
Authors
Xue, Huaying [1 ]
Peng, Xiulian [1 ]
Lu, Yan [1 ]
Affiliations
[1] Microsoft Res Asia, Beijing, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024
Keywords
speech enhancement; speech generation; neural speech coding
DOI
10.1109/ICASSP48485.2024.10447774
Abstract
Existing deep-learning-based speech enhancement mainly employs a data-driven approach, leveraging large amounts of data covering a variety of noise types to remove noise from the noisy signal. However, this heavy dependence on data limits generalization to the unseen, complex noises of real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal: we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement that models clean speech with the acoustic codes of a neural speech codec and generates the speech codes auto-regressively, conditioned on past noisy frames. Moreover, we propose an explicit-alignment approach that aligns noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that generate speech codes in multiple stages, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality at low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
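The abstract's core idea, generating clean-speech codec tokens auto-regressively while each token is explicitly aligned to its noisy frame within a small latency budget, can be sketched as follows. This is an illustrative toy, not the paper's implementation: `predict_logits` stands in for the trained conditional model, and the constants (`CODEBOOK_SIZE`, `LOOKAHEAD`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 256   # size of the codec's code vocabulary (illustrative)
LOOKAHEAD = 1         # noisy-frame lookahead allowed by the latency budget

def predict_logits(noisy_ctx, past_codes):
    """Stand-in for the conditional generative model: returns a logit
    vector over the codebook given noisy context and past clean codes."""
    h = noisy_ctx.sum() + sum(past_codes)              # toy conditioning signal
    return rng.standard_normal(CODEBOOK_SIZE) + 0.01 * h

def enhance(noisy_frames):
    """Explicit alignment: token t is conditioned only on noisy frames
    up to t + LOOKAHEAD and on the previously generated tokens."""
    codes = []
    for t in range(len(noisy_frames)):
        ctx = np.concatenate(noisy_frames[: t + 1 + LOOKAHEAD])
        logits = predict_logits(ctx, codes)
        codes.append(int(np.argmax(logits)))           # greedy token choice
    return codes  # a real system decodes these with the codec decoder

frames = [rng.standard_normal(160) for _ in range(5)]  # five toy frames
tokens = enhance(frames)
print(len(tokens), all(0 <= c < CODEBOOK_SIZE for c in tokens))
```

The one-token-per-frame loop is what keeps the scheme single-stage and low-latency: no future frames beyond the fixed lookahead are ever consumed, in contrast to multi-stage code generators that revisit the whole utterance.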
Pages: 661 - 665 (5 pages)
Related papers
10 of 50 records shown
  • [1] EXPLORING TRADEOFFS IN MODELS FOR LOW-LATENCY SPEECH ENHANCEMENT
    Wilson, Kevin
    Chinen, Michael
    Thorpe, Jeremy
    Patton, Brian
    Hershey, John
    Saurous, Rif A.
    Skoglund, Jan
    Lyon, Richard F.
    2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), 2018, : 366 - 370
  • [2] A Survey on Low-Latency DNN-Based Speech Enhancement
    Drgas, Szymon
    SENSORS, 2023, 23 (03)
  • [3] Low-Latency Neural Speech Translation
    Niehues, Jan
    Ngoc-Quan Pham
    Thanh-Le Ha
    Sperber, Matthias
    Waibel, Alex
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1293 - 1297
  • [4] Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks
    Romaniuk, Michal
    Masztalski, Piotr
    Piaskowski, Karol
    Matuszewski, Mateusz
    INTERSPEECH 2020, 2020, : 3296 - 3300
  • [5] DPSNN: spiking neural network for low-latency streaming speech enhancement
    Sun, Tao
    Bohte, Sander
    NEUROMORPHIC COMPUTING AND ENGINEERING, 2024, 4 (04)
  • [6] LOW-LATENCY DEEP CLUSTERING FOR SPEECH SEPARATION
    Wang, Shanshan
    Naithani, Gaurav
    Virtanen, Tuomas
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 76 - 80
  • [7] Dynamic Transcription for Low-latency Speech Translation
    Niehues, Jan
    Nguyen, Thai Son
    Cho, Eunah
    Ha, Thanh-Le
    Kilgour, Kevin
    Mueller, Markus
    Sperber, Matthias
    Stueker, Sebastian
    Waibel, Alex
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2513 - 2517
  • [8] Low-latency monaural speech enhancement with deep filter-bank equalizer
    Zheng, Chengshi
    Liu, Wenzhe
    Li, Andong
    Ke, Yuxuan
    Li, Xiaodong
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2022, 151 (05): 3291 - 3304
  • [9] Implementation of low-latency electrolaryngeal speech enhancement based on multi-task CLDNN
    Kobayashi, Kazuhiro
    Toda, Tomoki
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 396 - 400
  • [10] Amortized Neural Networks for Low-Latency Speech Recognition
    Macoskey, Jonathan
    Strimel, Grant P.
    Su, Jinru
    Rastrow, Ariya
    INTERSPEECH 2021, 2021, : 4558 - 4562