Deep Learning Assisted Time-Frequency Processing for Speech Enhancement on Drones

Cited by: 21
Authors
Wang, Lin [1]
Cavallaro, Andrea [1]
Affiliations
[1] Queen Mary Univ London, Ctr Intelligent Sensing, London E1 4NS, England
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2021, Vol. 5, No. 6
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); Innovate UK;
Keywords
Deep learning; Deep neural network (DNN); drone; ego-noise reduction; microphone array; SINGLE;
DOI
10.1109/TETCI.2020.3014934
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article fills the gap between the growing interest in signal processing based on Deep Neural Networks (DNN) and the new application of enhancing speech captured by microphones on a drone. In this context, the quality of the target sound is degraded significantly by the strong ego-noise from the rotating motors and propellers. We present the first work that integrates single-channel and multi-channel DNN-based approaches for speech enhancement on drones. We employ a DNN to estimate the ideal ratio masks at individual time-frequency bins, which are subsequently used to design three potential speech enhancement systems, namely single-channel ego-noise reduction (DNN-S), multi-channel beamforming (DNN-BF), and multi-channel time-frequency spatial filtering (DNN-TF). The main novelty lies in the proposed DNN-TF algorithm, which infers the noise-dominance probabilities at individual time-frequency bins from the DNN-estimated soft masks, and then incorporates them into a time-frequency spatial filtering framework for ego-noise reduction. By jointly exploiting the direction of arrival of the target sound, the time-frequency sparsity of the acoustic signals (speech and ego-noise) and the time-frequency noise-dominance probability, DNN-TF can suppress the ego-noise effectively in scenarios with very low signal-to-noise ratios (e.g. SNR lower than -15 dB), especially when the direction of the target sound is close to that of a source of the ego-noise. Experiments with real and simulated data show the advantage of DNN-TF over competing methods, including DNN-S, DNN-BF and the state-of-the-art time-frequency spatial filtering.
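The masking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common definition of the ideal ratio mask (speech power divided by speech-plus-noise power in each time-frequency bin) and uses random complex arrays as stand-ins for STFT coefficients; the function names, array sizes, and the `1 - mask` noise-dominance proxy are hypothetical and are not taken from the paper, which trains a DNN to estimate the mask rather than computing it from known speech and noise.

```python
import numpy as np


def ideal_ratio_mask(speech_stft, noise_stft, eps=1e-12):
    """Assumed IRM definition: per-bin speech power over speech-plus-noise power."""
    speech_power = np.abs(speech_stft) ** 2
    noise_power = np.abs(noise_stft) ** 2
    return speech_power / (speech_power + noise_power + eps)


def apply_mask(mixture_stft, mask):
    """Scale each time-frequency bin of the mixture by the mask; phase is unchanged."""
    return mask * mixture_stft


# Synthetic stand-ins for STFT coefficients (hypothetical sizes: 257 bins, 100 frames).
rng = np.random.default_rng(0)
shape = (257, 100)
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
# Noise is scaled up to mimic a low-SNR recording dominated by ego-noise.
noise = 5.0 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
mixture = speech + noise

mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask(mixture, mask)

# Hypothetical proxy only: a soft mask near 0 suggests a noise-dominated bin.
# The paper's actual inference of noise-dominance probabilities is not given here.
noise_dominance = 1.0 - mask
```

In a single-channel setting of this kind, the mask would come from a trained network applied to the noisy mixture, and the enhanced spectrogram would then be converted back to a waveform with an inverse STFT.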
Pages: 871 - 881
Number of pages: 11
Related papers
50 in total
  • [1] Joint Time-Frequency and Time Domain Learning for Speech Enhancement
    Tang, Chuanxin
    Luo, Chong
    Zhao, Zhiyuan
    Xie, Wenxuan
    Zeng, Wenjun
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 3816 - 3822
  • [2] Single channel speech enhancement via time-frequency dictionary learning
    Huang, Jianjun
    Zhang, Xiongwei
    Zhang, Yafei
    Zou, Xia
    Shengxue Xuebao/Acta Acustica, 2012, 37 (05) : 539 - 547
  • [3] Single channel speech enhancement via time-frequency dictionary learning
    Huang, Jianjun
    Zhang, Xiongwei
    Zhang, Yafei
    Zou, Xia
    Chinese Journal of Acoustics, 2013, 32 (01) : 90 - 102
  • [4] TIME-FREQUENCY ATTENTION FOR MONAURAL SPEECH ENHANCEMENT
    Zhang, Qiquan
    Song, Qi
    Ni, Zhaoheng
    Nicolson, Aaron
    Li, Haizhou
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7852 - 7856
  • [5] Neural speech enhancement in the time-frequency domain
    Volkmer, M
    2003 IEEE XIII WORKSHOP ON NEURAL NETWORKS FOR SIGNAL PROCESSING - NNSP'03, 2003, : 617 - 626
  • [6] Image processing for time-frequency speech analysis
    Benyoucef M.
    International Journal of Speech Technology, 2008, 11 (1) : 43 - 49
  • [7] Deep Speech Inpainting of Time-frequency Masks
    Kegler, Mikolaj
    Beckmann, Pierre
    Cernak, Milos
    INTERSPEECH 2020, 2020, : 3276 - 3280
  • [8] A Time-Frequency Attention Module for Neural Speech Enhancement
    Zhang, Qiquan
    Qian, Xinyuan
    Ni, Zhaoheng
    Nicolson, Aaron
    Ambikairajah, Eliathamby
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 462 - 475
  • [9] Integrated speech enhancement and coding in the time-frequency domain
    Drygajlo, A
    Carnero, B
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 1183 - 1186
  • [10] Adaptive time-frequency data fusion for speech enhancement
    Shi, G
    Aarabi, P
    Lazic, N
    FUSION 2003: PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE OF INFORMATION FUSION, VOLS 1 AND 2, 2003, : 394 - 399