Channel-Time-Frequency Attention Module for Improved Multi-Channel Speech Enhancement

Cited by: 0
Authors
Zeng, Xiao [1 ]
Wang, Mingjiang [1 ]
Affiliations
[1] Harbin Inst Technol, Key Lab Key Technol IoT Terminals, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Time-frequency analysis; Speech enhancement; Microphone arrays; Data mining; Adaptation models; Robustness; Noise measurement; Attention mechanisms; Acoustics; Multi-channel speech enhancement; beamforming; CTFA; deep learning;
DOI
Not available
CLC classification
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
Both spatial and tempo-spectral information are essential for multi-channel speech enhancement, a field that has gained significant popularity in recent years. While many studies improve feature-extraction capability through novel network architectures, these approaches often prioritize raw feature learning without fully addressing how the extracted features can be used effectively. In this work, we focus on the post-extraction features and introduce a Channel-Time-Frequency Attention (CTFA) module that allocates weights to the extracted features, improving feature utilization and enabling the model to focus on the most informative features. The CTFA module comprises three parallel attention branches (channel, time, and frequency) that jointly refine both spatial and tempo-spectral features. It promotes feature reuse by assigning greater weight to effective features, thereby improving the model's robustness. We incorporate the CTFA module into our previously proposed model and conduct an ablation study to evaluate its effectiveness. Extensive experimental results confirm the efficacy of the CTFA module, with the proposed method outperforming state-of-the-art baselines.
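The abstract describes three parallel attention branches that re-weight extracted features along the channel, time, and frequency axes. The paper's exact gating design is not given in this record, so the following is only a minimal NumPy sketch of that general idea, assuming a (channels, frames, frequency-bins) feature tensor and simple global-average-pool plus sigmoid gates per branch; the function name `ctfa_sketch` and all implementation details are illustrative, not the authors' implementation.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ctfa_sketch(x):
    """Illustrative channel-time-frequency attention over a (C, T, F) tensor.

    Each branch squeezes the other two axes by global average pooling,
    produces a sigmoid gate per element of its own axis, and re-weights
    the input; the three gated views are fused by averaging. This is a
    hypothetical stand-in for the CTFA module described in the abstract.
    """
    ch_gate = _sigmoid(x.mean(axis=(1, 2)))  # (C,) one gate per channel
    t_gate = _sigmoid(x.mean(axis=(0, 2)))   # (T,) one gate per frame
    f_gate = _sigmoid(x.mean(axis=(0, 1)))   # (F,) one gate per frequency bin
    # Broadcast each branch's gates over the full tensor and average the
    # three re-weighted views into a single refined feature map.
    return (x * ch_gate[:, None, None]
            + x * t_gate[None, :, None]
            + x * f_gate[None, None, :]) / 3.0

# Example: refine a small multi-channel time-frequency feature map.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 10, 8))  # (channels, frames, freq bins)
refined = ctfa_sketch(feats)
```

Because every gate lies in (0, 1), the fused output preserves the input shape and never amplifies a feature; the branches only redistribute emphasis across channels, frames, and bins.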
Pages: 44418-44427
Page count: 10