Sampling-Frequency-Independent Convolutional Layer and its Application to Audio Source Separation

Cited by: 4
Authors
Saito, Koichi [1 ]
Nakamura, Tomohiko [1 ]
Yatabe, Kohei [2 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138656, Japan
[2] Tokyo Univ Agr & Technol, Dept Elect Engn & Comp Sci, Tokyo 1848588, Japan
Keywords
Convolution; Source separation; Finite impulse response filters; Task analysis; Time-frequency analysis; Time-domain analysis; Information filters; Audio source separation; analog-to-digital filter conversion; convolutional layer; deep neural networks; SPEECH SEPARATION; NEURAL-NETWORK; RECOGNITION
DOI
10.1109/TASLP.2022.3203907
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Audio source separation is often used as preprocessing for various tasks, and one of its ultimate goals is to construct a single versatile preprocessor that can handle any audio signal. One of the most important properties of a discrete-time audio signal is its sampling frequency. Since the sampling frequency is usually task-specific, a versatile preprocessor must handle all the sampling frequencies required by possible downstream tasks. However, conventional models based on deep neural networks (DNNs) are not designed to handle a variety of sampling frequencies and thus may not work appropriately at unseen ones. In this paper, we propose sampling-frequency-independent (SFI) convolutional layers capable of handling various sampling frequencies. The core idea of the proposed layers comes from our finding that a convolutional layer can be viewed as a collection of digital filters and therefore inherently depends on the sampling frequency. To overcome this dependency, we propose an SFI structure that features analog filters and generates the weights of a convolutional layer from those analog filters. By utilizing time- and frequency-domain analog-to-digital filter conversion techniques, we can adapt the convolutional layer to various sampling frequencies. As an example application, we construct an SFI version of a conventional source separation network. Through music source separation experiments, we show that the proposed layers enable separation networks to work consistently well at unseen sampling frequencies in terms of objective and perceptual separation quality. We also demonstrate that the proposed method outperforms a conventional method based on signal resampling when the sampling frequency of the input signal is significantly lower than the trained sampling frequency.
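The abstract's core idea, generating the weights of a convolutional layer from learnable analog filters so that the same layer can be instantiated at any sampling frequency, can be illustrated with a short sketch. The code below is a minimal PyTorch illustration, not the authors' implementation: it assumes a frequency-domain analog-to-digital conversion in which a learnable analog frequency response is sampled at the DFT bin frequencies of the target rate and turned into an FIR filter by an inverse DFT. All names (SFIConv1d, digital_weights, fs_hz, the grid parameters) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SFIConv1d(nn.Module):
    """Sketch of an SFI convolution: FIR weights are generated on the fly
    from learnable analog frequency responses, one per channel pair."""

    def __init__(self, in_ch, out_ch, kernel_size, max_freq_hz=24000.0, n_grid=64):
        super().__init__()
        self.kernel_size = kernel_size
        # Fixed grid of analog frequencies on which the response is parameterized.
        self.register_buffer("grid_hz", torch.linspace(0.0, max_freq_hz, n_grid))
        # Learnable complex analog frequency response (real and imaginary parts).
        self.resp_real = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, n_grid))
        self.resp_imag = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, n_grid))

    def digital_weights(self, fs_hz):
        """Frequency-domain analog-to-digital conversion for sampling rate fs_hz."""
        k = self.kernel_size
        n_grid = self.grid_hz.numel()
        # DFT bin frequencies of a length-k digital filter, expressed in Hz.
        bins_hz = torch.arange(k // 2 + 1, dtype=torch.float32) * fs_hz / k
        # Linearly interpolate the learnable analog response at those frequencies.
        pos = (bins_hz / self.grid_hz[-1] * (n_grid - 1)).clamp(0, n_grid - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=n_grid - 1)
        frac = pos - lo.float()
        real = (1 - frac) * self.resp_real[..., lo] + frac * self.resp_real[..., hi]
        imag = (1 - frac) * self.resp_imag[..., lo] + frac * self.resp_imag[..., hi]
        resp = torch.complex(real, imag)
        # Bins above the analog grid carry no learned response; zero them out.
        resp = torch.where(bins_hz <= self.grid_hz[-1], resp, torch.zeros_like(resp))
        # Inverse real DFT yields a length-k FIR filter per (out_ch, in_ch) pair.
        return torch.fft.irfft(resp, n=k)

    def forward(self, x, fs_hz):
        # Regenerate the digital weights for whatever rate the input uses.
        weight = self.digital_weights(fs_hz)
        return F.conv1d(x, weight, padding=self.kernel_size // 2)


# The same trained layer can then be applied at an unseen sampling frequency:
layer = SFIConv1d(in_ch=1, out_ch=8, kernel_size=64)
y16k = layer(torch.randn(2, 1, 16000), fs_hz=16000.0)  # rate seen in training
y8k = layer(torch.randn(2, 1, 8000), fs_hz=8000.0)     # unseen rate, no resampling
```

This sketch mirrors only the frequency-domain conversion mentioned in the abstract; the abstract also mentions a time-domain conversion (e.g., sampling the analog impulse response directly), which would replace the inverse-DFT step in digital_weights, and the paper's actual parameterization of the analog filters may differ from this free-form grid.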
Pages: 2928-2943
Page count: 16