Separation of speech & music using temporal-spectral features and neural classifiers

Cited by: 3
Authors
Sawant, Omkar [1]
Bhowmick, Anirban [1]
Bhagwat, Ganesh [2]
Affiliations
[1] VIT Bhopal Univ, SEEE, Bhopal, India
[2] Mercedes-Benz, Bangalore, India
Keywords
Music; Speech; MFCC; Spectrograms; SEK; RNN; CNN; SVM; NETWORKS;
DOI
10.1007/s12065-023-00828-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Separating speech from music plays a vital role in many areas of audio and speech processing. Spectrograms of speech and music show distinct patterns, which motivates differentiating speech and music signals within an audio segment. We further emphasize these patterns using Sobel edge kernels and Mel-spectrograms. For this paper, we built a dataset from the "All India Radio" news archives containing separate and overlapped speech and music data in different languages. Different input features are extracted from these audio segments and further emphasized before being fed to different classifiers that distinguish speech frames from music frames. We also compare the classification algorithms in terms of accuracy. We found that the convolutional neural network approach on Mel-spectrograms and the MFCC-delta-RNN method give significantly better results than the other approaches. To see how these approaches perform on audio in different languages, we applied the proposed method to three languages: Bengali, Punjabi, and Tamil. Its performance is consistent across all three languages. The paper also addresses the problem of classifying audio segments with overlapped speech and music regions and achieves a good level of accuracy.
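The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of emphasizing spectrogram patterns with Sobel edge kernels. It is not the authors' pipeline: it uses a plain STFT magnitude spectrogram instead of a Mel-spectrogram, a synthetic 440 Hz tone instead of radio audio, and the function names (`magnitude_spectrogram`, `filter2d_valid`, `sobel_edge_map`) and the frame and hop sizes are assumptions chosen for illustration.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier magnitude, shape (freq_bins, frames).
    frame_len and hop are illustrative values, not taken from the paper."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def filter2d_valid(image, kernel):
    """2-D cross-correlation over the 'valid' region (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

# Standard 3x3 Sobel kernels.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])  # change along the time axis (columns)
SOBEL_Y = SOBEL_X.T                     # change along the frequency axis (rows)

def sobel_edge_map(spec):
    """Gradient magnitude of the log-compressed spectrogram."""
    log_spec = np.log1p(spec)
    gx = filter2d_valid(log_spec, SOBEL_X)
    gy = filter2d_valid(log_spec, SOBEL_Y)
    return np.hypot(gx, gy)

# Synthetic stand-in for an audio segment: one second of a 440 Hz tone.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(tone)
edges = sobel_edge_map(spec)
print(spec.shape, edges.shape)
```

An edge map of this kind could then be passed to a classifier as an image-like input; which classifier and which feature layout the paper actually uses is not stated in this record beyond the CNN, RNN, and SVM names listed above.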
Pages: 1389-1403
Number of pages: 15