Power Wavelet Cepstral Coefficients (PWCC): An Accurate Auditory Model-Based Feature Extraction Method for Robust Speaker Recognition

Cited by: 0

Authors:

Zouhir, Youssef [1,2]

Zarka, Mohamed [3]

Ouni, Kais [1,2]

Amraoui, Lilia El [4]

Affiliations:

[1] Univ Carthage, Natl Engn Sch Carthage, Res Lab Smart Elect, Tunis 2035, Tunisia

[2] Univ Carthage, Natl Engn Sch Carthage, SE&ICT Lab, ICT,LR18ES44, Tunis 2035, Tunisia

[3] King Khalid Univ, Appl Coll Tanumah, Dept Comp Sci, Muhayil 61913, Saudi Arabia

[4] Princess Nourah Bint Abdulrahman Univ, Coll Engn, Dept Elect Engn, POB 84428, Riyadh 11671, Saudi Arabia

Source:

IEEE ACCESS | 2025 / Vol. 13

Keywords:

Feature extraction; Mel frequency cepstral coefficient; Accuracy; Wavelet transforms; Noise measurement; Filters; Computational modeling; Time-frequency analysis; Adaptation models; Speech recognition; Speaker recognition; machine learning; GMM-UBM; MFCC; PNCC; power wavelet cepstral coefficients (PWCC); noise robustness; wavelet transform; auditory models; cochlear filtering; biometric authentication; NOISE; FREQUENCY

DOI:

10.1109/ACCESS.2025.3576659

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

Human speaker recognition (SR) still outperforms recent machine learning approaches, even in noisy environments. To bridge this gap, researchers study the human auditory system to improve the performance of machine learning algorithms. This paper introduces a novel feature extraction method, "Power Wavelet Cepstral Coefficients" (PWCC), for enhancing SR accuracy. The method is derived from the "Normalized Wavelet FilterBank" (NWFB), which uses an "Equivalent Rectangular Bandwidth" rate (ERB-rate) scale, and additionally integrates a "Noise Suppression Module" (NSM). The NWFB imitates the cochlea's frequency selectivity using "Morlet Wavelet filters" together with an ERB-rate scale. The NSM applies a medium-duration power analysis, an asymmetrical noise-suppression module incorporating a temporal masking component, and a spectral smoothing module to reduce the impact of noise on the signal. To assess the performance of the proposed PWCC method, experiments were conducted using clean speech signals from the TIMIT database, corrupted with various noise types from the AURORA dataset. With a "Gaussian Mixture Model-Universal Background Model" (GMM-UBM) classifier, the PWCC method demonstrated superior SR accuracy in noisy environments compared to traditional methods such as PNCC and MFCC. Furthermore, PWCC maintained higher precision, recall, and F1-scores than PNCC and MFCC across the noise conditions overall. For instance, with babble noise at 15 dB SNR, PWCC achieved a recognition rate of 92.06%, compared to 75.24% for PNCC and 68.33% for MFCC.
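The ERB-rate scale mentioned in the abstract can be illustrated with a minimal sketch. It assumes the widely used Glasberg and Moore (1990) formulas for ERB-rate and ERB bandwidth; the abstract does not state which variant the paper uses, and the function names, frequency range, and filter count below are hypothetical choices for illustration, not the authors' settings.

```python
import math

def hz_to_erb_rate(f_hz):
    """ERB-rate for a frequency in Hz (Glasberg & Moore 1990; assumed variant)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (0.00437 * f_hz + 1.0)

def erb_center_frequencies(f_min, f_max, n_filters):
    """Centre frequencies spaced uniformly on the ERB-rate scale."""
    if n_filters < 2:
        raise ValueError("n_filters must be at least 2")
    lo, hi = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    step = (hi - lo) / (n_filters - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_filters)]

# Hypothetical configuration: 32 filters between 100 Hz and 8 kHz.
centers = erb_center_frequencies(100.0, 8000.0, 32)
bandwidths = [erb_bandwidth(fc) for fc in centers]
```

Uniform spacing on the ERB-rate scale places filters densely at low frequencies and sparsely at high frequencies, with bandwidths that grow with centre frequency, which is the cochlea-like frequency selectivity that an auditory filterbank is meant to imitate.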


Pages: 102323-102338

Page count: 16

