Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network

Cited by: 6

Authors:

Hou, Nana [1]

Xu, Chenglin [1,4]

Van Tung Pham [1]

Zhou, Joey Tianyi [3]

Chng, Eng Siong [1,2]

Li, Haizhou [4,5]

Affiliations:

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[2] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore

[3] ASTAR, Inst High Performance Comp IHPC, Singapore, Singapore

[4] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore

[5] Univ Bremen, Machine Listening Lab, Bremen, Germany

Source:

INTERSPEECH 2020 | 2020

Funding:

National Research Foundation, Singapore;

Keywords:

Speech bandwidth extension; Residual dual-path network; Speaker and phoneme knowledge; I-vector; Phonetic posteriorgram; WAVE;

DOI:

10.21437/Interspeech.2020-1994

Chinese Library Classification (CLC):

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification codes:

100104 ; 100213 ;

Abstract:

Speech bandwidth extension aims to generate a wideband signal from a narrowband (low-band) input by predicting the missing high-frequency components. It is believed that the general knowledge about the speaker and phonetic content strengthens the prediction. In this paper, we propose to augment the low-band acoustic features with i-vector and phonetic posteriorgram (PPG), which represent speaker and phonetic content of the speech, respectively. We also propose a residual dual-path network (RDPN) as the core module to process the augmented features, which fully utilizes the utterance-level temporal continuity information and avoids gradient vanishing. Experiments show that the proposed method achieves 20.2% and 7.0% relative improvements over the best baseline in terms of log-spectral distortion (LSD) and signal-to-noise ratio (SNR), respectively. Furthermore, our method is 16 times more compact than the best baseline in terms of the number of parameters.
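The abstract does not say how the i-vector and PPG are combined with the low-band features. A minimal sketch, assuming simple frame-wise concatenation with the utterance-level i-vector repeated for every frame, might look like the following. The function name `augment_features`, the feature dimensions, and the toy values are all hypothetical and are not taken from the paper.

```python
def augment_features(low_band, ppg, ivector):
    """Concatenate frame-level features with a repeated utterance-level i-vector.

    low_band: list of frames, each a list of low-band acoustic features
    ppg:      list of frames, each a list of phonetic posteriors
    ivector:  one speaker vector for the whole utterance

    Assumption (not stated in the abstract): augmentation is plain
    concatenation along the feature axis.
    """
    if len(low_band) != len(ppg):
        raise ValueError("low_band and ppg must have the same number of frames")
    # The i-vector is utterance-level, so the same vector is appended to every frame.
    return [list(lb) + list(pg) + list(ivector) for lb, pg in zip(low_band, ppg)]


# Toy example: 2 frames, 3 low-band features, 2 phonetic classes, 4-dim i-vector.
low_band = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
ppg = [[0.9, 0.1], [0.2, 0.8]]
ivector = [1.0, -1.0, 0.5, 0.0]

augmented = augment_features(low_band, ppg, ivector)
print(len(augmented), len(augmented[0]))
```

Each augmented frame carries the local acoustic and phonetic information plus a constant speaker descriptor; a downstream network such as the RDPN described above would consume these frames in sequence.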


Pages: 4064-4068

Page count: 5

