SPEECH INTELLIGIBILITY ENHANCEMENT USING NON-PARALLEL SPEAKING STYLE CONVERSION WITH STARGAN AND DYNAMIC RANGE COMPRESSION

Cited by: 0

Authors:

Li, Gang [1,2]

Hu, Ruimin [1,2]

Ke, Shanfa [1]

Zhang, Rui [1]

Wang, Xiaochen [1,3]

Gao, Li [1]

Affiliations:

[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Wuhan, Hubei, Peoples R China

[2] Wuhan Univ, Hubei Key Lab Multimedia & Network Commun Engn, Wuhan, Hubei, Peoples R China

[3] Wuhan Univ Shenzhen, Res Inst, Shenzhen, Peoples R China

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME) | 2020

Keywords:

speech intelligibility; Lombard effect; speaking style conversion (SSC); StarGAN; dynamic range compression (DRC); LOMBARD SPEECH; VOCODER; NOISE;

DOI:

10.1109/icme46284.2020.9102916

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. It is typically used in the listening stage of multimedia communications. In this study, we enhance speech intelligibility by speaking style conversion (SSC), a data-driven approach inspired by a vocal mechanism known as the Lombard effect. The proposed SSC method combines star generative adversarial network (StarGAN) based mapping and dynamic range compression (DRC). It has two main advantages: 1) unlike the gender-independent conversion used in previous studies, StarGAN can learn the speech features of different genders separately, providing gender-specific conversion with a single model and non-parallel training data; 2) we design a multi-level enhancement strategy that uses DRC in the StarGAN architecture, which improves SSC performance under strong noise interference. Experiments show that our method outperforms baseline methods.

Pages: 6

References (19 in total; entries 1 to 10 shown):

[1] A corpus of audio-visual Lombard speech with frontal and profile views
Alghamdi, Najwa
Maddock, Steve
Marxer, Ricard
Barker, Jon
Brown, Guy J.
[J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2018, 143 (06) : EL523 - EL529
[2] [Anonymous], 1996, ITU T RECOMMENDATION
[3] StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation
Choi, Yunjey
Choi, Minje
Kim, Munyoung
Ha, Jung-Woo
Kim, Sunghun
Choo, Jaegul
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018 : 8789 - 8797
[4] Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?
Garnier, Maeva
Henrich, Nathalie
[J]. COMPUTER SPEECH AND LANGUAGE, 2014, 28 (02) : 580 - 597
[5] Jokinen E., 2014, P INTERSPEECH, P2036
[6] STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds
Kawahara, Hideki
[J]. ACOUSTICAL SCIENCE AND TECHNOLOGY, 2006, 27 (06) : 349 - 353
[7] Optimizing Speech Intelligibility in a Noisy Environment
Kleijn, W. Bastiaan
Crespo, Joao B.
Hendriks, Richard C.
Petkov, Petko N.
Sauert, Bastian
Vary, Peter
[J]. IEEE SIGNAL PROCESSING MAGAZINE, 2015, 32 (02) : 43 - 54
[8] High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for mandarin
Liu, Kun
Zhang, Jianping
Yan, Yonghong
[J]. FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 4, PROCEEDINGS, 2007 : 410 - 414
[9] Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs
Lopez, Ana Ramirez
Seshadri, Shreyas
Juvela, Lauri
Rasanen, Okko
Alku, Paavo
[J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017 : 1363 - 1367
[10] WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications
Morise, Masanori
Yokomori, Fumiya
Ozawa, Kenji
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (07) : 1877 - 1884
