Improved Lite Audio-Visual Speech Enhancement

Cited by: 10
Authors
Chuang, Shang-Yi [1]
Wang, Hsin-Min [2]
Tsao, Yu [1]
Affiliations
[1] Academia Sinica, Research Center for Information Technology Innovation, Taipei 115, Taiwan
[2] Academia Sinica, Institute of Information Science, Taipei 115, Taiwan
Keywords
Visualization; Speech enhancement; Data models; Noise measurement; Hidden Markov models; Costs; Sensors; Asynchronous multimodal learning; audio-visual; data compression; low-quality data; noise reduction; intelligibility; recognition; algorithms; quality
DOI
10.1109/TASLP.2022.3153265
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE), in which visual data serve as auxiliary and complementary input to help reduce noise in noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared with conventional AVSE systems, LAVSE requires less online computation and partially addresses user privacy concerns regarding facial data. In this study, we extend LAVSE to address three practical issues often encountered when implementing AVSE systems: the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system, termed improved LAVSE (iLAVSE), uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that, compared with conventional AVSE systems, iLAVSE effectively overcomes the three practical issues above and improves enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
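To make the abstract's description concrete, below is a minimal PyTorch sketch of a convolutional recurrent neural network for audio-visual speech enhancement in the spirit of iLAVSE. The layer sizes, feature dimensions, mask-based output, and fusion strategy are illustrative assumptions rather than the paper's exact architecture; the visual branch merely stands in for the compressed, low-cost visual features that the abstract describes.

# A minimal PyTorch sketch of a CRNN-style audio-visual speech enhancement
# model. Layer sizes, feature dimensions, and the fusion strategy are
# illustrative assumptions, not the exact iLAVSE architecture.
import torch
import torch.nn as nn

class CRNNAVSE(nn.Module):
    def __init__(self, n_freq=257, visual_dim=64, hidden=256):
        super().__init__()
        # Convolutional front end over the noisy magnitude spectrogram.
        self.audio_conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Visual branch: assumes pre-extracted (e.g., compressed) lip-region
        # embeddings, already upsampled to the audio frame rate.
        self.visual_fc = nn.Linear(visual_dim, 64)
        # Recurrent layer fuses the concatenated audio/visual features over time.
        self.rnn = nn.LSTM(32 * n_freq + 64, hidden, batch_first=True)
        # Per-frame sigmoid mask applied to the noisy magnitude spectrogram.
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_feat):
        # noisy_mag: (batch, time, n_freq); visual_feat: (batch, time, visual_dim)
        b, t, f = noisy_mag.shape
        a = self.audio_conv(noisy_mag.unsqueeze(1))   # (b, 32, t, f)
        a = a.permute(0, 2, 1, 3).reshape(b, t, -1)   # (b, t, 32 * f)
        v = torch.relu(self.visual_fc(visual_feat))   # (b, t, 64)
        h, _ = self.rnn(torch.cat([a, v], dim=-1))    # (b, t, hidden)
        return noisy_mag * self.mask(h)               # enhanced magnitude

# Forward pass on dummy tensors to illustrate the expected shapes.
model = CRNNAVSE()
mag = torch.rand(2, 100, 257)    # two noisy utterances, 100 frames each
vis = torch.rand(2, 100, 64)     # time-aligned visual embeddings
enhanced = model(mag, vis)       # (2, 100, 257)

Estimating a mask over the noisy magnitude, rather than predicting the clean spectrogram directly, is a common design choice in speech enhancement; the enhanced waveform would then be reconstructed from the masked magnitude and the noisy phase via an inverse STFT.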
Pages: 1345-1359
Page count: 15