Multi-Modal Localization and Enhancement of Multiple Sound Sources from a Micro Aerial Vehicle

Cited by: 13

Authors:

Sanchez-Matilla, Ricardo [1]

Wang, Lin [1]

Cavallaro, Andrea [1]

Affiliations:

[1] Queen Mary Univ London, Ctr Intelligent Sensing, London, England

Source:

PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17) | 2017

Keywords:

audio-visual sensing; ego-noise reduction; micro aerial vehicles; microphone array; multi-modal localization; enhancement of multiple sound sources; multiple object tracking; tracking

DOI:

10.1145/3123266.3123412

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks environmental sounds and considerably degrades the quality of the on-board sound recording. Sound enhancement approaches generally require knowledge of the directions of arrival of the target sound sources, which are difficult to estimate due to the low signal-to-noise ratio (SNR) caused by the ego-noise and the interference between multiple sources. To address this problem, we propose a multi-modal analysis approach that jointly exploits audio and video data to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. We first perform audio-visual calibration via camera resectioning, audio-visual temporal alignment and geometrical alignment to jointly use the features in the audio and video streams, which are independently generated. The spatial information from the video is used to assist sound enhancement by tracking multiple potential sound sources with a particle filter. We then infer the directions of arrival of the target sources from the video tracking results and extract the sound from the desired direction with a time-frequency spatial filter, which suppresses the ego-noise by exploiting its time-frequency sparsity. Experimental results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.
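As an illustration of the step in which directions of arrival are inferred from video tracking results, the following is a minimal sketch, not the authors' implementation. It assumes an ideal pinhole camera with no lens distortion and assumes that the camera and microphone-array frames coincide (the abstract's geometrical alignment would handle the real offset). The function name `pixel_to_direction` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical and introduced only for this example.

```python
import math


def pixel_to_direction(u, v, fx, fy, cx, cy):
    """Map a tracked pixel (u, v) to (azimuth, elevation) in degrees.

    Assumptions (not from the paper): ideal pinhole camera, no lens
    distortion, camera frame aligned with the microphone-array frame.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    # Back-project the pixel onto the normalized image plane (z = 1).
    x = (u - cx) / fx
    y = (v - cy) / fy
    # Azimuth: angle in the horizontal plane, positive to the right.
    azimuth = math.degrees(math.atan2(x, 1.0))
    # Elevation: image y grows downward, so negate it to make "up" positive.
    elevation = math.degrees(math.atan2(-y, math.hypot(x, 1.0)))
    return azimuth, elevation
```

For example, a target tracked exactly at the principal point maps to a direction straight ahead (zero azimuth and elevation), and a target one focal length to the right of it maps to an azimuth of 45 degrees. The resulting angles could then steer a spatial filter toward the tracked source.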


Pages: 1591-1599

Page count: 9

