Enhancing Real-World Active Speaker Detection With Multi-Modal Extraction Pre-Training

Cited by: 0
Authors
Tao, Ruijie [1 ]
Qian, Xinyuan [2 ]
Das, Rohan Kumar [3 ]
Gao, Xiaoxue [4 ]
Wang, Jiadong [1 ]
Li, Haizhou [1 ,5 ,6 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[2] Univ Sci & Technol Beijing, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[3] Fortemedia, Singapore 138637, Singapore
[4] Inst Infocomm Res, Singapore 138632, Singapore
[5] Chinese Univ Hong Kong, Shenzhen Res Inst Big data, Sch Data Sci, Shenzhen 518172, Guangdong, Peoples R China
[6] Univ Bremen, D-28359 Bremen, Germany
Funding
National Natural Science Foundation of China
Keywords
Visualization; Lips; Videos; Face recognition; Time-domain analysis; Training; Synchronization; Speech recognition; Noise measurement; Correlation; Audio-visual active speaker detection; audio-visual target speech extraction; pre-training; self-supervised learning; VOICE ACTIVITY DETECTION; DATASET; SELF
DOI
10.1109/TMM.2024.3521791
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges of real-world AV-ASD scenarios. Because such scenarios often involve low-quality, noisy videos, AV-ASD systems without a selective listening ability struggle to filter disruptive voice components out of mixed audio inputs. In this paper, we propose a Multi-modal Speech Extraction-to-Detection framework named 'MuSED', which is first pre-trained on audio-visual target speech extraction to learn a denoising ability and then fine-tuned on the AV-ASD task. Meanwhile, to better capture multi-modal information and handle real-world problems such as missing modalities, MuSED operates directly in the time domain and integrates a multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms state-of-the-art AV-ASD methods, achieving 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset. We will publicly release the code in due course.
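The abstract describes a two-stage recipe: pre-train an audio-visual model on target speech extraction so it learns to suppress interfering voices, then fine-tune the same model for frame-level active speaker detection. Below is a minimal, hypothetical PyTorch sketch of that recipe. The module names, feature dimensions, SI-SDR extraction loss, and per-frame BCE detection loss are our own illustrative assumptions, not the authors' released code or exact architecture.

```python
# Hypothetical sketch of the two-stage extraction-to-detection recipe:
# stage 1 pre-trains a shared audio-visual backbone on target speech
# extraction (selective listening); stage 2 fine-tunes it for AV-ASD.
# All names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualBackbone(nn.Module):
    """Time-domain audio front-end + toy lip-feature encoder + fusion."""
    def __init__(self, dim=128):
        super().__init__()
        # 1-D conv on the raw waveform: time-domain modelling, no STFT.
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        # Stand-in for a real lip-crop video encoder.
        self.visual_enc = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, wav, lips):
        a = self.audio_enc(wav.unsqueeze(1)).transpose(1, 2)  # (B, Ta, dim)
        v, _ = self.visual_enc(lips)                          # (B, Tv, dim)
        t = min(a.size(1), v.size(1))  # crude alignment for the sketch
        return self.fuse(torch.cat([a[:, :t], v[:, :t]], dim=-1))

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR, a standard extraction objective."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref \
           / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return -10.0 * torch.log10(
        proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps).mean()

backbone = AudioVisualBackbone()
wav_mix = torch.randn(2, 16000)   # mixture: target speaker + interference
wav_tgt = torch.randn(2, 16000)   # clean target speech (dummy data here)
lips = torch.randn(2, 25, 128)    # pre-extracted lip features, 25 frames

# Stage 1: extraction pre-training -- decode the fused features back to a
# waveform and score it against the clean target with SI-SDR.
decoder = nn.ConvTranspose1d(128, 1, kernel_size=16, stride=8)
feats = backbone(wav_mix, lips)                    # (B, 25, 128)
est = decoder(feats.transpose(1, 2)).squeeze(1)    # (B, samples)
n = min(est.size(-1), wav_tgt.size(-1))
loss_extract = si_sdr_loss(est[:, :n], wav_tgt[:, :n])

# Stage 2: fine-tuning -- reuse the same backbone, swap the decoder for a
# per-frame speaking/not-speaking classification head.
asd_head = nn.Linear(128, 1)
labels = torch.randint(0, 2, (2, feats.size(1))).float()
loss_asd = nn.functional.binary_cross_entropy_with_logits(
    asd_head(feats).squeeze(-1), labels)
print(f"extraction {loss_extract.item():.2f}, ASD {loss_asd.item():.2f}")
```

The point of the sketch is the shared backbone: stage 1 forces it to use the lip stream to pull the target voice out of a noisy mixture, and stage 2 reuses exactly those denoised audio-visual features for the speaking/not-speaking decision.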
Pages: 2362-2373
Page count: 12