Enhancing Real-World Active Speaker Detection With Multi-Modal Extraction Pre-Training

Cited by: 0
Authors
Tao, Ruijie [1 ]
Qian, Xinyuan [2 ]
Das, Rohan Kumar [3 ]
Gao, Xiaoxue [4 ]
Wang, Jiadong [1 ]
Li, Haizhou [1 ,5 ,6 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[2] Univ Sci & Technol Beijing, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[3] Fortemedia, Singapore 138637, Singapore
[4] Inst Infocomm Res, Singapore 138632, Singapore
[5] Chinese Univ Hong Kong, Shenzhen Res Inst Big data, Sch Data Sci, Shenzhen 518172, Guangdong, Peoples R China
[6] Univ Bremen, D-28359 Bremen, Germany
Funding
National Natural Science Foundation of China
Keywords
Visualization; Lips; Videos; Face recognition; Time-domain analysis; Training; Synchronization; Speech recognition; Noise measurement; Correlation; Audio-visual active speaker detection; audio-visual target speech extraction; pre-training; self-supervised learning; VOICE ACTIVITY DETECTION; DATASET; SELF
DOI
10.1109/TMM.2024.3521791
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges of real-world AV-ASD scenarios. Because such scenarios often involve low-quality, noisy videos, AV-ASD systems without a selective listening ability struggle to filter disruptive voice components out of mixed audio inputs. In this paper, we propose a Multi-modal Speech Extraction-to-Detection framework named 'MuSED', which is first pre-trained on audio-visual target speech extraction to learn a denoising ability and then fine-tuned on the AV-ASD task. Meanwhile, to better capture multi-modal information and handle real-world problems such as missing modalities, MuSED operates directly in the time domain and integrates a multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms state-of-the-art AV-ASD methods, achieving 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset. We will publicly release the code in due course.
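The abstract describes a two-stage recipe: pre-train an audio-visual model on target speech extraction so it learns to suppress interfering voices, then fine-tune the same model for frame-level active speaker detection. Below is a minimal, hypothetical PyTorch sketch of that recipe. The module names, feature dimensions, SI-SDR extraction loss, and per-frame BCE detection loss are our own illustrative assumptions, not the authors' released code or exact architecture.

```python
# Hypothetical sketch of the two-stage extraction-to-detection recipe:
# stage 1 pre-trains a shared audio-visual backbone on target speech
# extraction (selective listening); stage 2 fine-tunes it for AV-ASD.
# All names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualBackbone(nn.Module):
    """Time-domain audio front-end + toy lip-feature encoder + fusion."""
    def __init__(self, dim=128):
        super().__init__()
        # 1-D conv on the raw waveform: time-domain modelling, no STFT.
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        # Stand-in for a real lip-crop video encoder.
        self.visual_enc = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, wav, lips):
        a = self.audio_enc(wav.unsqueeze(1)).transpose(1, 2)  # (B, Ta, dim)
        v, _ = self.visual_enc(lips)                          # (B, Tv, dim)
        t = min(a.size(1), v.size(1))  # crude alignment for the sketch
        return self.fuse(torch.cat([a[:, :t], v[:, :t]], dim=-1))

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR, a standard extraction objective."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref \
           / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return -10.0 * torch.log10(
        proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps).mean()

backbone = AudioVisualBackbone()
wav_mix = torch.randn(2, 16000)   # mixture: target speaker + interference
wav_tgt = torch.randn(2, 16000)   # clean target speech (dummy data here)
lips = torch.randn(2, 25, 128)    # pre-extracted lip features, 25 frames

# Stage 1: extraction pre-training -- decode the fused features back to a
# waveform and score it against the clean target with SI-SDR.
decoder = nn.ConvTranspose1d(128, 1, kernel_size=16, stride=8)
feats = backbone(wav_mix, lips)                    # (B, 25, 128)
est = decoder(feats.transpose(1, 2)).squeeze(1)    # (B, samples)
n = min(est.size(-1), wav_tgt.size(-1))
loss_extract = si_sdr_loss(est[:, :n], wav_tgt[:, :n])

# Stage 2: fine-tuning -- reuse the same backbone, swap the decoder for a
# per-frame speaking/not-speaking classification head.
asd_head = nn.Linear(128, 1)
labels = torch.randint(0, 2, (2, feats.size(1))).float()
loss_asd = nn.functional.binary_cross_entropy_with_logits(
    asd_head(feats).squeeze(-1), labels)
print(f"extraction {loss_extract.item():.2f}, ASD {loss_asd.item():.2f}")
```

The point of the sketch is the shared backbone: stage 1 forces it to use the lip stream to pull the target voice out of a noisy mixture, and stage 2 reuses exactly those denoised audio-visual features for the speaking/not-speaking decision.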
Pages: 2362-2373
Page count: 12