Leveraging Visual Supervision for Array-Based Active Speaker Detection and Localization

Cited by: 4
Authors
Berghi, Davide [1 ]
Jackson, Philip J. B. [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Surrey GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); Innovate UK project;
Keywords
Active speaker detection and localization; self-supervised learning; multichannel; microphone array; SOUND EVENT LOCALIZATION; TRACKING;
DOI
10.1109/TASLP.2023.3346643
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. As a result, they tend to fail whenever the speaker's face is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground-truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network that provides the positions of the speakers as pseudo-labels, and the multichannel audio "student" network is trained to reproduce them. At inference, the student network generalizes to locate even occluded speakers that the teacher network cannot detect visually, yielding considerable improvements in recall. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced into the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
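To make the student-teacher pipeline described above concrete, below is a minimal PyTorch sketch, not the authors' implementation: a CRNN "student" that maps multichannel spatial features to per-frame speaker activity and horizontal position, trained against pseudo-labels produced by a pre-trained audio-visual "teacher" detector. The feature layout (15 GCC-PHAT-style channel-pair features over 64 frequency bins), the 112 azimuth bins, the layer sizes, and the loss in `train_step` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioCRNN(nn.Module):
    """CRNN 'student' mapping multichannel spatial features to per-frame
    activity and azimuth. All sizes are illustrative assumptions, not the
    paper's configuration."""
    def __init__(self, in_pairs=15, n_azimuth_bins=112):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_pairs, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),  # pool frequency only, keep time resolution
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * 4, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Per frame: 1 voice-activity logit + a distribution over azimuth bins.
        self.head = nn.Linear(256, 1 + n_azimuth_bins)

    def forward(self, x):                    # x: (batch, pairs, frames, 64)
        h = self.conv(x)                     # (batch, 64, frames, 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.gru(h)                   # (batch, frames, 256)
        out = self.head(h)
        return out[..., 0], out[..., 1:]     # activity logit, azimuth logits

def train_step(student, teacher_labels, feats, optimizer):
    """One distillation step: the teacher's frame-wise activity and azimuth
    bin serve as pseudo-labels for the audio-only student."""
    act_target, az_target = teacher_labels   # (B, T) float 0/1, (B, T) long
    act_logit, az_logits = student(feats)
    loss = F.binary_cross_entropy_with_logits(act_logit, act_target)
    mask = act_target > 0.5                  # supervise azimuth on active frames only
    if mask.any():
        loss = loss + F.cross_entropy(az_logits[mask], az_target[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy run with random tensors standing in for real features and teacher output.
student = AudioCRNN()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
feats = torch.randn(2, 15, 100, 64)          # (batch, pairs, frames, freq bins)
act = (torch.rand(2, 100) > 0.7).float()     # teacher: is a visible face speaking?
az = torch.randint(0, 112, (2, 100))         # teacher: horizontal position bin
print(train_step(student, (act, az), feats, opt))
```

Masking the azimuth loss to frames the teacher marks as active is one simple way to keep unreliable pseudo-labels from supervising the localization head; the paper's actual loss formulation may differ.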
Pages: 984-995
Page count: 12