Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Cited by: 10
Authors
Ahmad, Rehan [1 ]
Zubair, Syed [2 ]
Alquhayz, Hani [3 ]
Ditta, Allah [4 ]
Affiliations
[1] Int Islamic Univ, Dept Elect Engn, Islamabad 44000, Pakistan
[2] Analyt Camp, Islamabad 44000, Pakistan
[3] Majmaah Univ, Dept Comp Sci & Informat, Coll Sci Zulfi, Al Majmaah 11952, Saudi Arabia
[4] Univ Educ, Div Sci & Technol, Lahore 54770, Pakistan
Keywords
speaker diarization; SyncNet; Gaussian mixture model; diarization error rate; speech activity detection; MFCC; meetings
DOI: 10.3390/s19235163
CLC Number: O65 [Analytical Chemistry]
Subject Classification: 070302; 081704
Abstract
Speaker diarization systems aim to answer the question 'who spoke when?' in multi-speaker recordings. Datasets typically consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that identifies the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the accompanying audio. For that purpose, short video segments comprising face-only regions are extracted with a face detection technique and fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. The audio frames corresponding to video segments that the model infers with high confidence are then used to train Gaussian mixture model (GMM)-based clusters, which yields speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method shows a significant improvement in diarization error rate (DER) over conventional and fully supervised audio-based speaker diarization. Its results are very close to those of complex state-of-the-art multimodal diarization systems, which demonstrates the significance of such a simple yet effective technique.
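To make the clustering stage concrete, the following minimal Python sketch illustrates one way it could be implemented under stated assumptions: the face tracks, their time segments, and the per-segment synchronization confidences are assumed to come from a face detector and a pre-trained SyncNet-style model (not shown here), and the names `sync_scores`, `CONF_THRESHOLD`, and the MFCC/GMM settings are illustrative choices, not values taken from the paper.

```python
# Sketch of the GMM clustering stage, assuming synchronization confidences
# per face track were already produced by a pre-trained SyncNet-style model.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SAMPLE_RATE = 16000
CONF_THRESHOLD = 3.0  # hypothetical cutoff for "well-synchronized" segments


def mfcc_frames(audio, sr=SAMPLE_RATE):
    """Frame-level MFCC features, time-major: shape (n_frames, n_mfcc)."""
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=19).T


def train_speaker_gmms(audio, segments, sync_scores):
    """Fit one GMM per visible speaker from high-confidence audio segments.

    segments:    {speaker_id: [(start_s, end_s), ...]} from face tracking
    sync_scores: {speaker_id: [confidence, ...]} aligned with segments
    """
    gmms = {}
    for spk, segs in segments.items():
        feats = [
            mfcc_frames(audio[int(s * SAMPLE_RATE):int(e * SAMPLE_RATE)])
            for (s, e), score in zip(segs, sync_scores[spk])
            if score >= CONF_THRESHOLD  # keep only well-synchronized audio
        ]
        if feats:
            # 16 diagonal components is an illustrative model size.
            gmms[spk] = GaussianMixture(n_components=16, covariance_type="diag")
            gmms[spk].fit(np.vstack(feats))
    return gmms


def diarize(audio, gmms, hop_s=0.5):
    """Label each audio window with the speaker GMM that scores it highest."""
    labels = []
    hop = int(hop_s * SAMPLE_RATE)
    for start in range(0, len(audio) - hop, hop):
        feats = mfcc_frames(audio[start:start + hop])
        scores = {spk: g.score(feats) for spk, g in gmms.items()}
        labels.append((start / SAMPLE_RATE, max(scores, key=scores.get)))
    return labels
```

In practice, a speech activity detection front end would discard non-speech windows before scoring, and the confidence threshold and model sizes would be tuned on held-out data.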
Pages: 14