Probabilistic Speaker Diarization With Bag-of-Words Representations of Speaker Angle Information

Cited by: 14
Authors
Ishiguro, Katsuhiko [1]
Yamada, Takeshi
Araki, Shoko [1]
Nakatani, Tomohiro [1]
Sawada, Hiroshi [1]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Kyoto 6190237, Japan
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2012 / Vol. 20 / Issue 02
Keywords
Bag-of-words (BOW); clustering; direction of arrival (DOA); latent Dirichlet allocation (LDA); speaker diarization; microphone arrays; variational Bayes inference; LECTURE;
DOI
10.1109/TASL.2011.2151858
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Speaker diarization determines "who spoke when" from the recorded conversations of an unknown number of people. In general, we have no a priori information about the number, the locations, or even the characteristics of the speakers. Additionally, speakers' speech utterances vary dynamically because of turn-taking during the conversations. These conditions make the speaker-clustering task extremely difficult. The problem becomes even harder if online (incremental) processing is required. In this paper, we formulate the speaker-clustering problem as the clustering of the sequential audio features generated by an unknown number of latent mixture components (speakers). We employ a probabilistic model that assumes time-sensitive speaker mixtures at every time frame, which, surprisingly, suits the diarization scenario. We combine the time-varying probabilistic model with direction of arrival (DOA) information calculated from a microphone array in a bag-of-words (BoW)-style feature representation. The proposed system effectively estimates the number and locations of the speakers in an online manner based on the standard Bayes inference scheme. Experiments confirm that the proposed model can successfully infer the number and features of speakers and yield better or comparable speaker diarization results compared with conventional methods in several datasets.
Pages: 447-460
Page count: 14
Related papers
28 in total
  • [21] MODELING AUDIO DIRECTIONAL STATISTICS USING A PROBABILISTIC SPATIAL DICTIONARY FOR SPEAKER DIARIZATION IN REAL MEETINGS
    Fakhry, Mahmoud
    Ito, Nobutaka
    Araki, Shoko
    Nakatani, Tomohiro
    2016 IEEE INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), 2016
  • [22] Speaker Diarization using Eye-gaze Information in Multi-party Conversations
    Inoue, Koji
    Wakabayashi, Yukoh
    Yoshimoto, Hiromasa
    Kawahara, Tatsuya
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014: 562 - 566
  • [23] Speaker diarization for multiple-distant-microphone meetings using several sources of information
    Pardo, Jose M.
    Anguera, Xavier
    Wooters, Charles
    IEEE TRANSACTIONS ON COMPUTERS, 2007, 56 (09) : 1212 - 1224
  • [24] Enhanced speaker diarization with detection of backchannels using eye-gaze information in poster conversations
    Inoue, Koji
    Wakabayashi, Yukoh
    Yoshimoto, Hiromasa
    Takanashi, Katsuya
    Kawahara, Tatsuya
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015: 3086 - 3090
  • [25] LOW-LATENCY SPEAKER DIARIZATION BASED ON BAYESIAN INFORMATION CRITERION WITH MULTIPLE PHONEME CLASSES
    Oku, Takahiro
    Sato, Shoei
    Kobayashi, Akio
    Homma, Shinichi
    Imai, Toru
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012: 4189 - 4192
  • [26] Integrating Online I-vector extractor with Information Bottleneck based Speaker Diarization system
    Madikeri, Srikanth
    Himawan, Ivan
    Motlicek, Petr
    Ferras, Marc
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015: 3105 - 3109
  • [27] INCREMENTAL TRANSFER LEARNING IN TWO-PASS INFORMATION BOTTLENECK BASED SPEAKER DIARIZATION SYSTEM FOR MEETINGS
    Dawalatabad, Nauman
    Madikeri, Srikanth
    Sekhar, C. Chandra
    Murthy, Hema A.
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6291 - 6295
  • [28] MAXIMUM-LIKELIHOOD ONLINE SPEAKER DIARIZATION IN NOISY MEETINGS BASED ON CATEGORICAL MIXTURE MODEL AND PROBABILISTIC SPATIAL DICTIONARY
    Ito, Nobutaka
    Makino, Takashi
    Araki, Shoko
    Nakatani, Tomohiro
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 546 - 550