Localization and selection of speaker-specific information with statistical modeling

Cited by: 30

Authors:

Besacier, L [1]

Bonastre, JF [1]

Fredouille, C [1]

Affiliation:

[1] Lab Informat Avignon LIA CERI Agroparc, F-84911 Avignon 9, France

Source:

SPEECH COMMUNICATION | 2000 / Volume 31 / Issue 2-3

Keywords:

speaker recognition; speaker-specific information; on-line selection; pruning; statistical modeling; time-frequency architecture;

DOI:

10.1016/S0167-6393(99)00070-9

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in laboratory conditions but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which pieces of information the system takes into account. To solve this problem and improve current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim at localizing the most useful acoustic events for speaker recognition. The relation between discriminant ability and the nature of speech events is studied. In particular, the phonetic content, the signal stability and the frequency domain are explored. Finally, the potential of dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors suggest a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knock-out) allow only global and a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very efficient for recognizing a particular speaker while being non-informative for another. Moreover, some information classes may be corrupted or even missing under particular recording conditions. This need for speaker-specific processing and for adaptability to the environment (with no a priori knowledge of the degradation affecting the signal) leads the authors to propose a system that automatically selects the most discriminant parts of a speech utterance. The proposed architecture divides the signal into different time-frequency blocks. The likelihood is calculated after dynamically selecting the most useful blocks. This information selection leads to a significant error rate reduction (up to 41% relative error rate decrease on TIMIT) for short training and test durations. Finally, experiments with simulated noise degradation show that this approach is a very efficient way to deal with partially corrupted speech. (C) 2000 Published by Elsevier Science B.V. All rights reserved.
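
The idea of scoring an utterance only on its most useful time-frequency blocks can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the selection criterion, so the rule used here (keep the highest-scoring fraction of blocks) is an assumption, and the names `pruned_score`, `identify` and `keep_fraction` are hypothetical. The per-block log-likelihoods are made-up numbers standing in for the output of a speaker model.

```python
import numpy as np


def pruned_score(block_loglik, keep_fraction=0.5):
    """Mean of the highest per-block log-likelihoods; the rest are pruned.

    block_loglik: array of shape (n_time_blocks, n_freq_bands).
    keep_fraction: assumed pruning parameter (fraction of blocks kept).
    """
    scores = np.asarray(block_loglik, dtype=float).ravel()
    if scores.size == 0:
        raise ValueError("need at least one block")
    n_keep = max(1, int(round(keep_fraction * scores.size)))
    kept = np.sort(scores)[-n_keep:]  # the n_keep largest values
    return float(kept.mean())


def identify(loglik_per_speaker, keep_fraction=0.5):
    """Return the best-scoring speaker and all pruned scores."""
    scores = {
        speaker: pruned_score(ll, keep_fraction)
        for speaker, ll in loglik_per_speaker.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: 3 time blocks x 2 frequency bands per speaker model.
# For speaker "A" the second band is assumed to be corrupted by noise.
toy = {
    "A": [[-1.0, -20.0], [-1.0, -20.0], [-1.0, -20.0]],
    "B": [[-5.0, -5.0], [-5.0, -5.0], [-5.0, -5.0]],
}

best_all_blocks, _ = identify(toy, keep_fraction=1.0)  # no pruning
best_pruned, _ = identify(toy, keep_fraction=0.5)      # keep best half
```

In this toy case the corrupted band dominates the unpruned score, so speaker "B" is chosen when all blocks are used, whereas pruning the worst half of the blocks lets speaker "A" win. A real system would need a principled way to set the pruning level, which this sketch does not address.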

Pages: 89-106

Page count: 18
