ARTSTREAM: a neural network model of auditory scene analysis and source segregation

Cited by: 51
Authors
Grossberg, S
Govindarajan, KK
Wyse, LL
Cohen, MA
Affiliations
[1] Boston Univ, Dept Cognit & Neural Syst, Ctr Adapt Syst, Boston, MA 02215 USA
[2] SpeechWorks Int, Boston, MA 02111 USA
[3] Informat Technol Lab, Singapore 119613, Singapore
Funding
US National Science Foundation;
Keywords
auditory scene analysis; streaming; cocktail party problem; pitch perception; spatial localization; neural network; resonance; adaptive resonance theory; spectral-pitch resonance;
DOI
10.1016/j.neunet.2003.10.002
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multiple sound sources often contain harmonics that overlap and may be degraded by environmental noise. The auditory system is capable of teasing apart these sources into distinct mental objects, or streams. Such an 'auditory scene analysis' enables the brain to solve the cocktail party problem. A neural network model of auditory scene analysis, called the ARTSTREAM model, is presented to propose how the brain accomplishes this feat. The model clarifies how the frequency components that correspond to a given acoustic source may be coherently grouped together into a distinct stream based on pitch and spatial location cues. The model also clarifies how multiple streams may be distinguished and separated by the brain. Streams are formed as spectral-pitch resonances that emerge through feedback interactions between frequency-specific spectral representations of a sound source and its pitch. First, the model transforms a sound into a spatial pattern of frequency-specific activation across a spectral stream layer. The sound has multiple parallel representations at this layer. A sound's spectral representation activates a bottom-up filter that is sensitive to the harmonics of the sound's pitch. This filter activates a pitch category which, in turn, activates a top-down expectation that is also sensitive to the harmonics of the pitch. Resonance develops when the spectral and pitch representations mutually reinforce one another. Resonance provides the coherence that allows one voice or instrument to be tracked through a noisy multiple-source environment. Spectral components are suppressed if they do not match harmonics of the top-down expectation that is read out by the selected pitch, thereby allowing another stream to capture these components, as in the 'old-plus-new heuristic' of Bregman. Multiple simultaneously occurring spectral-pitch resonances can hereby emerge.
These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, which clarifies how pitch representations can self-organize during learning of harmonic bottom-up filters and top-down expectations. The model also clarifies how spatial location cues can help to disambiguate two sources with similar spectral cues. Data are simulated from psychophysical grouping experiments, such as how a tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone due to proximity in frequency, even if noise replaces the tones at their intersection point. Illusory auditory percepts are also simulated, such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. Since related sorts of resonances have been used to quantitatively simulate psychophysical data about speech perception, the model strengthens the hypothesis that ART-like mechanisms are used at multiple levels of the auditory system. Proposals for developing the model to explain more complex streaming data are also provided. (C) 2004 Elsevier Ltd. All rights reserved.
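The resonance-and-matching cycle described in the abstract can be caricatured in a few lines of code. The sketch below is an illustrative simplification, not the ARTSTREAM differential equations: the binary integer-harmonic templates, the channel indexing, and the function names are all assumptions made for the example. It shows one bottom-up/top-down pass (spectral layer → harmonic filter → winning pitch category → top-down expectation → suppression of mismatched components), with the suppressed residual freed for capture by a second stream, in the spirit of the old-plus-new heuristic.

```python
# Toy sketch of an ART-style spectral-pitch resonance cycle (illustrative only).
# Frequency channels are indexed 1..N_CHANNELS in arbitrary harmonic-number units.

N_CHANNELS = 24

def harmonic_template(pitch, n_channels=N_CHANNELS):
    """Top-down expectation: 1.0 at integer multiples of the pitch channel."""
    return [1.0 if ch % pitch == 0 else 0.0 for ch in range(1, n_channels + 1)]

def pitch_activations(spectrum, candidate_pitches):
    """Bottom-up harmonic filter: each pitch category sums its harmonics."""
    return {p: sum(s * t for s, t in zip(spectrum, harmonic_template(p, len(spectrum))))
            for p in candidate_pitches}

def resonate_once(spectrum, candidate_pitches):
    """One cycle: the winning pitch reads out its expectation and suppresses
    spectral components that do not match its harmonics."""
    acts = pitch_activations(spectrum, candidate_pitches)
    winner = max(acts, key=acts.get)
    tpl = harmonic_template(winner, len(spectrum))
    matched = [s * t for s, t in zip(spectrum, tpl)]            # captured stream
    residual = [s * (1 - t) for s, t in zip(spectrum, tpl)]     # freed for a new stream
    return winner, matched, residual

# Two overlapping harmonic sources: pitch-2 harmonics {2, 4, 6, 8} and
# pitch-3 harmonics {3, 6, 9}; channel 6 is shared by both.
spectrum = [0.0] * N_CHANNELS
for ch in (2, 4, 6, 8):
    spectrum[ch - 1] += 1.0
for ch in (3, 6, 9):
    spectrum[ch - 1] += 1.0

p1, stream1, residual = resonate_once(spectrum, candidate_pitches=[2, 3, 5])
p2, stream2, _ = resonate_once(residual, candidate_pitches=[3, 5])
# p1 == 2: the pitch-2 category wins and captures channels 2, 4, 6, 8;
# p2 == 3: the residual (channels 3 and 9) is then captured as a second stream.
```

A single winner-take-all pass stands in here for what the paper models as a continuous resonance between the spectral and pitch layers; the point of the sketch is only the division of labor between the bottom-up harmonic filter, the top-down expectation, and the mismatch suppression that releases components to another stream.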
Pages: 511-536
Page count: 26
References
81 total
[1]  
Albert S. Bregman, 1990, AUDITORY SCENE ANAL, P411, DOI [DOI 10.1121/1.408434, DOI 10.7551/MITPRESS/1486.001.0001]
[2]  
[Anonymous], 1972, ASPECTS MOTION PERCE
[3]   HEARING THEORIES AND COMPLEX SOUNDS [J].
BEKESY, GV .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1963, 35 (04) :588-&
[4]   SOME PARAMETERS INFLUENCING PERCEPTIBILITY OF PITCH [J].
BILSEN, FA ;
RITSMA, RJ .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1970, 47 (02) :469-&
[5]   FUSION OF SIMULTANEOUS TONAL GLIDES - THE ROLE OF PARALLELNESS AND SIMPLE FREQUENCY RELATIONS [J].
BREGMAN, AS ;
DOEHRING, P .
PERCEPTION & PSYCHOPHYSICS, 1984, 36 (03) :251-256
[6]   AUDITORY SEGREGATION - STREAM OR STREAMS [J].
BREGMAN, AS ;
RUDNICKY, AI .
JOURNAL OF EXPERIMENTAL PSYCHOLOGY-HUMAN PERCEPTION AND PERFORMANCE, 1975, 1 (03) :263-267
[7]   PRIMARY AUDITORY STREAM SEGREGATION AND PERCEPTION OF ORDER IN RAPID SEQUENCES OF TONES [J].
BREGMAN, AS ;
CAMPBELL, J .
JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 1971, 89 (02) :244-&
[8]   AUDITORY STREAMING AND THE BUILDING OF TIMBRE [J].
BREGMAN, AS ;
PINKER, S .
CANADIAN JOURNAL OF PSYCHOLOGY-REVUE CANADIENNE DE PSYCHOLOGIE, 1978, 32 (01) :19-31
[9]   AUDITORY CONTINUITY AND AMPLITUDE EDGES [J].
BREGMAN, AS ;
DANNENBRING, GL .
CANADIAN JOURNAL OF PSYCHOLOGY-REVUE CANADIENNE DE PSYCHOLOGIE, 1977, 31 (03) :151-159
[10]   AUDITORY STREAMING AND VERTICAL LOCALIZATION - INTERDEPENDENCE OF WHAT AND WHERE DECISIONS IN AUDITION [J].
BREGMAN, AS ;
STEIGER, H .
PERCEPTION & PSYCHOPHYSICS, 1980, 28 (06) :539-546