Listening with generative models

Times Cited: 0
Authors
Cusimano, Maddie [1 ]
Hewitt, Luke B. [1 ]
McDermott, Josh H. [1 ,2 ,3 ,4 ]
Affiliations
[1] MIT, Dept Brain & Cognit Sci, Cambridge, MA 02139 USA
[2] MIT, McGovern Inst, Cambridge, MA USA
[3] MIT, Ctr Brains Minds & Machines, Cambridge, MA USA
[4] Harvard Univ, Speech & Hearing Biosci & Technol Program, Cambridge, MA USA
Keywords
Auditory scene analysis; Bayesian inference; Illusions; Grouping; Perceptual organization; Natural sounds; Probabilistic program; World model; Perception; COCKTAIL PARTY; GESTALT PSYCHOLOGY; NEWBORN-INFANTS; SOUND SOURCES; PERCEPTION; SPEECH; SEPARATION; ORGANIZATION; STATISTICS; STREAM
DOI
10.1016/j.cognition.2024.105874
CLC Classification Number
B84 [Psychology]
Discipline Classification Code
04; 0402
Abstract
Perception has long been envisioned to use an internal model of the world to explain the causes of sensory signals. However, such accounts have historically not been testable, typically requiring intractable search through the space of possible explanations. Using auditory scenes as a case study, we leveraged contemporary computational tools to infer explanations of sounds in a candidate internal generative model of the auditory world (ecologically inspired audio synthesizers). Model inferences accounted for many classic illusions. Unlike traditional accounts of auditory illusions, the model is applicable to any sound, and exhibited human-like perceptual organization for real-world sound mixtures. The combination of stimulus-computability and interpretable model structure enabled 'rich falsification', revealing additional assumptions about sound generation needed to account for perception. The results show how generative models can account for the perception of both classic illusions and everyday sensory signals, and illustrate the opportunities and challenges involved in incorporating them into theories of perception.
Pages: 64
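To make the abstract's central idea concrete, the sketch below illustrates analysis-by-synthesis: explaining an observed sound by searching for its causes under a generative model. This is a toy illustration under stated assumptions, not the authors' model or inference procedure; the "world model" (decaying sinusoidal sources), the Gaussian noise likelihood, the fixed amplitudes, and the grid-search inference are all simplifications chosen for brevity, whereas the paper uses ecologically inspired audio synthesizers and far more powerful inference.

```python
# A minimal analysis-by-synthesis sketch (an illustrative assumption, not the
# authors' model): posit a toy generative model of sound -- a sum of decaying
# sinusoidal sources -- and infer the source frequencies that best explain an
# observed waveform by scoring candidate explanations under a Gaussian likelihood.
import numpy as np

SR = 8000                          # sample rate (Hz)
t = np.arange(0, 0.25, 1 / SR)     # 250 ms of audio

def synthesize(freqs, amps):
    """Render a candidate 'scene': a sum of exponentially decaying tones."""
    decay = np.exp(-8.0 * t)
    return sum(a * decay * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

def log_likelihood(observed, candidate, noise_sd=0.05):
    """Gaussian observation model: how well a candidate explains the data."""
    resid = observed - candidate
    return -0.5 * np.sum((resid / noise_sd) ** 2)

# "Observed" mixture: two overlapping tones plus noise. (In the paper, the
# inputs are classic illusion stimuli and real-world sound mixtures.)
rng = np.random.default_rng(0)
observed = synthesize([440.0, 660.0], [1.0, 0.6]) + rng.normal(0.0, 0.05, t.shape)

# Crude inference: grid search over two-source explanations for the best
# (MAP-like) candidate. Amplitudes are held at their true values for brevity.
grid = np.arange(300.0, 800.0, 20.0)
best, best_lp = None, -np.inf
for f1 in grid:
    for f2 in grid:
        if f2 <= f1:
            continue
        lp = log_likelihood(observed, synthesize([f1, f2], [1.0, 0.6]))
        if lp > best_lp:
            best, best_lp = (f1, f2), lp

print(f"Inferred source frequencies: {best} Hz (log-likelihood {best_lp:.1f})")
```

Run as written, the search should recover (440.0, 660.0) Hz; the point is only that perception-as-inference can be operationalized by scoring candidate causal explanations of a signal under an internal generative model.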