On Generative Spoken Language Modeling from Raw Audio

Cited by: 94
Authors
Lakhotia, Kushal [1]
Kharitonov, Eugene [1]
Hsu, Wei-Ning [1]
Adi, Yossi [1]
Polyak, Adam [1]
Bolte, Benjamin [1]
Nguyen, Tu-Anh [1, 3]
Copet, Jade [1]
Baevski, Alexei [1]
Mohamed, Abdelrahman [1]
Dupoux, Emmanuel [1, 2]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] EHESS, Paris, France
[3] INRIA, Paris, France
Keywords
UNSUPERVISED UNIT DISCOVERY; SPEECH REPRESENTATION; MARKOV MODEL; VARIATIONAL AUTOENCODER
DOI
10.1162/tacl_a_00430
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
Pages: 1336-1354
Page count: 19
Related papers
50 in total
  • [1] Generative Spoken Dialogue Language Modeling
    Nguyen, Tu Anh
    Kharitonov, Eugene
    Copet, Jade
    Adi, Yossi
    Hsu, Wei-Ning
    Elkahky, Ali
    Tomasello, Paden
    Algayres, Robin
    Sagot, Benoit
    Mohamed, Abdelrahman
    Dupoux, Emmanuel
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2023, 11 : 250 - 266
  • [2] SchrodingeRNN: Generative Modeling of Raw Audio as a Continuously Observed Quantum State
    Uranga, Benat Mencia
    Lamacraft, Austen
    MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, VOL 107, 2020, 107 : 74 - 106
  • [3] Generative Spoken Language Model based on continuous word-sized audio tokens
    Algayres, Robin
    Adi, Yossi
    Tu Anh Nguyen
    Copet, Jade
    Synnaeve, Gabriel
    Sagot, Benoit
    Dupoux, Emmanuel
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3008 - 3028
  • [4] FloWaveNet : A Generative Flow for Raw Audio
    Kim, Sungwon
    Lee, Sang-gil
    Song, Jongyoon
    Kim, Jaehyeon
    Yoon, Sungroh
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [5] How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics
    Park, Joonyong
    Takamichi, Shinnosuke
    Nakamura, Tomohiko
    Seki, Kentaro
    Xin, Detai
    Saruwatari, Hiroshi
    INTERSPEECH 2023, 2023, : 1085 - 1089
  • [6] Text-Free Prosody-Aware Generative Spoken Language Modeling
    Kharitonov, Eugene
    Lee, Ann
    Polyak, Adam
    Adi, Yossi
    Copet, Jade
    Lakhotia, Kushal
    Tu-Anh Nguyen
    Riviere, Morgane
    Mohamed, Abdelrahman
    Dupoux, Emmanuel
    Wei-Ning Hsu
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 8666 - 8681
  • [7] Generative and Discriminative Algorithms for Spoken Language Understanding
    Raymond, Christian
    Riccardi, Giuseppe
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 413 - 416
  • [8] A GENERATIVE MODEL FOR RAW AUDIO USING TRANSFORMER ARCHITECTURES
    Verma, Prateek
    Chafe, Chris
    2021 24TH INTERNATIONAL CONFERENCE ON DIGITAL AUDIO EFFECTS (DAFX), 2021, : 230 - 237
  • [9] FROM AUDIO TO SEMANTICS: APPROACHES TO END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Haghani, Parisa
    Narayanan, Arun
    Bacchiani, Michiel
    Chuang, Galen
    Gaur, Neeraj
    Moreno, Pedro
    Prabhavalkar, Rohit
    Qu, Zhongdi
    Waters, Austin
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 720 - 726
  • [10] JOINT GENERATIVE AND DISCRIMINATIVE MODELS FOR SPOKEN LANGUAGE UNDERSTANDING
    Dinarelli, Marco
    Moschitti, Alessandro
    Riccardi, Giuseppe
    2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, 2008, : 61 - 64