On Generative Spoken Language Modeling from Raw Audio

Citations: 94
Authors
Lakhotia, Kushal [1]
Kharitonov, Eugene [1]
Hsu, Wei-Ning [1]
Adi, Yossi [1]
Polyak, Adam [1]
Bolte, Benjamin [1]
Nguyen, Tu-Anh [1, 3]
Copet, Jade [1]
Baevski, Alexei [1]
Mohamed, Abdelrahman [1]
Dupoux, Emmanuel [1, 2]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] EHESS, Paris, France
[3] INRIA, Paris, France
Keywords
UNSUPERVISED UNIT DISCOVERY; SPEECH REPRESENTATION; MARKOV MODEL; VARIATIONAL AUTOENCODER;
DOI
10.1162/tacl_a_00430
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
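The three-stage baseline named in the abstract (discrete encoder, language model over pseudo-text units, decoder) can be illustrated with a toy sketch. This is not the authors' implementation: nearest-centroid quantization stands in for the speech encoder, a bigram count model stands in for the generative language model, and a centroid lookup stands in for the waveform decoder. All function names and the two-dimensional "frames" are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Toy 'encoder': map each feature frame to its nearest centroid index (a pseudo-text unit)."""
    def nearest(frame):
        return min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, centroids[k])),
        )
    return [nearest(f) for f in frames]


def train_bigram(sequences):
    """Toy 'language model': count unit-to-unit transitions in the pseudo-text."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, start, length, rng):
    """Generate a unit sequence by sampling transitions in proportion to their counts."""
    out = [start]
    while len(out) < length and counts.get(out[-1]):
        units, weights = zip(*counts[out[-1]].items())
        out.append(rng.choices(units, weights=weights)[0])
    return out


def decode(units, centroids):
    """Toy 'decoder': map units back to centroid frames (a real system would synthesize a waveform)."""
    return [centroids[u] for u in units]


# Hypothetical data: two centroids and three 2-D feature frames.
centroids = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 1.2), (0.2, -0.1)]

units = quantize(frames, centroids)
lm = train_bigram([units + [1]])
generated = sample(lm, start=0, length=4, rng=random.Random(0))
reconstructed = decode(generated, centroids)
```

The number of centroids plays the role of the unit inventory size (50, 100, or 200 in the abstract); in this sketch it is simply the length of `centroids`.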
Pages: 1336-1354
Page count: 19