On Generative Spoken Language Modeling from Raw Audio

Cited by: 94
Authors
Lakhotia, Kushal [1]
Kharitonov, Eugene [1]
Hsu, Wei-Ning [1]
Adi, Yossi [1]
Polyak, Adam [1]
Bolte, Benjamin [1]
Nguyen, Tu-Anh [1, 3]
Copet, Jade [1]
Baevski, Alexei [1]
Mohamed, Abdelrahman [1]
Dupoux, Emmanuel [1, 2]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] EHESS, Paris, France
[3] INRIA, Paris, France
Keywords
UNSUPERVISED UNIT DISCOVERY; SPEECH REPRESENTATION; MARKOV MODEL; VARIATIONAL AUTOENCODER
DOI
10.1162/tacl_a_00430
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
Pages: 1336-1354
Page count: 19