On Generative Spoken Language Modeling from Raw Audio

Cited by: 94

Authors:

Lakhotia, Kushal [1]

Kharitonov, Eugene [1]

Hsu, Wei-Ning [1]

Adi, Yossi [1]

Polyak, Adam [1]

Bolte, Benjamin [1]

Tu-Anh Nguyen [1,3]

Copet, Jade [1]

Baevski, Alexei [1]

Mohamed, Abdelrahman [1]

Dupoux, Emmanuel [1,2]

Affiliations:

[1] Facebook AI Res, Menlo Pk, CA 94025 USA

[2] EHESS, Paris, France

[3] INRIA, Paris, France

Source:

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS | 2021 / Vol. 9

Keywords:

UNSUPERVISED UNIT DISCOVERY; SPEECH REPRESENTATION; MARKOV MODEL; VARIATIONAL AUTOENCODER;

DOI:

10.1162/tacl_a_00430

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Pages: 1336-1354

Page count: 19

50 records in total

[21] Modeling contrast in the generation and synthesis of spoken language
Prevost, S
ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1349 - 1352
[22] Language Modeling for Speech Recognition of Spoken Cantonese
Yeung, Yu Ting
Cao, Houwei
Zheng, N. H.
Lee, Tan
Ching, P. C.
INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 1570 - 1573
[23] Are Discrete Units Necessary for Spoken Language Modeling?
Nguyen, Tu Anh
Sagot, Benoit
Dupoux, Emmanuel
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1415 - 1423
[24] From spoken to written language
Casalis, S
A N A E-APPROCHE NEUROPSYCHOLOGIQUE DES APPRENTISSAGES CHEZ L ENFANT, 2001, 13 (2-3) : 75 - 77
[25] CONVERTING WRITTEN LANGUAGE TO SPOKEN LANGUAGE WITH NEURAL MACHINE TRANSLATION FOR LANGUAGE MODELING
Ando, Shintaro
Suzuki, Masayuki
Itoh, Nobuyasu
Kurata, Gakuto
Minematsu, Nobuaki
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8124 - 8128
[26] Integration of utterance verification with statistical language modeling and spoken language understanding
Rose, RC
Yao, H
Riccardi, G
Wright, J
SPEECH COMMUNICATION, 2001, 34 (04) : 321 - 331
[27] Integration of utterance verification with statistical language modeling and spoken language understanding
Rose, RC
Yao, H
Riccardi, G
Wright, J
PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 237 - 240
[28] An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks
Chang, Kai-Wei
Tseng, Wei-Cheng
Li, Shang-Wen
Lee, Hung-yi
INTERSPEECH 2022, 2022, : 5005 - 5009
[29] Automatic indexing of multimedia content by integration of audio, spoken language, and visual information
Ohtsuki, K
Bessho, K
Matsuo, Y
Matsunaga, S
Hayashi, Y
ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, 2003, : 601 - 606
[30] A Generative Language Modeling Approach for Ranking Entities
Weerkamp, Wouter
Balog, Krisztian
Meij, Edgar
ADVANCES IN FOCUSED RETRIEVAL, 2009, 5631 : 292 - 299