End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

Cited by: 9

Authors:

Effendi, Johanes [1,2]

Sakti, Sakriani [1,2]

Nakamura, Satoshi [1,2]

Affiliations:

[1] Nara Inst Sci & Technol, Ikoma 6300192, Japan

[2] RIKEN Ctr Adv Intelligence Project AIP, Tokyo 1030027, Japan

Source:

IEEE ACCESS | 2021 / Vol. 9

Funding:

Japan Society for the Promotion of Science;

Keywords:

Task analysis; Image reconstruction; Decoding; Training; Bridges; Speech recognition; Data models; Image-to-speech; image captioning; self-supervised speech representation; vector-quantized variational autoencoder; untranscribed unknown language;

DOI:

10.1109/ACCESS.2021.3071541

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Describing orally what we see is a simple task we perform in daily life. However, in the natural language processing field, this simple task needs to be bridged by a textual modality that helps the system generalize across the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that does not need any textual information in its training. We use a vector-quantized variational autoencoder (VQ-VAE) model to learn a discrete representation of a speech caption in an unsupervised manner, and the resulting discrete labels are used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the speech caption. Our experimental results with a multi-speaker natural speech dataset demonstrate that the proposed text-free Image2Speech system performs close to a system that uses textual information. Furthermore, our approach also outperforms the most recent existing phoneme-based and grounding-based Image2Speech frameworks.

Pages: 55144-55154

Page count: 11

