End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

Cited by: 9
Authors
Effendi, Johanes [1, 2]
Sakti, Sakriani [1, 2]
Nakamura, Satoshi [1, 2]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma 6300192, Japan
[2] RIKEN Ctr Adv Intelligence Project AIP, Tokyo 1030027, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
Task analysis; Image reconstruction; Decoding; Training; Bridges; Speech recognition; Data models; Image-to-speech; image captioning; self-supervised speech representation; vector-quantized variational autoencoder; untranscribed unknown language;
DOI
10.1109/ACCESS.2021.3071541
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Describing aloud what we see is a simple task that we perform in daily life. In natural language processing, however, this task is usually bridged by a textual modality that helps the system generalize over the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that needs no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) model to learn a discrete representation of the speech caption in an unsupervised manner, and the resulting discrete labels are used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while maintaining the quality of the speech caption. Our experimental results on a multi-speaker natural speech dataset show that the proposed text-free Image2Speech system performs close to a system that uses textual information. Furthermore, our approach outperforms the most recent existing phoneme-based and grounding-based Image2Speech frameworks.
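As a rough illustration of the discretization idea mentioned in the abstract, the sketch below shows only the nearest-codebook lookup that a vector-quantization layer performs: each continuous speech-frame vector is replaced by the index of its closest codebook entry, giving a sequence of discrete labels. This is not the authors' implementation. The function name `quantize`, the toy codebook, and the toy frame values are all hypothetical, and a real VQ-VAE would learn both the encoder and the codebook from data.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook vector.

    Hypothetical helper for illustration; distance is squared Euclidean.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Toy values (assumed, not from the paper): a 3-entry codebook in 2-D
# and four "encoded speech frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

units = quantize(frames, codebook)
print(units)  # [0, 1, 2, 0]
```

In a text-free pipeline of the kind the abstract describes, a label sequence such as `units` would stand in for the textual caption that an image-captioning model normally predicts.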
Pages: 55144-55154
Number of pages: 11