Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT

Cited by: 8

Authors:

He, Yibo [1]

Seng, Kah Phooi [1,2,3]

Ang, Li Minn [3]

Affiliations:

[1] Xian Jiaotong Liverpool Univ, Sch AI & Adv Comp, Suzhou 215000, Peoples R China

[2] Queensland Univ Technol, Sch Comp Sci, Brisbane, Qld 4000, Australia

[3] Univ Sunshine Coast, Sch Sci Technol & Engn, Sippy Downs, Qld 4556, Australia

Source:

INFORMATION | 2023, Vol. 14, No. 10

Keywords:

Internet of things (IoT); generative adversarial networks (GANs); deep learning; audio-visual speech recognition

DOI:

10.3390/info14100575

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper proposes a novel multimodal generative adversarial network (GAN) architecture for audio-visual speech recognition (AVSR), referred to as the multimodal AVSR GAN, to improve both the energy efficiency and the AVSR classification accuracy of artificial intelligence Internet of things (IoT) applications. AVSR is a classical multimodal task that is commonly used in IoT and embedded systems. Examples of suitable IoT applications include in-cabin speech recognition for driving systems, AVSR in augmented reality environments, and interactive applications such as virtual aquariums. Using multimodal sensor data in IoT applications requires efficient information processing to meet the hardware constraints of IoT devices. The proposed multimodal AVSR GAN architecture is composed of a discriminator and a generator, each of which is a two-stream network with one stream for audio information and one for visual information. To validate this approach, we used augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets in the training process, and testing was performed using the original data. The experimental results showed that the proposed multimodal AVSR GAN architecture improved the AVSR classification accuracy. Furthermore, in this study, we discuss the domain of GANs and provide a concise summary of the proposed GANs.
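The two-stream idea in the abstract can be illustrated with a toy discriminator. The following is a minimal sketch only, not the authors' implementation: the abstract does not say how the audio and visual streams are fused, so fusion by concatenation, the single dense layer per stream, the layer sizes, and every name here (`TwoStreamDiscriminator`, `init_layer`, `score`) are assumptions made for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoStreamDiscriminator:
    """Hypothetical discriminator with separate audio and visual streams."""

    def __init__(self, audio_dim, visual_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.audio = init_layer(audio_dim, hidden_dim, rng)    # audio stream
        self.visual = init_layer(visual_dim, hidden_dim, rng)  # visual stream
        # Assumed fusion: concatenate both stream outputs, then one logit.
        self.head = init_layer(2 * hidden_dim, 1, rng)

    def score(self, audio_feat, visual_feat):
        """Return a value in (0, 1), read as the probability of 'real'."""
        a = relu(linear(audio_feat, *self.audio))
        v = relu(linear(visual_feat, *self.visual))
        fused = a + v  # list concatenation, not element-wise addition
        logit = linear(fused, *self.head)[0]
        return 1.0 / (1.0 + math.exp(-logit))


disc = TwoStreamDiscriminator(audio_dim=4, visual_dim=6)
probability = disc.score([0.1] * 4, [0.2] * 6)
```

A generator in the same spirit would mirror this layout, with one stream producing audio features and one producing visual features. The weights here are untrained, so the score carries no meaning beyond showing the data flow.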

Pages: 23
