Words shaping worlds: A comprehensive exploration of text-driven image and video synthesis with adversarial networks

Times Cited: 0
Authors
Khalid, Mohd Nor Akmal [1 ]
Ullah, Anwar [2 ]
Numan, Muhammad [3 ]
Majid, Abdul [4 ]
Affiliations
[1] Univ Kebangsaan Malaysia, Bangi, Malaysia
[2] Cent China Normal Univ, Wuhan, Peoples R China
[3] Japan Adv Inst Sci & Technol, Nomi, Japan
[4] Wuhan Univ, Wuhan, Peoples R China
Keywords
Artificial intelligence; Generative adversarial network (GAN); Text-to-image (T2I); Text-to-video (T2V); Qualitative evaluation; Quantitative evaluation; Computer vision; Generation; Attention; Motion; GAN
DOI
10.1016/j.neucom.2025.129767
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
A growing interest in education, media, and entertainment has accelerated Artificial Intelligence-powered content generation, mainly via Generative Adversarial Networks (GANs), which produce images, videos, audio, and text. A GAN combines two deep neural networks, a generator (G) and a discriminator (D), trained competitively against each other: G generates new data while D judges whether that data is real or generated. By leveraging powerful deep neural networks and this competitive training, GANs can synthesize plausible and realistic images and videos from text descriptions. This paper extensively reviews recent state-of-the-art GAN models for text-to-image (T2I) and text-to-video (T2V) synthesis. Databases including ACM, IEEE Xplore, Web of Science, and ScienceDirect were searched to identify and analyze the relevant research articles published in this area over the last decade, specifically from 2014 to 2024. T2I and T2V GAN methods were then classified according to their structure and functionality, and a comprehensive comparison of T2I and T2V GAN-based methods was conducted using various qualitative and quantitative evaluation techniques. Finally, the paper concludes by discussing applications, main challenges, and limitations of T2I and T2V GAN models for future consideration.
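To make the adversarial setup described above concrete, the following is a minimal sketch of a conditional GAN training loop: a generator G maps noise plus a text embedding to an image, while a discriminator D learns to separate real from generated images. All dimensions, layer choices, and the random stand-in data below are illustrative assumptions, not details taken from any of the surveyed models.

```python
# Minimal conditional-GAN sketch (illustrative; not any surveyed model's architecture).
import torch
import torch.nn as nn

Z_DIM, TXT_DIM, IMG_DIM = 64, 32, 28 * 28  # hypothetical sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + TXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh(),  # image pixels in [-1, 1]
        )

    def forward(self, z, txt):
        return self.net(torch.cat([z, txt], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # real/fake logit
        )

    def forward(self, img, txt):
        return self.net(torch.cat([img, txt], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):  # toy loop on random stand-in data
    real = torch.rand(16, IMG_DIM) * 2 - 1  # stand-in for real images
    txt = torch.randn(16, TXT_DIM)          # stand-in for text embeddings
    z = torch.randn(16, Z_DIM)

    # D step: push logits for real images toward 1, generated images toward 0.
    fake = G(z, txt).detach()
    loss_d = bce(D(real, txt), torch.ones(16, 1)) + \
             bce(D(fake, txt), torch.zeros(16, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # G step: update G so D labels its outputs as real.
    fake = G(z, txt)
    loss_g = bce(D(fake, txt), torch.ones(16, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The competitive dynamic is visible in the two losses: D minimizes classification error on real versus generated samples, while G minimizes the loss D would assign if its outputs were real, so each network's improvement pressures the other.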
Pages: 16