Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

Cited by: 2
Authors
Shahabaz, Ahmed [1]
Sarkar, Sudeep [1]
Affiliations
[1] University of South Florida, Department of Computer Science and Engineering, Tampa, FL 33620, USA
Funding
National Science Foundation (USA)
Keywords
Task analysis; Visualization; Deep learning; Surveys; Reviews; Location awareness; Hidden Markov models; Computer vision; Audio-visual systems; Multisensory integration; audio-video analysis; contrastive learning; multi-modal analysis; AUDIOVISUAL AFFECT RECOGNITION; EMOTION RECOGNITION; FUSION; SYNCHRONIZATION; SEGMENTATION; SPEECH; SOUND; CLASSIFICATION; EXPRESSIONS; FEATURES;
DOI
10.1109/ACCESS.2024.3391817
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.
Pages: 59399-59430
Page count: 32