Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

Cited by: 2

Authors:

Shahabaz, Ahmed [1]

Sarkar, Sudeep [1]

Affiliations:

[1] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA

Source:

IEEE ACCESS | 2024 / Vol. 12

Funding:

U.S. National Science Foundation;

Keywords:

Task analysis; Visualization; Deep learning; Surveys; Reviews; Location awareness; Hidden Markov models; Computer vision; Audio-visual systems; Multisensory integration; audio-video analysis; contrastive learning; multi-modal analysis; AUDIOVISUAL AFFECT RECOGNITION; EMOTION RECOGNITION; FUSION; SYNCHRONIZATION; SEGMENTATION; SPEECH; SOUND; CLASSIFICATION; EXPRESSIONS; FEATURES;

DOI:

10.1109/ACCESS.2024.3391817

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.

Pages: 59399-59430

Page count: 32
