Audio-visual self-supervised representation learning: A survey

Cited by: 0
Authors
Alsuwat, Manal [1 ]
Al-Shareef, Sarah [1 ]
Alghamdi, Manal [1 ]
Affiliations
[1] Umm Al Qura Univ, Dept Comp Sci & Artificial Intelligence, Mecca, Saudi Arabia
Keywords
Multimodal; Self-supervised learning; Deep learning; Pretext tasks; Data representation; Audio-visual learning; AUDIO; RECOGNITION; RETRIEVAL;
DOI
10.1016/j.neucom.2025.129750
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Artificial intelligence developers leverage the inherent relationships among video, text, and audio to create richer representations of the world, mirroring the way humans combine multiple senses to understand their environment. As such, multimodal learning, which integrates several input modalities to improve the learning of intrinsic features, has been gaining traction. While deep learning has driven progress in multimodal understanding, most applications still rely heavily on supervised learning and extensive human annotation. This paper provides a comprehensive review of audio-visual self-supervised learning, a promising alternative that exploits vast amounts of unlabeled data and holds the potential to reshape areas such as computer vision and speech recognition. We begin by explaining the concept of audio-visual modalities in machine learning and then examine their role within self-supervised learning, discussing terminology, general pipelines, and underlying motivations. This is followed by an exploration of common pretext tasks in audio-visual self-supervised learning, along with the evaluation methods, datasets, and downstream tasks associated with them. We then highlight prevailing challenges in both the audio-visual and self-supervised learning realms. The paper concludes by presenting open problems and suggesting avenues for future research in this dynamic domain.
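A common pretext task mentioned in the literature this survey covers is audio-visual correspondence: embeddings of the audio and video streams from the same clip are pulled together, while mismatched pairings within a batch serve as negatives. The sketch below is a minimal, hypothetical NumPy illustration of such a contrastive (InfoNCE-style) objective; the function name, embedding shapes, and temperature value are illustrative assumptions, not the formulation of any specific paper.

```python
import numpy as np

def av_correspondence_loss(audio_emb, video_emb, temperature=0.07):
    """Contrastive audio-visual correspondence loss (InfoNCE-style sketch).

    audio_emb, video_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same clip. Matching rows are positives; every other
    audio/video pairing in the batch is a negative.
    """
    # L2-normalise each embedding so similarities are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    # Pairwise similarity logits: entry (i, j) compares audio i with video j.
    logits = (a @ v.T) / temperature
    # Log-softmax over each row, with the diagonal (true pairs) as targets.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In a real pipeline the two embedding matrices would come from separate audio and video encoders trained jointly; here, passing identical matrices (perfectly aligned modalities) yields a lower loss than passing shuffled ones, which is the signal the pretext task exploits.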
Pages: 21