Caption Alignment for Low Resource Audio-Visual Data

Cited by: 1
Authors
Konda, Vighnesh Reddy [1 ]
Warialani, Mayur [1 ]
Achari, Rakesh Prasanth [1 ]
Bhatnagar, Varad [1 ]
Akula, Jayaprakash [1 ]
Jyothi, Preethi [1 ]
Ramakrishnan, Ganesh [1 ]
Haffari, Gholamreza [2 ]
Singh, Pankaj [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
[2] Monash Univ, Clayton, Vic, Australia
Source
INTERSPEECH 2020 | 2020
Keywords
multimodal models; low-resource audio-visual corpus; caption alignment for videos;
DOI
10.21437/Interspeech.2020-3157
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification codes
100104; 100213;
Abstract
Understanding videos via captioning has gained a lot of traction recently. While captions are provided alongside videos, the information about where a caption aligns within a video is missing, even though this information could be particularly useful for indexing and retrieval. Existing work on learning to infer alignments has mostly exploited visual features and ignored the audio signal; video understanding applications often underestimate the importance of the audio modality. We focus on how to make effective use of the audio modality for temporal localization of captions within videos. We release a new audio-visual dataset with captions time-aligned by (i) carefully listening to the audio and watching the video, and (ii) watching only the video. Our dataset is audio-rich and contains captions in two languages, English and Marathi (a low-resource language). We further propose an attention-driven multimodal model that effectively utilizes both audio and video for temporal localization. We then investigate (i) the effects of audio in both data preparation and model design, and (ii) effective pretraining strategies (AudioSet, ASR bottleneck features, PASE, etc.) for the low-resource setting that help extract rich audio representations.
Pages: 3525-3529
Page count: 5
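
As a rough illustration of the attention-driven multimodal model described in the abstract, the sketch below shows one plausible way to fuse per-frame audio and video features and let a caption embedding attend over them to score temporal positions. This is not the authors' implementation: the feature dimensions, the additive fusion, the choice of PyTorch modules, and the class name AVCaptionAligner are all assumptions made purely for illustration.

# Minimal, illustrative sketch of attention-driven audio-visual caption
# localization (NOT the paper's released code; all names and dimensions
# below are assumptions).
import torch
import torch.nn as nn


class AVCaptionAligner(nn.Module):
    def __init__(self, d_video=2048, d_audio=128, d_text=300, d_model=256):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Caption embedding acts as the query; fused A/V frames are keys/values.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.scorer = nn.Linear(d_model, 1)  # per-frame alignment logit

    def forward(self, video_feats, audio_feats, caption_emb):
        # video_feats: (B, T, d_video); audio_feats: (B, T, d_audio);
        # caption_emb: (B, d_text)
        frames = self.video_proj(video_feats) + self.audio_proj(audio_feats)
        query = self.text_proj(caption_emb).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.attn(query, frames, frames)        # (B, 1, d_model)
        # Score every frame against the caption-conditioned summary.
        scores = self.scorer(torch.tanh(frames + fused))   # (B, T, 1)
        return scores.squeeze(-1)                          # per-frame logits over time

if __name__ == "__main__":
    model = AVCaptionAligner()
    v = torch.randn(2, 50, 2048)   # e.g. CNN frame features (assumed size)
    a = torch.randn(2, 50, 128)    # e.g. AudioSet/PASE-style embeddings (assumed size)
    c = torch.randn(2, 300)        # pooled caption embedding (assumed size)
    print(model(v, a, c).shape)    # torch.Size([2, 50])

The per-frame logits could then be turned into a predicted temporal span, e.g. by thresholding or taking the highest-scoring contiguous window; the paper's actual localization head and training objective may differ.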