Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language

Cited by: 23
Authors
Chen, Shaoxiang [1 ]
Jiang, Yu-Gang [1 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2020, PT XX | 2020, Vol. 12365
Keywords
Temporal Activity Localization via Language; Hierarchical Visual-Textual Graph; Visual-textual alignment
DOI
10.1007/978-3-030-58565-5_36
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Temporal Activity Localization via Language (TALL) in video is a recently proposed and challenging vision task. Tackling it requires fine-grained understanding of the video content, which most existing works overlook. In this paper, we propose a novel TALL method that builds a Hierarchical Visual-Textual Graph to model interactions between objects and words, as well as among the objects themselves, in order to jointly understand the video content and the language. We also design a convolutional network with a cross-channel communication mechanism to further encourage information passing between the visual and textual modalities. Finally, we propose a loss function that enforces alignment between the visual representation of the localized activity and the sentence representation, so that the model can predict more accurate temporal boundaries. We evaluated the proposed method on two popular benchmark datasets, Charades-STA and ActivityNet Captions, and achieved state-of-the-art performance on both. Code is available at https://github.com/forwchen/HVTG.
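The visual-textual alignment objective mentioned in the abstract can be sketched as below. This is a minimal illustration, not the paper's actual formulation: the margin-based cosine alignment form, the `margin` value, and the function name `alignment_loss` are all assumptions for exposition.

```python
import numpy as np

def alignment_loss(activity_repr, sentence_repr, margin=0.2):
    """Illustrative alignment loss between the visual representation of the
    localized activity and the sentence representation.

    Both inputs are 1-D feature vectors. The loss is zero once their cosine
    similarity exceeds (1 - margin), otherwise it grows as the two modality
    embeddings drift apart.
    """
    # L2-normalize both representations so their dot product is a cosine.
    a = activity_repr / np.linalg.norm(activity_repr)
    s = sentence_repr / np.linalg.norm(sentence_repr)
    cos_sim = float(np.dot(a, s))
    # Hinge on the cosine similarity: push aligned pairs close together.
    return max(0.0, 1.0 - margin - cos_sim)
```

In training, such a term would be added to the localization loss so that gradients pull the localized segment's features toward the query sentence's embedding; perfectly aligned pairs incur zero penalty.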
Pages: 601-618 (18 pp.)