Adversarial Bipartite Graph Learning for Video Domain Adaptation

Cited by: 26
Authors
Luo, Yadan [1 ]
Huang, Zi [1 ]
Wang, Zijian [1 ]
Zhang, Zheng [2 ,3 ]
Baktashmotlagh, Mahsa [1 ]
Affiliations
[1] University of Queensland, Brisbane, QLD, Australia
[2] Peng Cheng Laboratory, Shenzhen, China
[3] Harbin Institute of Technology, Biocomputing Research Center, Shenzhen, China
Source
MM '20: Proceedings of the 28th ACM International Conference on Multimedia | 2020
Keywords
Video Action Recognition; Domain Adaptation
DOI
10.1145/3394171.3413897
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in video recognition due to the significant spatial and temporal shifts across the source (i.e., training) and target (i.e., test) domains. As such, recent visual domain adaptation works that leverage adversarial learning to unify the source and target video representations and strengthen feature transferability are not highly effective on videos. To overcome this limitation, we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework that directly models source-target interactions with the network topology of a bipartite graph. Specifically, source and target frames are sampled as heterogeneous vertices, while the edges connecting the two types of nodes measure the affinity between them. Through message passing, each vertex aggregates the features from its heterogeneous neighbors, forcing features coming from the same class to be mixed evenly. Explicitly exposing the video classifier to such cross-domain representations at both the training and test stages makes our model less biased toward the labeled source data, which in turn results in better generalization on the target domain. The proposed framework is agnostic to the choice of frame aggregation; therefore, four different aggregation functions are investigated for capturing appearance and temporal dynamics. To further enhance model capacity and verify the robustness of the proposed architecture on difficult transfer tasks, we extend our model to a semi-supervised setting using an additional video-level bipartite graph. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods on the task of video recognition.
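The cross-domain message-passing step described in the abstract can be illustrated with a minimal sketch in PyTorch. It assumes a softmax-normalized dot-product affinity between source and target frame features and a concatenation-based readout; the function name and these choices are illustrative assumptions, not the authors' exact ABG formulation.

    import torch
    import torch.nn.functional as F

    def bipartite_message_passing(src_feats, tgt_feats):
        """Aggregate features across a source-target bipartite graph.

        src_feats: (Ns, d) source frame features
        tgt_feats: (Nt, d) target frame features
        Returns cross-domain representations for both sides.
        """
        # Edge weights: affinity between every source and target frame
        # (here a scaled dot product, softmax-normalized per vertex).
        affinity = src_feats @ tgt_feats.t() / src_feats.size(-1) ** 0.5  # (Ns, Nt)

        # Each source vertex aggregates from its target neighbors, and vice
        # versa, so same-class features from both domains get mixed evenly.
        src_agg = F.softmax(affinity, dim=1) @ tgt_feats      # (Ns, d)
        tgt_agg = F.softmax(affinity.t(), dim=1) @ src_feats  # (Nt, d)

        # Expose the classifier to cross-domain representations by pairing
        # each vertex's own feature with its aggregated heterogeneous neighbors.
        src_mix = torch.cat([src_feats, src_agg], dim=-1)  # (Ns, 2d)
        tgt_mix = torch.cat([tgt_feats, tgt_agg], dim=-1)  # (Nt, 2d)
        return src_mix, tgt_mix

    # Usage: features from any frame aggregator (e.g., a CNN backbone) can be fed in.
    src = torch.randn(16, 256)  # 16 source frames, 256-d features
    tgt = torch.randn(16, 256)  # 16 target frames
    src_mix, tgt_mix = bipartite_message_passing(src, tgt)

Since the classifier only ever sees these mixed cross-domain representations, it cannot overfit to purely source-domain statistics, which is the intuition behind learning a domain-agnostic classifier rather than domain-invariant features.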
Pages: 19-27 (9 pages)