An Empirical Study of Self-Supervised Learning with Wasserstein Distance

Cited: 0
Authors
Yamada, Makoto [1,2]
Takezawa, Yuki [1,3]
Houry, Guillaume [1,4]
Dusterwald, Kira Michaela [1,5]
Sulem, Deborah [6]
Zhao, Han [7]
Tsai, Yao-Hung [1,8]
Affiliations
[1] Okinawa Inst Sci & Technol, Machine Learning & Data Sci Unit, Okinawa 9040412, Japan
[2] Ctr Adv Intelligence Project RIKEN, Tokyo 1030027, Japan
[3] Kyoto Univ, Dept Intelligence Sci & Technol, Kyoto 6068501, Japan
[4] Paris Saclay Ecole Normale Super, F-75005 Paris, France
[5] UCL, Gatsby Computat Neurosci Unit, London WC1E 6BT, England
[6] Univ Pompeu Fabra, Barcelona Sch Econ, Barcelona 08002, Spain
[7] Univ Illinois, Dept Comp Sci, Champaign, IL 61801 USA
[8] Carnegie Mellon Univ, Sch Comp Sci, Machine Learning Dept, Pittsburgh, PA 15213 USA
Keywords
optimal transport; Wasserstein distance; self-supervised learning;
DOI
10.3390/e26110939
Chinese Library Classification
O4 [Physics];
Subject Classification Code
0702;
Abstract
In this study, we consider the problem of self-supervised learning (SSL) using the 1-Wasserstein distance on a tree structure (also known as the Tree-Wasserstein distance, TWD), where TWD is defined as the L1 distance between two tree-embedded vectors. In SSL methods, the cosine similarity is often used as the objective function; however, the use of the Wasserstein distance in this role has not been well studied. Because training with the Wasserstein distance is numerically challenging, this study empirically investigates strategies for optimizing SSL with the Wasserstein distance and identifies a stable training procedure. More specifically, we evaluate the combination of two types of TWD (total variation and ClusterTree) with several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We propose a simple yet effective Jeffrey divergence-based regularization method to stabilize optimization. Through empirical experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a simple combination of the softmax function and TWD obtains significantly lower results than standard SimCLR, and that a simple combination of TWD and SimSiam fails to train the model. We find that model performance depends on the combination of TWD and probability model, and that the Jeffrey divergence regularization helps model training. Finally, we show that an appropriate combination of TWD and probability model outperforms cosine similarity-based representation learning.
Pages: 17
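To make the setup described in the abstract concrete, below is a minimal PyTorch sketch of a SimCLR-style objective in which the similarity is the negative TWD between softmax probability vectors, with a Jeffrey divergence term added as a regularizer. It uses the total-variation instance of TWD (a depth-one tree with unit edge weights), which reduces to half the L1 distance between probability vectors. The function names, the placement of the regularizer on positive pairs, and hyperparameters such as lam and temperature are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn.functional as F


def total_variation_twd(p, q):
    # TWD on a depth-one tree with unit edge weights reduces to total variation:
    # TV(p, q) = 0.5 * ||p - q||_1, for probability vectors p, q of shape (batch, dim).
    return 0.5 * (p - q).abs().sum(dim=-1)


def jeffrey_divergence(p, q, eps=1e-8):
    # Symmetric KL divergence: J(p, q) = KL(p || q) + KL(q || p).
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return (p * (p / q).log() + q * (q / p).log()).sum(dim=-1)


def nt_xent_twd(z1, z2, temperature=0.1, lam=0.1):
    # SimCLR-style NT-Xent loss where similarity = -TWD between softmax
    # probability vectors, plus Jeffrey divergence regularization on the
    # positive pair (one plausible placement of the regularizer).
    probs1 = F.softmax(z1, dim=-1)            # softmax probability model
    probs2 = F.softmax(z2, dim=-1)
    probs = torch.cat([probs1, probs2], dim=0)  # (2B, dim)
    n = probs.shape[0]
    b = z1.shape[0]

    # Pairwise total-variation TWD turned into logits.
    dist = 0.5 * torch.cdist(probs, probs, p=1)   # (2B, 2B)
    logits = -dist / temperature
    mask = torch.eye(n, dtype=torch.bool, device=probs.device)
    logits = logits.masked_fill(mask, float('-inf'))  # exclude self-pairs

    # Row i (< B) has its positive at row i + B, and vice versa.
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(probs.device)
    loss = F.cross_entropy(logits, targets)

    reg = jeffrey_divergence(probs1, probs2).mean()
    return loss + lam * reg

For the ClusterTree variant, the L1 distance would instead be taken between edge-weighted tree-embedded vectors rather than the raw probability vectors, and the ArcFace or simplicial embedding probability models would replace the plain softmax in the same slot.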