GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning

Cited by: 2
Authors
Dong, Xin [1 ]
Wu, Ruize [2 ]
Xiong, Chao [1 ]
Li, Hai [1 ]
Cheng, Lei [2 ]
He, Yong [2 ]
Qian, Shiyou [3 ]
Cao, Jian [3 ]
Mo, Linjian [1 ]
Affiliations
[1] Ant Grp, Shanghai, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022
Keywords
multi-task learning; orthogonal decomposition; gradient conflict;
DOI
10.1145/3511808.3557333
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Discipline Code
0812 ;
Abstract
Multi-task learning (MTL) aims to solve multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degradation caused by negative transfer when several tasks are learned at once. Related work attributes this problem to conflicting gradients, in which case the gradient updates useful to all tasks must be selected carefully. To this end, we propose a novel optimization approach for MTL, named GDOD, which manipulates the gradient of each task using an orthogonal basis decomposed from the span of all task gradients. GDOD explicitly decomposes gradients into task-shared and task-conflict components and adopts a general update rule that avoids interference across all task gradients, so that the update directions are guided by the task-shared components. Moreover, we prove the convergence of GDOD theoretically under both convex and non-convex assumptions. Experimental results on several multi-task datasets not only demonstrate the significant improvement GDOD brings to existing MTL models but also show that our algorithm outperforms state-of-the-art optimization methods in terms of AUC and Logloss.
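To make the decomposition concrete, the following is a minimal, hypothetical sketch of the idea stated in the abstract, not the authors' exact GDOD update rule: per-task gradients are projected onto an orthonormal basis of their span (obtained here via a QR factorization), basis directions on which all tasks' coefficients agree in sign are treated as task-shared, conflicting directions are dropped, and the update is formed from the shared part only. The function name shared_update_direction and the sign-agreement test are illustrative assumptions.

# Hypothetical sketch: orthogonal decomposition of per-task gradients over a basis
# of their span; the shared/conflict split by sign agreement is an assumption for
# illustration, not the paper's exact rule.
import numpy as np

def shared_update_direction(task_grads):
    """task_grads: list of 1-D arrays, one flattened gradient per task."""
    G = np.stack(task_grads, axis=1)        # shape (n_params, n_tasks)
    Q, _ = np.linalg.qr(G)                  # orthonormal basis of span{g_1, ..., g_T}
    coeffs = Q.T @ G                        # coordinates of each task gradient in that basis
    # A basis direction counts as task-shared if every task's coefficient has the same sign.
    shared = np.all(coeffs > 0, axis=1) | np.all(coeffs < 0, axis=1)
    avg_coeffs = coeffs.mean(axis=1) * shared   # keep shared directions, zero out conflicts
    return Q @ avg_coeffs                   # update direction back in parameter space

# Example with two partially conflicting task gradients.
g1 = np.array([1.0, 1.0, 0.5])
g2 = np.array([1.0, -1.0, 0.4])
d = shared_update_direction([g1, g2])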
Pages: 386-395
Page count: 10