Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

Cited by: 784
Authors
Ma, Jiaqi [1 ,2 ]
Zhao, Zhe [2 ]
Yi, Xinyang [2 ]
Chen, Jilin [2 ]
Hong, Lichan [2 ]
Chi, Ed H. [2 ]
Affiliations
[1] University of Michigan, School of Information, Ann Arbor, MI 48109 USA
[2] Google Inc., Mountain View, CA 94043 USA
Source
KDD'18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining | 2018
Keywords
multi-task learning; mixture of experts; neural network; recommendation system;
DOI
10.1145/3219819.3220007
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users with movies they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure yields an additional trainability benefit, depending on the level of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements of MMoE on real tasks, including a binary classification benchmark and a large-scale content recommendation system at Google.
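The abstract describes the MMoE architecture only at a high level: expert sub-networks shared by all tasks, and one gating network per task that computes softmax weights over the experts before a task-specific tower makes the prediction. Below is a minimal NumPy sketch of that forward pass; the layer sizes, single-hidden-layer experts, linear towers, and all variable names are illustrative assumptions rather than the configuration used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (assumed, not from the paper).
d_in, d_expert, n_experts, n_tasks = 16, 8, 4, 2

# Expert sub-networks shared across all tasks (one hidden layer each).
W_exp = rng.normal(size=(n_experts, d_in, d_expert))
# One gating network per task: maps the input to softmax weights over experts.
W_gate = rng.normal(size=(n_tasks, d_in, n_experts))
# One task-specific tower per task producing a scalar prediction.
W_tower = rng.normal(size=(n_tasks, d_expert, 1))

def mmoe_forward(x):
    # x: (batch, d_in) -> list of (batch, 1) predictions, one per task.
    # Shared expert outputs: (batch, n_experts, d_expert)
    expert_out = np.stack([relu(x @ W_exp[i]) for i in range(n_experts)], axis=1)
    preds = []
    for t in range(n_tasks):
        # Task-specific gate over the shared experts: (batch, n_experts)
        gate = softmax(x @ W_gate[t], axis=-1)
        # Gate-weighted mixture of expert outputs: (batch, d_expert)
        mixed = np.einsum('be,bed->bd', gate, expert_out)
        # Task-specific tower on top of the mixture.
        preds.append(mixed @ W_tower[t])
    return preds

y_task1, y_task2 = mmoe_forward(rng.normal(size=(32, d_in)))

Because each task owns its gate, weakly related tasks can learn to weight different experts while closely related tasks can converge to similar mixtures, which is the mechanism the abstract credits for the improved performance when tasks are less related.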
Pages: 1930-1939
Number of pages: 10