Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning

Cited by: 25
Authors
Yu, Tong [1 ]
Shen, Yilin [1 ]
Zhang, Ruiyi [1 ,2 ]
Zeng, Xiangyu [1 ,3 ]
Jin, Hongxia [1 ]
Affiliations
[1] Samsung Res Amer, Mountain View, CA 94043 USA
[2] Duke Univ, Durham, NC USA
[3] Columbia Univ, New York, NY USA
Source
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019
Keywords
interactive recommender system; multimodal; vision and language; reinforcement learning; image retrieval
DOI
10.1145/3343031.3350935
Chinese Library Classification (CLC)
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Interactive recommenders have demonstrated an advantage over traditional recommenders when the set of items changes dynamically. However, traditional user feedback in the form of clicks or ratings provides limited preference information and limited history-tracking capability. As a result, it takes a user many interactions to find a desired item. Data of other modalities, such as item visual appearance and user comments in natural language, can enable richer user feedback. However, several critical challenges must be addressed when utilizing these multimodal data: multimodal matching, user preference tracking, and adaptation to dynamic unseen items. Without properly handling these challenges, recommendations can easily violate the preferences users expressed in their past natural language feedback. In this paper, we introduce a novel approach, called vision-language recommendation, that lets users provide natural language feedback on visual products for more natural and effective interactions. To model more explicit and accurate multimodal matching, we propose a novel visual attribute augmented reinforcement learning approach that strengthens the grounding of natural language in visual items. Furthermore, to effectively track users' preferences and overcome the performance deficiency on dynamic unseen items after deployment, we propose a novel history multimodal matching reward that continuously adapts the model on the fly. Empirical results show that our system, augmented with visual attributes and history multimodal matching, significantly increases the success rate, reduces the number of recommendations that violate the user's previous feedback, and requires fewer user interactions to find the desired items.
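The history multimodal matching reward is only described at a high level in this abstract. The sketch below shows one minimal way such a reward could be computed, assuming candidate items and past natural-language feedback are both mapped into a shared embedding space; the function names, the min-based aggregation, and the `alpha` weighting are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def history_matching_reward(item_emb, feedback_embs):
    """Score a candidate item against ALL of the user's past feedback.

    item_emb      -- multimodal embedding of the candidate item
                     (e.g. visual features fused with attribute text); assumed.
    feedback_embs -- embeddings of the user's past natural-language feedback turns.

    Aggregating with min means a single violated past constraint drags the
    reward down, discouraging recommendations that contradict earlier feedback.
    """
    if not feedback_embs:
        return 0.0
    return min(cosine(item_emb, f) for f in feedback_embs)

def turn_reward(reached_target, item_emb, feedback_embs, alpha=0.5):
    """Per-turn RL reward: task-success signal plus the history-consistency term.

    `alpha` (an assumed weighting) trades off reaching the target item quickly
    against staying consistent with everything the user has said so far.
    """
    return float(reached_target) + alpha * history_matching_reward(item_emb, feedback_embs)
```

In an interactive loop, a reward of this shape would be fed to the policy update after each recommendation turn, so the agent is pushed both toward the target item and toward consistency with the accumulated feedback history.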
Pages: 39-47
Number of pages: 9