Cross-modal multi-headed attention for long multimodal conversations

Cited by: 1
Authors
Belagur, Harshith [1 ]
Reddy, N. Saketh [1 ]
Krishna, P. Radha [1 ]
Tumuluri, Raj [2 ]
Affiliations
[1] Natl Inst Technol Warangal, Dept Comp Sci & Engn, Warangal, India
[2] Openstream Inc, Somerset, NJ USA
Keywords
Conversational AI; Multimodality; Natural Language Processing; Computer Vision; Deep Learning; Fashion Domain;
DOI
10.1007/s11042-023-15606-4
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Most conversational AI agents in today's marketplace are unimodal: only text is exchanged between the user and the bot. However, employing additional modes (e.g., images) in the interaction improves the customer experience, potentially increasing efficiency and profits in applications such as online shopping. Most existing techniques rely on feature extraction from the multimodal inputs, but very few works have applied the multi-headed attention of transformers to conversational AI. In this work, we propose a novel architecture called Cross-modal Multi-headed Hierarchical Encoder-Decoder with Sentence Embeddings (CMHRED-SE) to enhance the quality of natural language responses by better understanding features such as color, sentence structure, and the continuity of the conversation. CMHRED-SE uses multi-headed attention together with image representations from the VGGNet19 and ResNet50 architectures to improve effectiveness in fashion domain-specific conversations. CMHRED-SE is compared with two similar models, M-HRED and MHRED-attn, and the quality of the answers returned by the models is evaluated using BLEU-4, ROUGE-L, and cosine similarity scores. The evaluation shows improvements of 5% in cosine similarity, 9% in ROUGE-L F1 score, and 11% in BLEU-4 score over the baseline models. The results also show that, by leveraging sentence embeddings, our approach better understands the dialogue and generates clearer textual responses.
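The core idea in the abstract (multi-headed attention fusing CNN image features with sentence embeddings) can be illustrated with a short PyTorch sketch. This is not the authors' implementation: the class name CrossModalAttention, the choice of ResNet50 as the image encoder, the linear projection, and all dimensions are assumptions for illustration.

```python
# Minimal sketch of cross-modal multi-headed attention (illustrative only):
# textual sentence embeddings attend over CNN image-region features.
import torch
import torch.nn as nn
import torchvision.models as models

class CrossModalAttention(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Pretrained ResNet50 as the image encoder; VGGNet19 is analogous.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final pooling and classifier layers to keep a 7x7 grid
        # of 2048-d region features per 224x224 image.
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.img_proj = nn.Linear(2048, d_model)  # map to a shared space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, sent_emb, images):
        # sent_emb: (batch, seq_len, d_model) utterance/sentence embeddings
        # images:   (batch, 3, 224, 224) preprocessed product images
        feats = self.cnn(images)                  # (batch, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (batch, 49, 2048)
        keys = self.img_proj(feats)               # (batch, 49, d_model)
        # Queries come from text; keys and values from image regions.
        fused, _ = self.attn(sent_emb, keys, keys)
        return fused                              # cross-modal context
```

In a hierarchical encoder-decoder of the HRED family, this fused context would feed the context-level encoder before the decoder generates the response.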
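The three reported metrics are standard and can be reproduced with common libraries. A minimal sketch follows; the BLEU smoothing choice and the all-MiniLM-L6-v2 embedding model are assumptions, since the record does not specify which sentence encoder backs the cosine score.

```python
# Minimal sketch of the reported metrics (illustrative, not the paper's code).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def score_response(reference: str, hypothesis: str) -> dict:
    # BLEU-4: 4-gram precision, smoothed for short dialogue responses.
    bleu4 = sentence_bleu([reference.split()], hypothesis.split(),
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    # ROUGE-L F1: longest-common-subsequence overlap.
    rouge_l = rouge_scorer.RougeScorer(['rougeL']).score(
        reference, hypothesis)['rougeL'].fmeasure
    # Cosine similarity of sentence embeddings (encoder is an assumption).
    encoder = SentenceTransformer('all-MiniLM-L6-v2')
    ref_vec, hyp_vec = encoder.encode([reference, hypothesis])
    cosine = float(cosine_similarity([ref_vec], [hyp_vec])[0][0])
    return {'BLEU-4': bleu4, 'ROUGE-L': rouge_l, 'Cosine': cosine}
```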
Pages: 45679-45697
Page count: 19