An interactive network based on transformer for multimodal crowd counting

Times cited: 6
Authors
Yu, Ying [1 ]
Cai, Zhen [1 ]
Miao, Duoqian [2 ]
Qian, Jin [1 ]
Tang, Hong [1 ]
Affiliations
[1] East China Jiaotong Univ, Coll Software Engn, Nanchang 330013, Peoples R China
[2] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Crowd counting; Transformer; Multimodal data; Feature fusion;
DOI
10.1007/s10489-023-04721-2
CLC classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Crowd counting is the task of estimating the total number of pedestrians in an image. Most existing research addresses scenes with good visibility, such as parks, squares, and brightly lit shopping malls during the day, whereas complex scenes in darkness have received little attention. To study this problem, we propose an interactive network based on Transformer for multimodal crowd counting. First, sliding convolutional encoding is applied to the input images to obtain better encoded features. These features are extracted through the designed primary interaction network and then modulated with channel token attention. Next, the FGAF-MLP fuses high- and low-level semantics to enhance feature expression and to fully fuse data from the different modalities, improving the accuracy of the method. To verify the effectiveness of our method, we conducted extensive ablation experiments on the latest multimodal benchmark, RGBT-CC, confirming both the complementarity of the modalities and the effectiveness of the model components. We also evaluated our method on the ShanghaiTechRGBD benchmark. The experimental results show that the proposed method performs well, achieving an improvement of more than 10% in mean absolute error and mean squared error on the RGBT-CC benchmark.
Pages: 22602-22614
Page count: 13
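
The abstract names several architectural components (sliding convolutional encoding, channel token attention, FGAF-MLP fusion) without defining them. The sketch below, written in PyTorch, illustrates one plausible reading of two of these ideas: an overlapping ("sliding") convolutional token embedding and a channel-wise gating of token features. All class names, hyperparameters, and the simple additive fusion at the end are assumptions made for illustration; they are not taken from the paper, and the primary interaction network and FGAF-MLP are not reproduced here.

# Illustrative sketch only (assumptions): the abstract does not define the paper's
# "sliding convolutional encoding" or "channel token attention", so the shapes,
# names, and hyperparameters below are invented for illustration.
import torch
import torch.nn as nn


class SlidingConvEmbedding(nn.Module):
    """Overlapping (stride < kernel) convolutional token embedding for one modality."""

    def __init__(self, in_channels=3, embed_dim=256, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, D, H', W')
        h, w = x.shape[2], x.shape[3]
        tokens = x.flatten(2).transpose(1, 2)    # (B, N, D) with N = H' * W'
        return self.norm(tokens), (h, w)


class ChannelTokenAttention(nn.Module):
    """Squeeze-and-excitation-style gating over the channel dimension of token features."""

    def __init__(self, embed_dim=256, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim // reduction, embed_dim),
            nn.Sigmoid(),
        )

    def forward(self, tokens):                   # tokens: (B, N, D)
        weights = self.gate(tokens.mean(dim=1))  # (B, D): one weight per channel
        return tokens * weights.unsqueeze(1)     # channel-wise modulation of every token


if __name__ == "__main__":
    # Encode RGB and thermal inputs separately, modulate each stream, then fuse by
    # a plain sum (the paper's FGAF-MLP fusion is not reproduced in this sketch).
    rgb, thermal = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
    embed_rgb, embed_thermal = SlidingConvEmbedding(), SlidingConvEmbedding()
    attn = ChannelTokenAttention()
    tokens_rgb, _ = embed_rgb(rgb)
    tokens_thermal, _ = embed_thermal(thermal)
    fused = attn(tokens_rgb) + attn(tokens_thermal)   # (B, N, D) fused token features
    print(fused.shape)                                # torch.Size([2, 3136, 256])

Here the stride being smaller than the kernel size is what makes the embedding "sliding" (neighbouring tokens overlap), and the gating module rescales each embedding channel from a token-pooled descriptor, a squeeze-and-excitation-style stand-in for channel token attention.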