An interactive network based on transformer for multimodal crowd counting

Cited by: 0

Authors:

Ying Yu

Zhen Cai

Duoqian Miao

Jin Qian

Hong Tang

Affiliations:

[1] College of Software Engineering,

[2] Department of Computer Science and Technology,

Source:

Applied Intelligence | 2023 / Vol. 53

Keywords:

Crowd counting; Transformer; Multimodal data; Feature fusion;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Crowd counting is a task to estimate the total number of pedestrians in an image. In most of the existing research, good vision problems, such as in parks, squares, and bright shopping malls during the day, have been addressed. However, there is little research on complex scenes in darkness. To study this problem, we propose an interactive network based on Transformer for multi-modal crowd counting. First, sliding convolutional encoding is adopted for the image to obtain better encoding features. The features are extracted through the designed primary interaction network, and then channel token attention is used to modulate the features. Then, the FGAF-MLP is used for high and low semantic fusion to enhance the feature expression and fully fuse the data in different modes to improve the accuracy of the method. To verify the effectiveness of our method, we conducted extensive ablation experiments with the latest multimodal benchmark RGBT-CC, and we verified the complementarity between multiple modal data and the effectiveness of the model components. We also verified the effectiveness of our method with the ShanghaiTechRGBD benchmark. The experimental results showed that our proposed method exhibits good results and achieves an improvement of more than 10% in terms of the mean average error and mean squared error for the RGBT-CC benchmark.

Pages: 22602-22614

Page count: 12
