Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

Cited by: 132

Authors:

Huang, Jie [1]

Zhou, Wengang [1]

Li, Houqiang [1]

Li, Weiping [1]

Affiliations:

[1] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Anhui, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2019, Vol. 29, No. 9

Keywords:

Sign language recognition; 3D convolutional neural networks; attention mechanism; deep learning; SYSTEM; MODEL

DOI:

10.1109/TCSVT.2018.2870740

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Sign language recognition (SLR) is an important and challenging research topic in the multimedia field. Conventional SLR techniques rely on hand-crafted features and have achieved limited success. In this paper, we present attention-based 3D convolutional neural networks (3D-CNNs) for SLR. The framework has two advantages: 3D-CNNs learn spatio-temporal features from raw video without prior knowledge, and the attention mechanism helps to select the informative cues. When training the 3D-CNN to capture spatio-temporal features, spatial attention is incorporated into the network to focus on the areas of interest. After feature extraction, temporal attention is used to select the significant motions for classification. The proposed method is evaluated on two large-scale sign language data sets. The first, which we collected ourselves, is a Chinese sign language data set that consists of 500 categories. The other is the ChaLearn14 benchmark. The experimental results demonstrate the effectiveness of our approach compared with state-of-the-art algorithms.
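
The temporal attention step mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes each video has already been split into clips with one feature vector per clip, and it uses a simple dot-product score followed by a softmax, which is one common way to build attention pooling. The names `temporal_attention_pool`, `clip_features`, and `score_vector` are hypothetical and chosen only for illustration.

```python
import math


def temporal_attention_pool(clip_features, score_vector):
    """Pool per-clip features into one video-level feature with softmax attention.

    clip_features: list of T feature vectors, each a list of D floats (assumed layout).
    score_vector:  list of D floats, a hypothetical learned scoring parameter.
    Returns (pooled_feature, attention_weights).
    """
    # One relevance score per clip: dot product of the clip feature and the score vector.
    scores = [sum(f * w for f, w in zip(feat, score_vector)) for feat in clip_features]

    # Softmax over time; subtracting the maximum keeps the exponentials stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the clip features.
    dim = len(clip_features[0])
    pooled = [
        sum(wt * feat[d] for wt, feat in zip(weights, clip_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three clips with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# A zero score vector scores every clip equally, so pooling reduces to a plain mean.
uniform_pooled, uniform_weights = temporal_attention_pool(features, [0.0, 0.0])

# A score vector that favours large activations concentrates weight on the last clip.
peaked_pooled, peaked_weights = temporal_attention_pool(features, [10.0, 10.0])
```

In a trained model the score vector would be learned jointly with the classifier, so clips containing the significant motions receive larger weights than transitional or idle clips.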

Pages: 2822-2832

Page count: 11
