Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

Cited by: 132
Authors
Huang, Jie [1]
Zhou, Wengang [1]
Li, Houqiang [1]
Li, Weiping [1]
Affiliation
[1] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Anhui, Peoples R China
Keywords
Sign language recognition; 3D convolutional neural networks; attention mechanism; deep learning; SYSTEM; MODEL
DOI
10.1109/TCSVT.2018.2870740
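Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract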
Sign language recognition (SLR) is an important and challenging research topic in the multimedia field. Conventional techniques for SLR rely on hand-crafted features, which achieve limited success. In this paper, we present attention-based 3D-convolutional neural networks (3D-CNNs) for SLR. The framework has two advantages: 3D-CNNs learn spatio-temporal features from raw video without prior knowledge and the attention mechanism helps to select the clue. When training 3D-CNN for capturing spatio-temporal features, spatial attention is incorporated into the network to focus on the areas of interest. After feature extraction, temporal attention is utilized to select the significant motions for classification. The proposed method is evaluated on two large scale sign language data sets. The first one, collected by ourselves, is a Chinese sign language data set that consists of 500 categories. The other is the ChaLearn14 benchmark. The experiment results demonstrate the effectiveness of our approach compared with state-of-the-art algorithms.
Pages: 2822-2832
Page count: 11