ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network

Cited by: 72
Authors
Min, Weiqing [1, 2]
Liu, Linhu [1, 2]
Wang, Zhiling [1, 2]
Luo, Zhengdong [1, 2]
Wei, Xiaoming [3]
Wei, Xiaolin [3]
Jiang, Shuqiang [1, 2]
Affiliations
[1] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Meituan Dianping Grp, Hong Kong, Peoples R China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Funding
National Natural Science Foundation of China;
Keywords
Food Recognition; Food Datasets; Benchmark; Deep Learning;
DOI
10.1145/3394171.3414031
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Food recognition has received increasing attention in the multimedia community for its various real-world applications, such as diet management and self-service restaurants. A large-scale ontology of food images is urgently needed for developing advanced large-scale food recognition algorithms, as well as for providing a benchmark dataset for such algorithms. To encourage further progress in food recognition, we introduce the ISIA Food-500 dataset, with 500 categories drawn from the Wikipedia list and 399,726 images, a more comprehensive food dataset that surpasses existing popular benchmark datasets in category coverage and data volume. Furthermore, we propose a stacked global-local attention network for food recognition, which consists of two sub-networks. One sub-network first uses hybrid spatial-channel attention to extract more discriminative features, and then aggregates these multi-scale discriminative features from multiple layers into a global-level representation (e.g., texture and shape information about food). The other generates attentional regions (e.g., ingredient-relevant regions) from different regions via cascaded spatial transformers, and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused into a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of the proposed method, which can thus be considered a strong baseline. The dataset, code and models can be found at http://123.57.42.89/FoodComputing-Dataset/ISIA-Food500.html.
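The two-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The sigmoid gating used for the attention steps, the fixed crop standing in for the cascaded spatial transformers, and fusion by concatenation are all assumptions made for illustration, as are the function names. A feature map is represented as a nested list indexed as [channel][row][column].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    # Assumed form: scale each channel by a sigmoid gate on its global mean.
    out = []
    for ch in fmap:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = sigmoid(mean)
        out.append([[v * gate for v in row] for row in ch])
    return out

def spatial_attention(fmap):
    # Assumed form: scale each location by a sigmoid gate on its cross-channel mean.
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    gate = [[sigmoid(sum(fmap[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[fmap[k][i][j] * gate[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]

def global_pool(fmap):
    # Average each channel down to one number.
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def crop(fmap, top, left, size):
    # Hypothetical stand-in for a learned attentional region: a fixed square crop.
    return [[row[left:left + size] for row in ch[top:top + size]] for ch in fmap]

def global_local_representation(fmap, region):
    # Global branch: hybrid channel + spatial attention, then pooling.
    global_feat = global_pool(spatial_attention(channel_attention(fmap)))
    # Local branch: pool features inside the (here hand-picked) region.
    top, left, size = region
    local_feat = global_pool(crop(fmap, top, left, size))
    # Fusion by concatenation (an assumption; the abstract only says "fused").
    return global_feat + local_feat

fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
rep = global_local_representation(fmap, (0, 0, 1))
print(len(rep))
```

With two channels, the fused vector has four entries: two global-level values followed by two local-level values. In the paper's setting the region would be predicted by the network and features would come from multiple layers; neither is modeled here.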
Pages: 393-401
Page count: 9