MiikeMineStamps: A Long-Tailed Dataset of Japanese Stamps via Active Learning

Authors
Buitrago, Paola A. [1, 2]
Toropov, Evgeny [3]
Prabha, Rajanie [1, 2]
Uran, Julian [1, 2]
Adal, Raja [4]
Affiliations
[1] Pittsburgh Supercomp Ctr, Pittsburgh, PA 15203 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15203 USA
[3] DeepMap Inc, East Palo Alto, CA 94303 USA
[4] Univ Pittsburgh, Pittsburgh, PA 15260 USA
Source
DOCUMENT ANALYSIS AND RECOGNITION, ICDAR 2021, PT III | 2021 / Vol. 12823
Funding
U.S. National Science Foundation;
Keywords
Active learning; Object detection; Long tail; Open set; Stamp; Japanese; Historical; Dataset;
DOI
10.1007/978-3-030-86334-0_1
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Mining existing image datasets with rich information can help advance knowledge across domains in the humanities and social sciences. In the past, the extraction of this information was often prohibitively expensive and labor-intensive. AI can provide an alternative, making it possible to speed up the labeling and mining of large and specialized datasets via a human-in-the-loop method of active learning (AL). Although AL methods are helpful for certain scenarios, they present limitations when the set of classes is not known before labeling (i.e., an open-ended set) and the distribution of objects across classes is highly unbalanced (i.e., a long-tailed distribution). To address these limitations in object detection scenarios, we propose a multi-step approach consisting of 1) object detection of a generic "object" class, and 2) image classification with an open class set and a long-tailed distribution. We apply our approach to recognizing stamps in a large compendium of historical documents from the Japanese company Mitsui Mi'ike Mine, one of the largest business archives in modern Japan, which spans half a century, includes tens of thousands of documents, and has been widely used by labor historians, business historians, and others. To test our approach, we produce and make publicly available the novel and expert-curated MiikeMineStamps dataset. This unique dataset consists of 5056 images of 405 different Japanese stamps and is, to the best of our knowledge, the first published dataset of historical Japanese stamps. We hope that the MiikeMineStamps dataset will become a useful tool to further explore the application of AI methods to the study of historical documents in Japan and throughout the world of Chinese characters, as well as serve as a benchmark for image classification algorithms with an open-ended and highly unbalanced class set.
Pages: 3-19
Number of pages: 17
Related papers
8 items in total
  • [1] Long-tailed recognition via key attribute learning
    Fu, Yu
    Han, Jungong
    Chang, Xiang
    Chen, Changrui
    Shang, Changjing
    Shen, Qiang
    NEUROCOMPUTING, 2025, 627
  • [2] LONG-TAILED FEDERATED LEARNING VIA AGGREGATED META MAPPING
    Qian, Pinxin
    Lu, Yang
    Wang, Hanzi
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2010 - 2014
  • [3] Exploring the auxiliary learning for long-tailed visual recognition
    Zhang, Junjie
    Liu, Lingqiao
    Wang, Peng
    Zhang, Jian
    NEUROCOMPUTING, 2021, 449 : 303 - 314
  • [4] Open world long-tailed data classification through active distribution optimization
    Wang, Min
    Zhou, Lei
    Li, Qian
    Zhang, An-an
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 213
  • [5] DIM: long-tailed object detection and instance segmentation via dynamic instance memory
    Chen, Zhao-Min
    Jin, Xin
    Zhang, Xiaoqin
    Xia, Chaoqun
    Pan, Zhiyong
    Deng, Ruoxi
    Hu, Jie
    Chen, Heng
    MACHINE LEARNING-SCIENCE AND TECHNOLOGY, 2023, 4 (03):
  • [6] Learning Box Regression and Mask Segmentation Under Long-Tailed Distribution with Gradient Transfusing
    Wang, Tao
    Yuan, Li
    Wang, Xinchao
    Feng, Jiashi
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133 (02) : 951 - 967
  • [7] Dataset Refinement for Convolutional Neural Networks via Active Learning
    Liu, Siwen
    Zhu, Rong
    Luo, Yimin
    Wang, Zhongyuan
    Zhou, Liguo
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 565 - 574
  • [8] Tackling Micro-Expression Data Shortage via Dataset Alignment and Active Learning
    Ben, Xianye
    Gong, Chen
    Huang, Tianhuan
    Li, Chuanye
    Yan, Rui
    Li, Yujun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5429 - 5443