Training data selection based on dataset distillation for rapid deployment in machine-learning workflows

Cited by: 0
Authors
Yuna Jeong
Myunggwon Hwang
Wonkyung Sung
Affiliations
[1] Korea Institute of Science and Technology Information (KISTI), AI Technology Research Center
[2] University of Science and Technology (UST)
Source
Multimedia Tools and Applications | 2023 / Vol. 82
Keywords
Core-set; Training data; Data selection; Dataset distillation; Machine learning
Abstract
Recently, nonlinear machine-learning models have been effectively applied to multimedia data, contributing greatly to various downstream tasks. However, nonlinear models require large amounts of training data to properly train their many parameters and achieve reasonable performance. Using a large amount of data significantly increases time and cost, both of which are limited resources in model development and distribution. The goal of our study is to construct a core set that approximates the entire original dataset so that we can quickly observe performance changes caused by model redesign or parameter changes in machine-learning deployment. The core set is mainly composed of informative samples that contribute highly to training. We measure the contribution of each sample based on dataset distillation and perform area-based sampling for generalization. The core set can be constructed in a short time by measuring the learning contribution with only a small number of distilled images. The experimental results show that our method selects more useful samples than random sampling.
Pages: 9855-9870
Page count: 15