Training data selection based on dataset distillation for rapid deployment in machine-learning workflows

Cited by: 0

Authors:

Yuna Jeong

Myunggwon Hwang

Wonkyung Sung

Affiliations:

[1] Korea Institute of Science and Technology Information (KISTI), AI Technology Research Center

[2] University of Science and Technology (UST)

Source:

Multimedia Tools and Applications | 2023 / Vol. 82

Keywords:

Core-set; Training data; Data selection; Dataset distillation; Machine learning

DOI:

Not available

Abstract:

Recently, nonlinear machine-learning models have been applied effectively to multimedia data, contributing greatly to various downstream tasks. However, nonlinear models require large amounts of training data to train their many parameters properly and achieve reasonable performance. Using a large amount of data significantly increases time and cost, both of which are limited resources in model development and distribution. The goal of our study is to construct a core set that approximates the entire original dataset, so that performance changes caused by model redesign or parameter changes can be observed quickly during machine-learning deployment. The core set is composed mainly of informative samples that contribute strongly to training. We measure the contribution of each sample based on dataset distillation and perform area-based sampling for generalization. The core set can be constructed in a short time because the learning contribution is measured with only a small number of distilled images. The experimental results show that our method selects more useful samples than random sampling.
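
The abstract does not spell out how the contribution score is computed or how the sampling areas are defined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a per-sample contribution score is already available (for example, derived from a distillation step), and it interprets "area-based sampling" as drawing from contiguous rank bands of that score, with more of the budget given to high-score bands. The function name `select_core_set` and the parameters `budget` and `n_areas` are hypothetical.

```python
import random
from typing import List, Sequence


def select_core_set(scores: Sequence[float], budget: int,
                    n_areas: int = 4, seed: int = 0) -> List[int]:
    """Return `budget` sample indices, favoring high scores but covering all areas.

    ASSUMPTION: `scores` are precomputed per-sample contribution scores
    (higher = more informative). How they are obtained is outside this sketch.
    """
    if not 0 < budget <= len(scores):
        raise ValueError("budget must be between 1 and len(scores)")
    if n_areas < 1:
        raise ValueError("n_areas must be at least 1")

    rng = random.Random(seed)
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # ASSUMPTION: an "area" is a contiguous band of the ranking.
    band_size = -(-len(ranked) // n_areas)  # ceiling division
    bands = [ranked[k:k + band_size] for k in range(0, len(ranked), band_size)]

    # Linearly decreasing weights: the top band gets the largest share.
    weights = [len(bands) - b for b in range(len(bands))]
    total = sum(weights)

    chosen: List[int] = []
    for band, weight in zip(bands, weights):
        quota = min(len(band), budget * weight // total)
        chosen.extend(rng.sample(band, quota))

    # Rounding can leave the budget unfilled; top up with the best unused samples.
    used = set(chosen)
    for i in ranked:
        if len(chosen) >= budget:
            break
        if i not in used:
            chosen.append(i)
            used.add(i)
    return chosen


if __name__ == "__main__":
    toy_scores = [i / 100 for i in range(100)]  # hypothetical scores
    core = select_core_set(toy_scores, budget=20)
    print(sorted(core))
```

Drawing some samples from the low-score bands, rather than taking only the top-scoring ones, is one simple way to keep the selected subset from over-representing a narrow part of the data, which is the generalization concern the abstract mentions.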

Pages: 9855-9870

Page count: 15
