Data Distribution-Based Curriculum Learning

Cited by: 1
Authors
Chaudhry, Shonal [1]
Sharma, Anuraganand [1]
Affiliations
[1] Univ South Pacific, Sch Informat Technol Engn Math & Phys, Laucala Campus, Suva, Fiji
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Training data; Transfer learning; Random forests; Visualization; Robots; Legged locomotion; Classification algorithms; Curriculum development; Machine learning; Support vector machines; Classification; curriculum learning; data distribution; machine learning; neural network; random forest; support vector machine
DOI
10.1109/ACCESS.2024.3465793
CLC Number
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
The order of training samples can have a significant impact on a model's performance. Curriculum learning is an approach that gradually trains a model by ordering samples from 'easy' to 'hard'. This paper proposes a novel curriculum learning strategy called Data Distribution-based Curriculum Learning (DDCL). DDCL uses the inherent data distribution of a dataset to build a curriculum based on the order of samples. Our proposed approach is innovative in that it incorporates two distinct scoring methods, DDCL-Density and DDCL-Point, to determine the order of training samples. The DDCL-Density method assigns scores based on the density of samples, favoring denser regions that can make initial learning easier. Conversely, DDCL-Point scores samples by their Euclidean distance from the centroid of the dataset, providing an alternative perspective on sample difficulty. We evaluate the proposed DDCL approach by conducting experiments across various classifiers using a diverse set of small to medium-sized medical datasets. Results show that DDCL improves classification accuracy, achieving increases ranging from 2% to 10% compared to baseline methods and other state-of-the-art techniques. Moreover, analysis of the error losses for a single training epoch reveals that DDCL not only improves accuracy but also increases the convergence rate, underlining its potential for more efficient training. The findings suggest that DDCL can be of particular benefit to medical applications, where data is often limited, and indicate promising directions for future research in domains that involve limited datasets.
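The two scoring methods described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a k-nearest-neighbour distance as an inverse density estimate for DDCL-Density (the paper's exact density estimator may differ) and orders samples ascending by score, i.e. easy samples first. The function names `ddcl_point_order` and `ddcl_density_order` are hypothetical.

```python
import numpy as np

def ddcl_point_order(X):
    """DDCL-Point (sketch): order sample indices by Euclidean distance
    to the dataset centroid, closest (assumed 'easy') first."""
    centroid = X.mean(axis=0)
    dist = np.linalg.norm(X - centroid, axis=1)
    return np.argsort(dist)

def ddcl_density_order(X, k=5):
    """DDCL-Density (sketch): order sample indices from denser to sparser
    regions, using the mean distance to the k nearest neighbours as an
    inverse density estimate (small mean distance = dense region = first)."""
    # Pairwise Euclidean distances; the diagonal (self-distance) is excluded.
    diffs = X[:, None, :] - X[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(pairwise, np.inf)
    knn_mean = np.sort(pairwise, axis=1)[:, :k].mean(axis=1)
    return np.argsort(knn_mean)
```

A curriculum is then obtained by feeding the training samples to the classifier in the returned index order, e.g. `X[ddcl_point_order(X)]`.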
Pages: 138429-138440
Page count: 12