An efficient and effective wrapper based on paired t-test for learning naive Bayes classifiers from large-scale domains

Cited by: 7

Authors:

Kim, Chanju [1]

Li, Honglan [1]

Shin, Soo-Yong [2]

Hwang, Kyu-Baek [1]

Affiliations:

[1] Soongsil Univ, Sch Comp Sci & Engn, Seoul 156743, South Korea

[2] Univ Ulsan Coll Med, Asan Med Ctr, Dept Clin Epidemiol & Biostat, Seoul 138736, South Korea

Source:

4TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS-BIOLOGY AND BIOINFORMATICS (CSBIO2013) | 2013 / Vol. 23

Keywords:

feature selection; wrappers; Naive Bayes classifiers; microarray data; GENE SELECTION; CLASSIFICATION; CANCER; PREDICTION;

DOI:

10.1016/j.procs.2013.10.014

Chinese Library Classification:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Feature selection is a crucial step in supervised learning that influences the entire subsequent classification (or regression) process. Approaches to this task fall largely into two categories: filter-based and wrapper-based methods. For a given learning method, the latter generally produces better results than the former, though it consumes more computational resources when searching the feature subset space. In this paper, we propose an Efficient wRapper based on a Paired t-Test (ERPT) for choosing features from large-scale data consisting of thousands of variables, such as microarrays. Statistical tests are a reasonable option when the number of features is very large because they behave more predictably and can be more efficient than most search methods. The proposed method consists of two phases: a decrement phase and an increment phase. In the decrement phase, it selects strongly relevant features. In the increment phase, it adds weakly relevant features, given the previously selected features. Our method, combined with naive Bayes classifiers, has been tested in an extensive set of experiments on University of California Irvine (UCI) Machine Learning Repository data. The results showed that the performance of the proposed method is comparable to that of the backward search-based wrapper and superior to that of the forward search-based wrapper. Furthermore, it demonstrated much better performance than the forward search-based wrapper when applied to three microarray data sets, for which the backward search-based wrapper was impractical because of the computational burden involved. The proposed method has three merits: (1) it is applicable to data sets having thousands of variables, (2) it provides a theoretically sound and controllable criterion for thresholding features, and (3) it finds feature subsets that maximize classification performance on sparse domains. (C) 2013 The Authors. Published by Elsevier B.V.
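The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact test procedure, so the sketch assumes that "strongly relevant" means removing a feature from the full set significantly lowers per-fold accuracy, and "weakly relevant" means adding it to the already selected set significantly raises it. The names `two_phase_select`, `score_folds`, `toy_scorer`, and `t_crit` are hypothetical, and the toy scorer stands in for cross-validated naive Bayes accuracy.

```python
import math
from statistics import mean, stdev
from typing import Callable, FrozenSet, Iterable, Sequence, Set

# Hypothetical callback: per-fold accuracies of a classifier trained on a
# feature subset (e.g. naive Bayes under cross-validation). Not from the paper.
FoldScorer = Callable[[FrozenSet[str]], Sequence[float]]


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for the mean of (a - b) over matched folds."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    d_bar = mean(diffs)
    sd = stdev(diffs)
    if sd == 0.0:
        # Identical differences in every fold: no variance to scale by.
        return 0.0 if d_bar == 0.0 else math.copysign(math.inf, d_bar)
    return d_bar / (sd / math.sqrt(n))


def two_phase_select(features: Iterable[str], score_folds: FoldScorer,
                     t_crit: float) -> Set[str]:
    """Assumed reading of the decrement and increment phases."""
    full = frozenset(features)
    base = score_folds(full)
    # Decrement phase: keep a feature if dropping it from the full set
    # significantly lowers the per-fold scores (one-sided test).
    selected = {f for f in sorted(full)
                if paired_t_statistic(base, score_folds(full - {f})) > t_crit}
    # Increment phase: add a remaining feature if it significantly improves
    # the scores of the subset selected so far.
    for f in sorted(full - selected):
        current = frozenset(selected)
        gain = paired_t_statistic(score_folds(current | {f}),
                                  score_folds(current))
        if gain > t_crit:
            selected.add(f)
    return selected


# Toy per-fold data (made up for illustration): "a" helps on its own,
# "b" and "c" are redundant copies of each other, "noise" does nothing.
BASE = [0.50, 0.52, 0.49, 0.51, 0.50]
A_GAIN = [0.20, 0.22, 0.18, 0.21, 0.19]
BC_GAIN = [0.10, 0.09, 0.11, 0.10, 0.12]


def toy_scorer(subset: FrozenSet[str]) -> Sequence[float]:
    scores = []
    for i in range(len(BASE)):
        s = BASE[i]
        if "a" in subset:
            s += A_GAIN[i]
        if "b" in subset or "c" in subset:
            s += BC_GAIN[i]
        scores.append(s)
    return scores


# 2.132 is the one-sided 5% critical value of Student's t with 4 degrees of
# freedom (5 folds); the significance level itself is an assumption here.
chosen = two_phase_select(["a", "b", "c", "noise"], toy_scorer, t_crit=2.132)
print(sorted(chosen))
```

In this toy setup, `b` and `c` are interchangeable, so neither passes the decrement phase, and only one of them is picked up in the increment phase once `a` has been selected. The critical value plays the role of the controllable threshold mentioned in the abstract: raising it makes the selection more conservative.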


Pages: 102-112

Page count: 11
