Data pre-processing to improve the mining of large feed databases

Cited by: 1

Authors:

Maroto-Molina, F. [1]

Gomez-Cabrera, A. [2]

Guerrero-Ginel, J. E. [2]

Garrido-Varo, A. [2]

Sauvant, D. [3]

Tran, G. [4]

Heuze, V. [4]

Perez-Marin, D. C. [2]

Affiliations:

[1] Univ Cordoba, Serv Informac Alimentos, Cordoba 14014, Spain

[2] Univ Cordoba, Dept Anim Prod, ETS Ingn Agron & Montes, Cordoba 14014, Spain

[3] AgroParisTech, UMR Physiol Nutr & Alimentat 791, F-75231 Paris 05, France

[4] AgroParisTech, Assoc Francaise Zootechnie, F-75231 Paris 05, France

Source:

ANIMAL | 2013 / Volume 7 / Issue 07

Keywords:

chemical composition; nutritive value; data integration; outlier mining; QUALITY

DOI:

10.1017/S1751731113000293

Chinese Library Classification number:

S8 [Animal husbandry, veterinary medicine, hunting, sericulture, apiculture]

Discipline classification code:

0905

Abstract:

The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.


Pages: 1128-1136

Number of pages: 9
