ydata-profiling: Accelerating data-centric AI with high-quality data

Cited by: 12
Authors
Clemente, Fabiana [1]
Ribeiro, Goncalo Martins [1]
Quemy, Alexandre [1]
Santos, Miriam Seoane [1]
Pereira, Ricardo Cardoso [1]
Barros, Alex [1]
Affiliations
[1] YData Labs Inc, Seattle, WA 98121 USA
Keywords
Exploratory data analysis; Data profiling; Data quality; Data-centric AI; Data Intrinsic Characteristics; Data Complexity; TRENDS; CLASSIFICATION; AUTOENCODERS; IMPUTATION
DOI
10.1016/j.neucom.2023.126585
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
ydata-profiling is an open-source Python package for advanced exploratory data analysis that enables users to generate data profiling reports in a simple, fast, and efficient manner, fostering a standardized and visual understanding of the data. Beyond traditional descriptive properties and statistics, ydata-profiling follows a Data-Centric AI approach to exploratory analysis, as it focuses on the automatic detection and highlighting of complex data characteristics often associated with potential data quality issues, such as high ratios of missing or imbalanced data, infinite, unique, or constant values, skewness, high correlation, high cardinality, non-stationarity, seasonality, duplicate records, and other inconsistencies. The source code, documentation, and examples are available in the GitHub repository: https://github.com/ydataai/ydata-profiling.
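To make a few of the characteristics named in the abstract concrete, the following is a minimal standard-library sketch of how missing, infinite, constant, unique, and imbalanced columns and duplicate records might be flagged. It is not the package's implementation: the function names, the alert labels, and both thresholds are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical thresholds, chosen for illustration only.
MISSING_RATIO_THRESHOLD = 0.1
IMBALANCE_RATIO_THRESHOLD = 0.9


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_alerts(values):
    """Return a sorted list of alert names for one column of values."""
    alerts = set()
    present = [v for v in values if not is_missing(v)]

    # High ratio of missing values.
    if values and (len(values) - len(present)) / len(values) > MISSING_RATIO_THRESHOLD:
        alerts.add("missing")

    # Any infinite floating-point value.
    if any(isinstance(v, float) and math.isinf(v) for v in present):
        alerts.add("infinite")

    counts = Counter(present)
    if len(counts) == 1:
        # Every non-missing entry is the same value.
        alerts.add("constant")
    elif present and len(counts) == len(present):
        # Every non-missing entry is distinct.
        alerts.add("unique")
    elif present and counts.most_common(1)[0][1] / len(present) > IMBALANCE_RATIO_THRESHOLD:
        # One value dominates the column.
        alerts.add("imbalanced")

    return sorted(alerts)


def duplicate_row_count(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(row) for row in rows)
    return sum(c - 1 for c in counts.values())


def profile(rows, column_names):
    """Collect per-column alerts and the duplicate-row count for a small table."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}
    return {
        "alerts": {name: column_alerts(vals) for name, vals in columns.items()},
        "duplicate_rows": duplicate_row_count(rows),
    }


# Toy table: the last row repeats the one before it.
example_rows = [
    (1, "PT", 0.5),
    (2, "PT", None),
    (3, "PT", float("inf")),
    (4, "PT", 1.5),
    (4, "PT", 1.5),
]
report = profile(example_rows, ["id", "country", "score"])
print(report)
```

In practice the package itself would be used rather than hand-written checks like these. Its documented entry point is the `ProfileReport` class in the `ydata_profiling` module, applied to a pandas DataFrame. The sketch above only illustrates the kind of rule that sits behind such alerts.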
Pages: 10