ydata-profiling: Accelerating data-centric AI with high-quality data

Cited by: 6
Authors
Clemente, Fabiana [1]
Ribeiro, Goncalo Martins [1]
Quemy, Alexandre [1]
Santos, Miriam Seoane [1]
Pereira, Ricardo Cardoso [1]
Barros, Alex [1]
Affiliations
[1] YData Labs Inc, Seattle, WA 98121 USA
Keywords
Exploratory data analysis; Data profiling; Data quality; Data-centric AI; Data Intrinsic Characteristics; Data Complexity; TRENDS; CLASSIFICATION; AUTOENCODERS; IMPUTATION;
DOI
10.1016/j.neucom.2023.126585
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
ydata-profiling is an open-source Python package for advanced exploratory data analysis that enables users to generate data profiling reports in a simple, fast, and efficient manner, fostering a standardized and visual understanding of the data. Beyond traditional descriptive properties and statistics, ydata-profiling follows a Data-Centric AI approach to exploratory analysis, as it focuses on the automatic detection and highlighting of complex data characteristics often associated with potential data quality issues, such as high ratios of missing or imbalanced data, infinite, unique, or constant values, skewness, high correlation, high cardinality, non-stationarity, seasonality, duplicate records, and other inconsistencies. The source code, documentation, and examples are available in the GitHub repository: https://github.com/ydataai/ydata-profiling.
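To make the kind of automatic detection described in the abstract more concrete, the following is a minimal standard-library sketch of a few of the listed checks (high missing ratio, infinite values, constant or unique columns, duplicate records). It is not the package's implementation or API: the function names `profile_column`, `count_duplicate_rows`, and `profile_table`, the alert labels, and the 0.5 missing-ratio threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def profile_column(values, missing_threshold=0.5):
    """Return alert labels for one column (hypothetical helper, not the library API).

    `None` marks a missing value; the threshold is an illustrative assumption.
    """
    alerts = []
    n = len(values)
    if n == 0:
        return alerts

    present = [v for v in values if v is not None]

    # High ratio of missing values.
    if 1 - len(present) / n > missing_threshold:
        alerts.append("MISSING")

    # Infinite values among the observed floats.
    if any(isinstance(v, float) and math.isinf(v) for v in present):
        alerts.append("INFINITE")

    # Constant column (one distinct value) or unique column (all distinct).
    distinct = set(present)
    if present and len(distinct) == 1:
        alerts.append("CONSTANT")
    elif present and len(distinct) == len(present):
        alerts.append("UNIQUE")

    return alerts


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row exactly."""
    counts = Counter(tuple(row) for row in rows)
    return sum(c - 1 for c in counts.values() if c > 1)


def profile_table(columns):
    """Profile a table given as a dict of column name -> list of values."""
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))
    return {
        "alerts": {name: profile_column(columns[name]) for name in names},
        "duplicate_rows": count_duplicate_rows(rows),
    }


# Illustrative data (made up for this sketch).
table = {
    "id": [1, 2, 3, 4, 5, 6, 7],
    "country": ["PT"] * 7,
    "income": [None, None, None, None, 1200.0, 1200.0, float("inf")],
}
report = profile_table(table)
```

Here `report["alerts"]` maps each column name to the alerts raised for it, and `report["duplicate_rows"]` counts exact repeated rows. A real profiler would add the statistical checks the abstract also mentions (skewness, correlation, cardinality, non-stationarity, seasonality), which this sketch omits.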
Pages: 10