Data collection and quality challenges in deep learning: a data-centric AI perspective

Cited by: 148
Authors
Whang, Steven Euijong [1]
Roh, Yuji [1]
Song, Hwanjun [2]
Lee, Jae-Gil [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
[2] Naver AI Lab, Seongnam, South Korea
Source
VLDB JOURNAL | 2023 / Volume 32 / Issue 04
Funding
National Research Foundation, Singapore
Keywords
Data collection; Data quality; Deep learning; Data-centric AI; Training data; Machine; Fairness; Bias
DOI
10.1007/s00778-022-00775-9
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data-centric AI is at the center of a fundamental shift in software engineering in which machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering must therefore be rethought so that data become a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many real-world datasets are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but instead require large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data by using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, they have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
Pages: 791-813
Page count: 23