The social construction of datasets: On the practices, processes, and challenges of dataset creation for machine learning

Cited by: 0
Authors
Orr, Will [1, 2]
Crawford, Kate [2, 3, 4]
Affiliations
[1] Univ Southern Calif, Annenberg Sch Commun & Journalism, Los Angeles, CA 90007 USA
[2] Microsoft Res New York City, New York, NY USA
[3] Univ Southern Calif, Annenberg Sch, Commun & STS, Los Angeles, CA 90007 USA
[4] Univ Southern Calif, Fac Sci Technol & Soc, Los Angeles, CA 90007 USA
Keywords
Accountability; artificial intelligence; datasets; design; machine learning; maintenance;
DOI
10.1177/14614448241251797
Chinese Library Classification number
G2 [Information and Knowledge Dissemination];
Discipline classification code
05; 0503;
Abstract
Despite the critical role that datasets play in how systems make predictions and interpret the world, the dynamics of their construction are not well understood. Drawing on a corpus of interviews with dataset creators, we uncover the messy and contingent realities of dataset preparation. We identify four key challenges in constructing datasets, including balancing the benefits and costs of increasing dataset scale, limited access to resources, a reliance on shortcuts for compiling datasets and evaluating their quality, and ambivalence regarding accountability for a dataset. These themes illustrate the ways in which datasets are not objective or neutral but reflect the personal judgments and trade-offs of their creators within wider institutional dynamics, working within social, technical, and organizational constraints. We underscore the importance of examining the processes of dataset creation to strengthen an understanding of responsible practices for dataset development and care.
Pages: 4955-4972
Page count: 18
Related papers
41 in total
  • [1] Baio A., 2022, Ai data laundering: How academic and nonprofit researchers shield tech companies from accountability
  • [2] A Survey of Handwritten Character Recognition with MNIST and EMNIST
    Baldominos, Alejandro
    Saez, Yago
    Isasi, Pedro
    [J]. APPLIED SCIENCES-BASEL, 2019, 9 (15):
  • [3] Birhane Abeba, 2021, arXiv, DOI [arXiv:2110.01963, DOI 10.48550/ARXIV.2110.01963]
  • [4] Bowker G., 1999, Sorting Things Out: Classification and its Consequences, DOI 10.7551/mitpress/6352.001.0001
  • [5] Brown TB, 2020, ADV NEUR IN, V33
  • [6] Chun W. H. K., 2021, Discriminating Data: Correlation, Neighborhoods, and the New Politics of Recognition
  • [7] Crawford Kate, 2019, Excavating AI: The Politics of Images in Machine Learning
  • [8] Crawford Kate, 2021, Atlas of AI
  • [9] Dodge J, 2021, 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), P1286
  • [10] The irreducible complexity of objectivity
    Douglas, H
    [J]. SYNTHESE, 2004, 138 (03): 453-473