The social construction of datasets: On the practices, processes, and challenges of dataset creation for machine learning

Cited by: 2
Authors
Orr, Will [1, 2]
Crawford, Kate [2, 3, 4]
Affiliations
[1] Univ Southern Calif, Annenberg Sch Commun & Journalism, Los Angeles, CA 90007 USA
[2] Microsoft Res New York City, New York, NY USA
[3] Univ Southern Calif, Annenberg Sch, Commun & STS, Los Angeles, CA 90007 USA
[4] Univ Southern Calif, Fac Sci Technol & Soc, Los Angeles, CA 90007 USA
Keywords
Accountability; artificial intelligence; datasets; design; machine learning; maintenance
DOI
10.1177/14614448241251797
Chinese Library Classification number
G2 [Information and Knowledge Dissemination]
Discipline classification code
05; 0503
Abstract
Despite the critical role that datasets play in how systems make predictions and interpret the world, the dynamics of their construction are not well understood. Drawing on a corpus of interviews with dataset creators, we uncover the messy and contingent realities of dataset preparation. We identify four key challenges in constructing datasets, including balancing the benefits and costs of increasing dataset scale, limited access to resources, a reliance on shortcuts for compiling datasets and evaluating their quality, and ambivalence regarding accountability for a dataset. These themes illustrate the ways in which datasets are not objective or neutral but reflect the personal judgments and trade-offs of their creators within wider institutional dynamics, working within social, technical, and organizational constraints. We underscore the importance of examining the processes of dataset creation to strengthen an understanding of responsible practices for dataset development and care.
Pages: 4955-4972
Page count: 18