Advances, challenges and opportunities in creating data for trustworthy AI

Cited by: 280

Authors:

Liang, Weixin [1]

Tadesse, Girmaw Abebe [2]

Ho, Daniel [3]

Li, Fei-Fei [1]

Zaharia, Matei [1]

Zhang, Ce [4]

Zou, James [1,5]

Affiliations:

[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA

[2] IBM Res Africa, Nairobi, Kenya

[3] Stanford Univ, Stanford Law Sch, Stanford, CA 94305 USA

[4] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[5] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA

Source:

NATURE MACHINE INTELLIGENCE | 2022 / Volume 4 / Issue 08

Keywords:

LANGUAGE; CONSENT;

DOI:

10.1038/s42256-022-00516-1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

It has rapidly become clear in the past few years that the creation, use and maintenance of high-quality annotated datasets for robust and reliable AI applications require careful attention. This Perspective discusses challenges, considerations and best practices for various stages in the data-to-AI pipeline, to encourage a more data-centric approach. As artificial intelligence (AI) transitions from research to deployment, creating the appropriate datasets and data pipelines to develop and evaluate AI models is increasingly the biggest challenge. Automated AI model builders that are publicly available can now achieve top performance in many applications. In contrast, the design and sculpting of the data used to develop AI often rely on bespoke manual work, and they critically affect the trustworthiness of the model. This Perspective discusses key considerations for each stage of the data-for-AI pipeline, from data design to data sculpting (for example, cleaning, valuation and annotation) and data evaluation, to make AI more reliable. We highlight technical advances that help to make the data-for-AI pipeline more scalable and rigorous. Furthermore, we discuss how recent data regulations and policies can impact AI.


Pages: 669-677

Page count: 9

