Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators

Cited by: 47
Authors
Liesenfeld, Andreas [1 ]
Lopez, Alianda [1 ]
Dingemanse, Mark [1 ]
Affiliations
[1] Radboud Univ Nijmegen, Ctr Language Studies, Nijmegen, Netherlands
Source
PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON CONVERSATIONAL USER INTERFACES, CUI 2023 | 2023
Funding
Dutch Research Council (NWO)
Keywords
open source; survey; chatGPT; large language models; RLHF; AI;
DOI
10.1145/3571884.3604316
CLC Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
Large language models that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary large language model for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.
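The abstract names seven dimensions on which each project is evaluated: code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. A minimal sketch of such a survey checklist is shown below; the class, field names, and scoring rule are illustrative assumptions, not the authors' actual instrument (which, per the paper, records degrees of openness rather than a single score).

```python
from dataclasses import dataclass, fields


@dataclass
class OpennessRecord:
    """One row of a hypothetical openness survey.

    The seven dimensions mirror those listed in the abstract; treating
    each as a boolean is a simplification for illustration.
    """
    code_open: bool            # source code released
    training_data_open: bool   # pre-training data documented and available
    weights_open: bool         # model weights downloadable
    rlhf_data_open: bool       # instruction-tuning / RLHF data shared
    license_open: bool         # OSI-style license
    documented: bool           # peer-reviewed or preprint documentation
    api_access: bool           # usable without a gated API

    def score(self) -> int:
        # Count how many openness dimensions the project satisfies.
        return sum(getattr(self, f.name) for f in fields(self))


# Example: a project that shares code and weights but inherits
# undocumented data and withholds its RLHF data.
example = OpennessRecord(
    code_open=True, training_data_open=False, weights_open=True,
    rlhf_data_open=False, license_open=True, documented=False,
    api_access=True,
)
print(example.score())  # prints 4
```

A real instrument would likely use graded levels per dimension (e.g. closed / partial / open) instead of a flat boolean sum, since the paper's point is precisely that openness is differentiated.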
Pages: 6