The (ab)use of Open Source Code to Train Large Language Models

Cited by: 5

Authors:

Al-Kaswan, Ali [1]

Izadi, Maliheh [1]

Affiliations:

[1] Delft Univ Technol, Delft, Netherlands

Source:

2023 IEEE/ACM 2ND INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE-BASED SOFTWARE ENGINEERING, NLBSE | 2023

DOI:

10.1109/NLBSE59153.2023.00008

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Pages: 9-10

Page count: 2
