The (ab)use of Open Source Code to Train Large Language Models

Cited by: 5
Authors
Al-Kaswan, Ali [1 ]
Izadi, Maliheh [1 ]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
Source
2023 IEEE/ACM 2ND INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE-BASED SOFTWARE ENGINEERING, NLBSE | 2023
Keywords
DOI
10.1109/NLBSE59153.2023.00008
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large, unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
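The abstract's central technical claim, that code LLMs memorize and emit training data verbatim, can be probed directly: prompt a model with the opening of a file suspected to be in its training corpus and check whether the greedy continuation reproduces the original text. The Python sketch below is a minimal illustration of such a check; the model name, file path, and split points are placeholder assumptions, not the authors' experimental setup.

# Minimal sketch of a verbatim-memorization probe (illustrative assumptions only).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"   # placeholder code LLM
SUSPECT_FILE = "example_training_file.py"     # placeholder path to a suspected training file

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

source = open(SUSPECT_FILE, encoding="utf-8").read()
prompt, reference = source[:200], source[200:400]   # prefix as prompt, next chunk as ground truth

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]                   # keep only generated tokens
completion = tokenizer.decode(new_tokens, skip_special_tokens=True)

# If the continuation reappears verbatim in the original file, the model has emitted memorized content.
snippet = completion.strip()[:100]
print("verbatim emission:", bool(snippet) and snippet in reference)

A positive result across many files would support the memorization concern; a rigorous study would additionally need to separate genuine memorization from code that is merely common boilerplate.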
Pages: 9 - 10
Number of pages: 2
Related Papers
50 records in total
  • [41] Iterative Refactoring of Real-World Open-Source Programs with Large Language Models
    Choi, Jinsu
    An, Gabin
    Yoo, Shin
    SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2024, 2024, 14767 : 49 - 55
  • [42] Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports
    Dorfner, Felix J.
    Juergensen, Liv
    Donle, Leonhard
    Al Mohamad, Fares
    Bodenmann, Tobias R.
    Cleveland, Mason C.
    Busch, Felix
    Adams, Lisa C.
    Sato, James
    Schultz, Thomas
    Kim, Albert E.
    Merkow, Jameson
    Bressem, Keno K.
    Bridge, Christopher P.
    RADIOLOGY, 2024, 313 (01)
  • [43] Automatic structuring of radiology reports with on-premise open-source large language models
    Woznicki, Piotr
    Laqua, Caroline
    Fiku, Ina
    Hekalo, Amar
    Truhn, Daniel
    Engelhardt, Sandy
    Kather, Jakob
    Foersch, Sebastian
    D'Antonoli, Tugba Akinci
    dos Santos, Daniel Pinto
    Baessler, Bettina
    Laqua, Fabian Christopher
    EUROPEAN RADIOLOGY, 2025, 35 (04) : 2018 - 2029
  • [44] Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research
    Bai, Xuefeng
    Xie, Yabo
    Zhang, Xin
    Han, Honggui
    Li, Jian-Rong
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2024, 64 (13) : 4958 - 4965
  • [45] Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain
    Ruiz, Maj Daniel C.
    Sell, John
arXiv
  • [46] The Use of Large Language Models in Education
    Xing, Wanli
    Nixon, Nia
    Crossley, Scott
    Denny, Paul
    Lan, Andrew
    Stamper, John
    Yu, Zhou
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2025,
  • [47] Staged Multi-Strategy Framework With Open-Source Large Language Models for Natural Language to SQL Generation
    Liu, Chuanlong
    Liao, Wei
    Xu, Zhen
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2025,
  • [48] Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking
    Zhuang, Shengyao
    Liu, Bing
    Koopman, Bevan
    Zuccon, Guido
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 8807 - 8817
  • [49] A general technique to train language models on language models
    Nederhof, MJ
    COMPUTATIONAL LINGUISTICS, 2005, 31 (02) : 173 - 185
  • [50] Evaluation of Large Language Models on Code Obfuscation (Student Abstract)
    Swindle, Adrian
    McNealy, Derrick
    Krishnan, Giri
    Ramyaa, Ramyaa
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23664 - 23666