The (ab)use of Open Source Code to Train Large Language Models

Cited by: 5

Authors:

Al-Kaswan, Ali [1]

Izadi, Maliheh [1]

Affiliations:

[1] Delft Univ Technol, Delft, Netherlands

Source:

2023 IEEE/ACM 2ND INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE-BASED SOFTWARE ENGINEERING, NLBSE | 2023

DOI:

10.1109/NLBSE59153.2023.00008

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Pages: 9-10

Page count: 2
