The (ab)use of Open Source Code to Train Large Language Models

Cited by: 5
Authors
Al-Kaswan, Ali [1 ]
Izadi, Maliheh [1 ]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
Source
2023 IEEE/ACM 2ND INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE-BASED SOFTWARE ENGINEERING, NLBSE | 2023
Keywords
DOI
10.1109/NLBSE59153.2023.00008
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large, unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
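The abstract's central technical claim, that code LLMs memorize and emit training data verbatim, can be probed directly: prompt a model with the opening of a file suspected to be in its training corpus and check whether the greedy continuation reproduces the original text. The Python sketch below is a minimal illustration of such a check; the model name, file path, and split points are placeholder assumptions, not the authors' experimental setup.

# Minimal sketch of a verbatim-memorization probe (illustrative assumptions only).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"   # placeholder code LLM
SUSPECT_FILE = "example_training_file.py"     # placeholder path to a suspected training file

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

source = open(SUSPECT_FILE, encoding="utf-8").read()
prompt, reference = source[:200], source[200:400]   # prefix as prompt, next chunk as ground truth

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]                   # keep only generated tokens
completion = tokenizer.decode(new_tokens, skip_special_tokens=True)

# If the continuation reappears verbatim in the original file, the model has emitted memorized content.
snippet = completion.strip()[:100]
print("verbatim emission:", bool(snippet) and snippet in reference)

A positive result across many files would support the memorization concern; a rigorous study would additionally need to separate genuine memorization from code that is merely common boilerplate.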
Pages: 9 - 10
Number of pages: 2
Related Papers
50 records in total
  • [41] Iterative Refactoring of Real-World Open-Source Programs with Large Language Models
    Choi, Jinsu
    An, Gabin
    Yoo, Shin
    SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2024, 2024, 14767 : 49 - 55
  • [42] Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports
    Dorfner, Felix J.
    Juergensen, Liv
    Donle, Leonhard
    Al Mohamad, Fares
    Bodenmann, Tobias R.
    Cleveland, Mason C.
    Busch, Felix
    Adams, Lisa C.
    Sato, James
    Schultz, Thomas
    Kim, Albert E.
    Merkow, Jameson
    Bressem, Keno K.
    Bridge, Christopher P.
    RADIOLOGY, 2024, 313 (01)
  • [43] Automatic structuring of radiology reports with on-premise open-source large language models
    Woznicki, Piotr
    Laqua, Caroline
    Fiku, Ina
    Hekalo, Amar
    Truhn, Daniel
    Engelhardt, Sandy
    Kather, Jakob
    Foersch, Sebastian
    D'Antonoli, Tugba Akinci
    dos Santos, Daniel Pinto
    Baessler, Bettina
    Laqua, Fabian Christopher
    EUROPEAN RADIOLOGY, 2025, 35 (04) : 2018 - 2029
  • [44] Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research
    Bai, Xuefeng
    Xie, Yabo
    Zhang, Xin
    Han, Honggui
    Li, Jian-Rong
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2024, 64 (13) : 4958 - 4965
  • [45] Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain
    Ruiz, Maj Daniel C.
    Sell, John
arXiv
  • [46] The Use of Large Language Models in Education
    Xing, Wanli
    Nixon, Nia
    Crossley, Scott
    Denny, Paul
    Lan, Andrew
    Stamper, John
    Yu, Zhou
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2025,
  • [47] Staged Multi-Strategy Framework With Open-Source Large Language Models for Natural Language to SQL Generation
    Liu, Chuanlong
    Liao, Wei
    Xu, Zhen
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2025,
  • [48] Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking
    Zhuang, Shengyao
    Liu, Bing
    Koopman, Bevan
    Zuccon, Guido
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 8807 - 8817
  • [49] A general technique to train language models on language models
    Nederhof, MJ
    COMPUTATIONAL LINGUISTICS, 2005, 31 (02) : 173 - 185
  • [50] Evaluation of Large Language Models on Code Obfuscation (Student Abstract)
    Swindle, Adrian
    McNealy, Derrick
    Krishnan, Giri
    Ramyaa, Ramyaa
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23664 - 23666