The (ab)use of Open Source Code to Train Large Language Models

被引:5
|
作者
Al-Kaswan, Ali [1 ]
Izadi, Maliheh [1 ]
机构
[1] Delft Univ Technol, Delft, Netherlands
来源
2023 IEEE/ACM 2ND INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE-BASED SOFTWARE ENGINEERING, NLBSE | 2023年
关键词
D O I
10.1109/NLBSE59153.2023.00008
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
引用
收藏
页码:9 / 10
页数:2
相关论文
共 50 条
  • [1] Enhancing Code Security Through Open-Source Large Language Models: A Comparative Study
    Ridley, Norah
    Branca, Enrico
    Kimber, Jadyn
    Stakhanova, Natalia
    FOUNDATIONS AND PRACTICE OF SECURITY, PT I, FPS 2023, 2024, 14551 : 233 - 249
  • [2] Comparative Analysis of Large Language Models in Source Code Analysis
    Erdoğan, Hüseyin
    Turan, Nezihe Turhan
    Onan, Aytuğ
    Lecture Notes in Networks and Systems, 2024, 1088 LNNS : 185 - 192
  • [3] Comparative Analysis of Large Language Models in Source Code Analysis
    Erdogan, Huseyin
    Turan, Nezihe Turhan
    Onan, Aytug
    INTELLIGENT AND FUZZY SYSTEMS, INFUS 2024 CONFERENCE, VOL 1, 2024, 1088 : 185 - 192
  • [4] CodeT5+: Open Code Large Language Models for Code Understanding and Generation
    Wang, Yue
    Le, Hung
    Gotmare, Akhilesh Deepak
    Bui, Nghi D. Q.
    Li, Junnan
    Hoi, Steven C. H.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1069 - 1088
  • [5] Language to Code with Open Source Software
    Tang, Lei
    Mao, Xiaoguang
    Zhang, Zhuo
    PROCEEDINGS OF 2019 IEEE 10TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS 2019), 2019, : 561 - 564
  • [6] Re: Open-Source Large Language Models in Radiology
    Kooraki, Soheil
    Bedayat, Arash
    ACADEMIC RADIOLOGY, 2024, 31 (10) : 4293 - 4293
  • [7] Servicing open-source large language models for oncology
    Ray, Partha Pratim
    ONCOLOGIST, 2024,
  • [8] Benchmarking Causal Study to Interpret Large Language Models for Source Code
    Rodriguez-Cardenas, Daniel
    Palacio, David N.
    Khati, Dipin
    Burke, Henry
    Poshyvanyk, Denys
    2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION, ICSME, 2023, : 329 - 334
  • [9] Evaluating Source Code Quality with Large Language Models: a comparative study
    da Silva Simões, Igor Regis
    Venson, Elaine
    arXiv,
  • [10] Mapping Source Code to Software Architecture by Leveraging Large Language Models
    Johansson, Nils
    Caporuscio, Mauro
    Olsson, Tobias
    SOFTWARE ARCHITECTURE, ECSA 2024 TRACKS AND WORKSHOPS, 2024, 14937 : 133 - 149