The (ab)use of Open Source Code to Train Large Language Models

Cited by: 5
Authors
Al-Kaswan, Ali [1]
Izadi, Maliheh [1]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
Source
2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), 2023
DOI
10.1109/NLBSE59153.2023.00008
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
Pages: 9-10
Page count: 2