Copyright Law and the Lifecycle of Machine Learning Models

Cited by: 13
Authors
Kretschmer, Martin [1, 2]
Margoni, Thomas [3]
Oruc, Pinar [4]
Affiliations
[1] Univ Glasgow, Sch Law, Intellectual Property Law, Glasgow, Scotland
[2] CREATe UK Copyright & Creat Econ Ctr, Glasgow, Scotland
[3] Univ Leuven KU Leuven, Ctr IT & IP Law CiTiP, Intellectual Property Law, Fac Law & Criminol, Leuven, Belgium
[4] Univ Manchester, Commercial Law, Manchester, England
Funding
UK Economic and Social Research Council (ESRC)
Keywords
Copyright; Artificial intelligence; Text mining; Data mining; EU; Digital single market; TEXT;
DOI
10.1007/s40319-023-01419-3
Chinese Library Classification number
D9 [Law]; DF [Law]
Discipline classification code
0301
Abstract
Machine learning, a subfield of artificial intelligence (AI), relies on large corpora of data as input for learning algorithms, resulting in trained models that can perform a variety of tasks. While data or information are not subject matter within copyright law, almost all materials used to construct corpora for machine learning are protected by copyright law: texts, images, videos, and so on. There are global policy moves to address the copyright implications of machine learning, in particular in the context of so-called "foundation models" that underpin generative AI. This paper takes a step back, exploring empirically three technological settings through detailed case studies. We set out the established industry methodology of a lifecycle of AI (collecting data, organising data, model training, model operation) to arrive at descriptions suitable for legal analysis. This will allow an assessment of the challenges for a harmonisation of rights, exceptions and disclosure under EU copyright law. The three case studies are: (1) machine learning for scientific purposes, in the context of a study of regional short-term letting markets; (2) Natural Language Processing (NLP), in the context of large language models; (3) computer vision, in the context of content moderation of images.
We find that the nature and quality of data corpora at the input stage is central to the lifecycle of machine learning. Because of the uncertain legal status of data collection and processing, combined with the competitive advantage gained by firms not disclosing technological advances, the inputs of the models deployed are often unknown. Moreover, the "lawful access" requirement of the EU exception for text and data mining may turn the exception into a decision by rightholders to allow machine learning in the context of their decision to allow access.
We assess policy interventions at EU level, seeking to clarify the legal status of input data via copyright exceptions, opt-outs or the forced disclosure of copyright materials. We find that the likely result is a fully copyright-licensed environment of machine learning that may have problematic effects for the structure of industry, innovation and scientific research.
Pages: 110-138
Number of pages: 29