Nordia News

Open-source piracy through AI – is AI built wrong?

By Tuomas Pelkonen
Published: 17.10.2023 | Posted in Insights

Understanding open-source and AI integration

Open-source software (OSS) is software with source code that anyone can inspect, modify, and enhance for free. It is often stored in a public repository and shared publicly. OSS typically includes a distribution licence that states how programmers can use, study, modify, and distribute the software. One of the most popular OSS licences is the MIT License. The licence allows a wide range of uses, including using the code in both open-source and proprietary projects, as long as the original licence and copyright notice are retained in the redistributed code.
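In practice, "retaining the original licence and copyright notice" often means keeping the upstream licence header intact in any file that reuses the code. The following sketch is purely illustrative; the project, author, and function names are invented for the example:

```python
# Example: a file redistributed from a hypothetical MIT-licensed project
# keeps the original notice intact at the top of the file.
#
# Copyright (c) 2021 Original Author
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software... (the full MIT License text would be retained here)

def greet(name: str) -> str:
    """Function reused unchanged from the hypothetical upstream project."""
    return f"Hello, {name}!"
```

Keeping the notice with the code (rather than in a separate document) is the simplest way to satisfy the MIT License's attribution condition when files are copied between projects.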

The ongoing artificial intelligence (AI) boom has also spread into programming, as AI is increasingly being utilised in various ways to improve efficiency, automate tasks, and enhance software development processes.

GitHub Copilot and the ongoing class-action case

The world’s most widely adopted AI-based coding assistant is GitHub Copilot. The tool was released in October 2021 by GitHub, which Microsoft acquired in 2018. GitHub Copilot is powered by OpenAI’s Codex and trained on billions of lines of code. The tool helps programmers write code faster by providing auto-filled suggestions based on their input.

While GitHub Copilot undoubtedly accelerates the code-writing process, concerns have arisen regarding its use of public open-source code; questions have even been raised about potential violations of licence attribution requirements and usage restrictions. These concerns have now materialised into lawsuits.

In November 2022, programmer and lawyer Matthew Butterick filed a class-action lawsuit in San Francisco against Microsoft, its subsidiary GitHub, and their partner OpenAI, in which Microsoft owns a substantial stake. In the ongoing case, the plaintiffs claim that the companies trained GitHub Copilot with code from GitHub repositories without complying with open-source licensing terms. They further claim that GitHub Copilot unlawfully reproduces their code by generating code for end-users that is nearly identical to code from GitHub repositories, without crediting the original open-source authors as the licensing terms require. The lawsuit references 11 different open-source licences, including the MIT, GPL, and Apache licences, all of which mandate attribution of the author’s name and recognition of specific copyrights.

Microsoft and OpenAI asked the court to dismiss the case, arguing that the plaintiffs had failed to show that they suffered specific injuries from the companies’ alleged actions. The companies also pointed out that the plaintiffs did not identify the copyrighted works they allegedly misused or the contracts they breached. Microsoft further commented in its filing that the copyright allegations would run into the doctrine of fair use, which allows the unlicensed use of copyrighted works in certain special situations. The companies also cited a 2021 U.S. Supreme Court decision holding that Google’s use of Oracle source code to build its Android operating system was transformative and fair use.

The judge denied the defence’s request to dismiss the plaintiffs’ claim that Codex’s ability to reproduce code represents a breach of software licensing terms. The judge also rejected the request to dismiss the plaintiffs’ claim that GitHub Copilot and Codex reproduce copyrighted code without the required copyright management information, referring to Section 1202(b) of the Digital Millennium Copyright Act, which prohibits the intentional removal or alteration of copyright management information, as well as the distribution of works, or copies of works, with altered or removed copyright information. The case will therefore continue, at least in these respects.

Key takeaways and recommendations for AI-driven software development

The ongoing case is one of three major class-action lawsuits against Microsoft and OpenAI; the other two were filed in September 2023. The first accuses the companies of breaking several privacy laws by misusing personal data from hundreds of millions of internet users, harvested from social media platforms and other sites, to train AI. In the second case, a group of U.S. authors, including Pulitzer Prize winner Michael Chabon, accuse OpenAI of misusing their writing by copying their works without permission to train ChatGPT to respond to human text prompts.

According to OpenAI’s webpage, its large language models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information licensed from third parties, and (3) information provided by users or human trainers.

The internet contains a vast amount of information and material that is publicly available. However, this does not always mean the material can be used freely. Microsoft and OpenAI have argued that AI training makes fair use of copyrighted material scraped from the internet. The fair use doctrine allows copyright to be limited under certain conditions. Nevertheless, there is still no established legal consensus on this matter in the context of artificial intelligence. Microsoft and OpenAI are far from alone in scraping copyrighted material from the internet to train AI systems; many AI tools are created the same way. If the court decisions favour the plaintiffs, AI developers will have to re-evaluate how they train AI and the material used in training, and critically examine the output their AI tools produce.

In the EU, the class-action case concerning GitHub Copilot will not have immediate legal consequences. However, the lawsuit can be expected to encourage rights holders to take similar action in Europe. There is already a pending case in which the image service Getty Images has sued Stability AI for alleged copyright infringement.

The ongoing cases also raise questions about the output of AI tools and may have a significant impact on companies that utilise AI tools in programming or in generating assets or other material.

The GitHub Copilot case highlights the importance of paying attention to how AI tools are used in a software development process. For companies incorporating AI into software development processes, the following conclusions and guidance can be derived from the case to secure a company’s intellectual property rights and avoid infringing upon others’ rights:

Understand open-source licensing terms

Companies must have a clear understanding of open-source licences. Licences such as MIT, GPL, and Apache have specific requirements relating to attribution, distribution, and usage. Failure to comply with these terms can lead to copyright infringement.

Compliance and monitoring

Companies should regularly monitor the code generated by AI tools and ensure it complies with open-source licences.

Companies should also establish internal protocols and safeguards to track compliance throughout the software development process.
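One concrete safeguard of this kind is an automated check that flags source files lacking a licence or attribution notice. The sketch below is a minimal illustration, assuming an internal convention where reused open-source files must carry a recognisable notice near the top; the marker strings and file layout are assumptions, not a definitive implementation:

```python
"""Minimal sketch of an internal licence-compliance check.

Assumes a convention where every source file containing reused
open-source code carries a licence marker near the top of the file.
"""
from pathlib import Path

# Illustrative markers: an SPDX identifier or the opening of the MIT grant.
LICENCE_MARKERS = ("SPDX-License-Identifier:", "Permission is hereby granted")


def has_licence_notice(source: str) -> bool:
    """Return True if the file text contains a recognised licence marker."""
    header = source[:2000]  # notices conventionally sit near the top
    return any(marker in header for marker in LICENCE_MARKERS)


def files_missing_notice(root: Path) -> list[Path]:
    """List .py files under *root* whose text lacks a licence marker."""
    return [
        path
        for path in root.rglob("*.py")
        if not has_licence_notice(path.read_text(encoding="utf-8", errors="ignore"))
    ]
```

A check like this could run in continuous integration so that code accepted from an AI assistant without attribution is surfaced for legal review before release; it is a screening aid, not a substitute for assessing what the licences actually require.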

Transparency and documentation

It is important to keep good records of AI usage. In a due diligence process when a company is being sold, there will undoubtedly be increasing interest in whether the target company has used AI in software development or created material using AI.

Read more about our legal services related to technology and IT.

Contact us


Tuomas Pelkonen
Senior Associate, Helsinki +358 40 846 8107
