If you’re involved with an open source community, chances are high that you have heard of open source licenses. These licenses define what others may or may not do with an open source project’s code: how they may use, modify, or distribute it. There are a variety of open source licenses, each with different terms and different degrees of flexibility. One of the most general and popular concepts in open source licensing is the copyleft license. Generally, copyleft licenses allow others to use, modify, and distribute a work on the condition that any derived works are released under the same terms.
In recent years, GitHub has developed Copilot, an artificial intelligence tool billed as an “AI pair programmer” that generates code and coding suggestions based on a user’s natural language input. The goal of Copilot is to make coding easier for the user and save them time reading documentation. According to this article by the Software Freedom Conservancy, Copilot was trained on “billions of lines of public code … written by others” from numerous GitHub repositories. A controversy has arisen because GitHub has refused to release a list of the repositories used in the training set, and it has been confirmed that copylefted code appears in that set. Consequently, many people question the legality of GitHub’s actions and believe that, since copylefted code was used in training, Copilot should itself be an open source, copylefted project. I disagree: I believe GitHub’s Copilot is not breaking any legal rules.
There is ambiguity in almost all open source licensing, and the term “fair use” in particular leaves a lot of room for interpretation. According to Cornell Law School, a fair use analysis weighs four key factors: (1) the purpose of the use, (2) the nature of the original work, (3) the amount taken from the original work, and (4) the effect on the market for the original work. Beyond the CEO of GitHub declaring that “training machine learning systems on public data is fair use,” it can be argued that Copilot is protected under each of these four factors. Regarding the purpose of the use, Copilot can be viewed as a “transformative” work derived from the copyleft code it was trained on. “Transformative” also has room for interpretation, but Copilot’s purpose and structure are vastly different from most of the code it was trained on, so it can reasonably be argued to be transformative. Regarding the nature of the original work, the code on GitHub is more factual than fictitious, which favors fair use. Since GitHub has not disclosed exactly which lines of code from which repositories were used, the third factor, the amount taken from the original work, is harder to assess. However, because Copilot generates new code with machine learning rather than reproducing its training data, it rarely emits copylefted code verbatim, which suggests that large snippets of copyleft code are not being copied. Finally, Copilot has little market effect on the copylefted code itself: anyone can still use that code, since it remains available under its copyleft license.
To think about Copilot’s structure and purpose more generally, consider a piece of collage art.
According to this article from Wasted Talent Inc., it is legal to sell collage art as long as it is transformative and can be viewed as substantially different from the original pieces in the collage. Or consider a research paper: research papers draw information from an abundance of sources, but they also incorporate the author’s own ideas. The copyleft code used to train Copilot can be thought of as the pieces of a collage or the sources of a research paper. Copilot uses them to create a final product, and that product is notably different from any of the code it draws from.
Finally, from a computer programmer’s perspective, consider everyone’s favorite website: stackoverflow.com. According to this article from Ictrecht, small code snippets on Stack Overflow that any two programmers could independently come up with, because of their simplicity, are not copyright protected. It can be argued that most of the code snippets Copilot draws on are simple enough that anyone could have written them, and thus there should be no legal issue with using them to train Copilot.
In all, the Copilot controversy demonstrates how difficult the legal ambiguity surrounding AI programs can be to work out. Perhaps in the future, a new license that specifically accounts for the nature of AI will be introduced and popularized, clarifying some of this ambiguity. For now, it appears that we will have to assess situations like this on a case-by-case basis.