The rise of generative AI has sparked excitement and creativity, but it also brings significant legal questions to the forefront. OpenAI, the company behind ChatGPT, finds itself at the center of a high-stakes battle over copyright infringement. These lawsuits aren't just about legal technicalities - they're about the legality, and the future, of these technologies.
Death of a Whistleblower
The drama has been intensified by the tragic story of Suchir Balaji, a 26-year-old former OpenAI researcher who raised serious concerns about the company's data practices and was named in a high-profile lawsuit before his apparent suicide this past November.
Suchir Balaji alleged that OpenAI violated U.S. copyright law by training its AI models, including ChatGPT, on copyrighted materials without proper authorization. He argued that this practice was both unethical and harmful to content creators, and that it undermined the integrity of the internet ecosystem.
His allegations have become a catalyst for a broader conversation about how AI companies use copyrighted material to train their models.
At the heart of this legal fight is a simple but profound question: Are AI companies allowed to use people’s creations as training data without their permission? Or does copyright law stand in the way?
While we wait for a verdict (maybe coming in 2025?), let’s explore some of the legal complexities of OpenAI's copyright challenges, in order to understand what's at stake for creators, tech companies, and everyone who uses AI.
Current Legal Challenges
OpenAI is facing major legal challenges over copyright infringement.
While lawsuits are happening in several countries, the United States is at the center of this legal battle. This is because the U.S. has a powerful influence on global copyright laws, thanks to its strong entertainment and tech industries, well-developed legal system, and role in international treaties.
Many big names, including authors and news organizations, are suing OpenAI in U.S. courts.1 They claim OpenAI used their work without permission to train ChatGPT. OpenAI argues this falls under "fair use," but the outcome is still uncertain. These lawsuits could significantly impact how AI is developed and how copyright laws are applied to new technologies in the future.
Other AI Companies
While OpenAI has been at the center of many high-profile lawsuits, several other AI companies are also facing legal challenges over copyright infringement:
Google is battling multiple lawsuits. Visual artists sued the company in April 2024, alleging that its AI-powered image generator, Imagen, was trained on their copyrighted content without authorization. Additionally, Google faces a trademark infringement lawsuit over its "Gemini" AI system.
Anthropic, an AI startup founded by ex-OpenAI employees, is facing lawsuits from both music publishers and authors. In June 2024, music publishers sued Anthropic for copyright infringement related to its AI chatbot Claude. In August 2024, three authors filed a federal class action lawsuit against Anthropic, accusing the company of using pirated books to train its language models without permission or compensation.
Other AI companies, including Stability AI, Midjourney, DeviantArt, and Runway AI, are defending a lawsuit in which a U.S. District Judge allowed artists' claims to proceed, finding it plausible that these companies violated artists' rights by storing their works in image generation systems.
These lawsuits highlight the industry-wide nature of the copyright challenges facing AI companies. But for now, let’s focus on OpenAI.
What’s the Issue?
The core issue surrounding OpenAI's data usage lies in its extensive practice of scraping publicly available data, including copyrighted material, to train its AI models without obtaining explicit permission from copyright holders. While OpenAI claims to rely on "publicly available" data, much of this content remains protected by copyright, making its commercial use illegal without consent.
Critics argue that OpenAI is engaging in unauthorized scraping from various sources, raising serious concerns about copyright infringement and the lack of compensation for creators. The sheer volume of data used - reportedly hundreds of billions of words - complicates the application of fair use, which only allows limited use of copyrighted material for specific purposes like research or commentary.
OpenAI has been dishonest about the extent of its unauthorized data scraping practices, refusing to disclose how much copyrighted material it utilizes without proper authorization. Moreover, the company maintains an unrealistic stance on the legality of its actions, insisting that it operates within fair use while facing numerous lawsuits alleging otherwise. This gaslighting undermines the seriousness of the legal challenges it faces and raises ethical questions about its responsibility to content creators. By claiming legitimacy amid mounting legal scrutiny, OpenAI appears to evade accountability for its actions, further complicating the dialogue around copyright in the age of AI.
Copyright Law, What’s That?
Copyright law is a set of rules that protects the rights of creators over their original works, such as books, music, films, and art. It gives authors and artists exclusive rights to reproduce, distribute, and display their creations; others may not do so without the creator's permission.
However, there are exceptions. Certain uses of copyrighted material may fall under "fair use", allowing limited use without permission for purposes like education, commentary, or research. OpenAI claims its mass usage of copyrighted material falls under fair use.
Fair Use
Fair use is a legal doctrine in U.S. copyright law that allows limited use of copyrighted material without permission from the copyright holder. It's designed to balance the interests of copyright owners with the public interest in using creative works for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
The doctrine is based on four factors that courts consider when determining if a use is fair:
The purpose and character of the use
The nature of the copyrighted work
The amount and substantiality of the portion used
The effect of the use on the potential market for the original work
OpenAI argues its use is transformative and doesn't compete with original works. Critics claim it infringes copyright and threatens their business models. The scale of data used and the commercial nature of AI products complicate the application of fair use. This case could reshape copyright law for the AI era, with significant implications for AI development and content creation industries.
Copyright is Global, Fair Use is Not
One more complication: Copyright law is not fully standardized across the world, although there has been significant international harmonization over time. This process began with the Berne Convention of 1886 and has evolved ever since. Still, significant differences remain between jurisdictions, particularly regarding copyright duration, registration requirements, and exceptions (including fair use).
Fair use, as claimed by OpenAI, is not a globally recognized defense. It is primarily a U.S. doctrine allowing limited use of copyrighted material without permission for specific purposes. While a few countries, like the Philippines and Israel, have adopted similar concepts, most jurisdictions, including the European Union, Japan, and China, do not have fair use provisions. Instead, they employ more narrowly defined exceptions, which can pose greater legal risks for OpenAI in those regions. This lack of uniformity means that OpenAI may face more significant challenges in countries without fair use protections.
Selected Cases
Throughout history, technological innovations have run into legal issues, some of them copyright related. Let’s look at a few relevant examples:
Google Books
The Google Books case provides crucial context for OpenAI's legal challenges. In 2015, the U.S. Court of Appeals for the Second Circuit ruled that Google's digital copying of millions of books for its search and snippet view functions was fair use under copyright law.
The court found Google's use highly transformative, creating a new search functionality without replacing original works. It deemed the project beneficial to the public by enhancing book discoverability and sales, while not significantly impacting authors' potential market or revenue.
Key factors in the ruling:
Google's use was transformative, serving a different purpose from the original works.
The project created a searchable index, not a substitute for books.
Only limited "snippets" were displayed, not entire works.
The court found potential benefits to authors through increased book visibility.
As a result, Google was allowed to continue its Books project without the authors' permission: its fair use defense succeeded.
However, OpenAI's case differs in important ways:
OpenAI's use may be seen as more commercial and less transformative.
The scale of data used is significantly larger.
OpenAI's output can potentially substitute for original works in some cases.
Viacom/YouTube
Viacom, a media company that holds the rights to many television productions, sued YouTube in 2007 for $1 billion, claiming the platform facilitated widespread copyright infringement by allowing users to upload unauthorized clips of Viacom's content. Viacom alleged that over 150,000 clips, including popular television shows, had been viewed billions of times on YouTube.
The primary legal issue revolved around whether YouTube had actual knowledge of specific infringements and whether it was liable for its users’ infringements. The court ultimately ruled in favor of YouTube, stating that it did not have actual knowledge of specific infringements and was not willfully blind to them. The ruling emphasized that the law (the DMCA) protects service providers from liability as long as they act promptly to remove infringing content when notified, which YouTube did.
This case is relevant to OpenAI's situation as it highlights the complexities of copyright law in relation to user-generated content platforms. While YouTube successfully defended itself, OpenAI's use of copyrighted material for AI training may face different challenges, particularly regarding the scale and nature of its data usage.
Napster
Napster, launched in 1999 by Shawn Fanning and Sean Parker, was a revolutionary peer-to-peer file-sharing service that allowed users to easily download and share MP3 music files. It quickly gained popularity due to its user-friendly interface and vast library of songs, enabling access to music for free. This accessibility transformed how people consumed music, bypassing traditional purchasing methods like CDs and vinyl records.
Napster became a cultural phenomenon, attracting millions of users who loved the ability to discover rare tracks and share music with friends. However, its rapid growth led to significant legal challenges from the music industry, culminating in a lawsuit that ultimately shut down the service in 2001. Despite its short lifespan, Napster's impact on the music industry was profound, paving the way for future digital platforms and changing how music is distributed and consumed.
In the landmark case A&M Records, Inc. v. Napster, Inc., decided in 2001, Napster was found guilty of copyright infringement for facilitating the unauthorized sharing of music through its peer-to-peer file-sharing platform. The lawsuit, initiated by major record labels, claimed that Napster allowed users to download and share copyrighted songs without permission, leading to significant financial losses for the music industry.
The court ruled that Napster was liable for indirect (contributory and vicarious) copyright infringement. It determined that Napster knowingly encouraged infringement by providing a platform for users to share music files directly with one another. The court rejected Napster's fair use defense, stating that downloading and sharing music did not constitute transformative use and that users were engaged in commercial exploitation of copyrighted works.
Ultimately, Napster was ordered to cease operations and later filed for bankruptcy. This case serves as a counter-example to the Google Books ruling, illustrating how innovative digital platforms can face severe legal consequences when they facilitate widespread copyright infringement.
The Napster ruling does not bode well for OpenAI: it established a precedent for holding digital platforms liable for facilitating copyright infringement, showing that even innovative technologies can face severe legal consequences when they enable unauthorized use of copyrighted content.
What if OpenAI Loses? Will ChatGPT Shut Down?
If OpenAI loses its ongoing legal battles, particularly the high-profile lawsuit initiated by the Authors Guild, the implications could be significant for both the company and the future of ChatGPT. A ruling against OpenAI might not mean an outright shutdown of ChatGPT, but it could lead to substantial changes in how the platform operates.
At stake are potential damages that could amount to billions of dollars, along with the possibility of court-mandated restrictions on how OpenAI uses copyrighted material for training its models. This could force OpenAI to rethink its data acquisition strategies, leading to increased licensing costs and a more limited training dataset, which in turn could affect the performance and capabilities of ChatGPT.
Moreover, a loss could set a precedent that encourages other authors and copyright holders to pursue similar lawsuits against AI companies, creating a ripple effect across the industry. This environment could stifle innovation and lead to a more cautious approach to AI development. While OpenAI may adapt by seeking licensing agreements with content creators, the outcome of these legal challenges will shape the relationship between AI technologies and copyright law. As AI continues to integrate into various sectors, the stakes are high for both creators and developers in navigating this complex legal landscape.
The Road Ahead
As we look ahead to 2025, several key developments are expected in the ongoing legal battles involving OpenAI. The current lawsuits, primarily focused on copyright infringement claims by authors and publishers, are in critical phases that will shape the future of generative AI.
The outcomes of these cases will not only impact OpenAI but also set precedents for the entire AI industry regarding copyright law and fair use. Overall, 2025 promises to be a pivotal year for OpenAI and the broader landscape of generative AI, with significant legal challenges that could reshape the industry.
What are your expectations for these legal battles? Please comment below!

Acknowledgement
Thank you to the reader who pointed out the death of Suchir Balaji.
Several lawsuits have been filed against OpenAI and its partner Microsoft:
The US Center for Investigative Reporting has sued OpenAI and Microsoft, accusing them of violating copyright laws by using its content to train AI platforms without permission or compensation.
OpenAI is facing lawsuits from The New York Times, other media outlets, and bestselling authors such as John Grisham, Jodi Picoult, and George R.R. Martin.
A group of Canadian news media companies has sued OpenAI for copyright infringement, claiming that the company "scraped" large amounts of content from their sites without permission.
Eight U.S. newspaper publishers have filed a lawsuit against Microsoft and OpenAI, alleging that they unlawfully used millions of their articles to train AI models like ChatGPT without payment or permission.
The main issues raised in these lawsuits include:
Using copyrighted content without permission to train AI models
Reproducing substantial portions of copyrighted works verbatim
Removing copyright management information from scraped content
Potential violations of the Digital Millennium Copyright Act
Sources:
https://www.euronews.com/next/2024/06/28/chatgpt-maker-openai-and-microsoft-facing-legal-fight-over-exploitative-copyright-infringe
https://theconversation.com/canadian-news-media-are-suing-openai-for-copyright-infringement-but-will-they-win-245002
https://www.cnbc.com/2024/04/30/eight-newspaper-publishers-sue-openai-over-copyright-infringement.html
Excellent article and discussion. 👍🏽 I am hoping the parties on both sides can come up with a fair profit-sharing arrangement but that will take a lot of trust, which seems to be in short supply.
Canadian news outlets are currently suing OpenAI for copyright infringement in the amount of $20,000 per article. This could add up to billions in damages if they succeed.
When it comes to image copyrights, we work very closely with our legal dept. when we create some graphic that we use in a film production. If we wanted to make our own version of a real police badge, we have to prove that our version is significantly different. We have to show all of our work like a high school math problem. As creators, we take inspiration from other creators all the time. In fact, it's impossible to design in a vacuum. But we don't take inspiration from millions of sources.
I think this is the major shortfall of any legal challenge by artists. If Gen AI is stealing from everyone, how can a single artist or a small group of artists sue the Gen AI company? How can the courts determine which part of which artist's work was used in which part of a Gen AI creation? Even the AI experts don't know exactly how it works! It's an impossible situation where the loser is the artist.