Adobe Sued: Authors Claim AI Training Used Work Without Consent
The relentless march of artificial intelligence (AI) continues to reshape the technological landscape, and Adobe, like virtually every other major tech company, has been a prominent player in that evolution. Since 2023, Adobe has launched a suite of AI-powered services, most notably Firefly, its media-generation powerhouse. That aggressive adoption of AI has now landed the company in hot water: a new lawsuit alleges that Adobe used pirated books to train one of its AI models, raising serious questions about copyright and ethical AI development. This case and others like it are forcing the industry to confront the legal and moral implications of training AI on massive datasets.
The Lawsuit Against Adobe: A Deep Dive
The proposed class-action lawsuit centers on Elizabeth Lyon, an author from Oregon, who claims that Adobe’s SlimLM program was trained on unauthorized copies of her work alongside numerous other copyrighted books. Adobe describes SlimLM as a series of small language models designed for document-assistance tasks on mobile devices. The core of the dispute lies in the data used to pre-train SlimLM: SlimPajama-627B, an open-source dataset released by Cerebras in June 2023.
Understanding SlimPajama and its Origins
According to the lawsuit, the SlimPajama dataset isn’t a clean slate. It’s a derivative of the RedPajama dataset, which itself incorporates the controversial “Books3” collection. The lawsuit explicitly states: “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3). Thus, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and the Class members.”
This connection to Books3 is critical. Books3 is a massive compilation of 191,000 books that has become a focal point in the ongoing legal battles surrounding AI training data. Its use raises significant copyright concerns, as the books within the collection were allegedly obtained and distributed without the permission of the copyright holders.
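That lineage is, in principle, checkable from the outside, because SlimPajama-627B is publicly hosted. The sketch below streams a small sample of the dataset and tallies where each record came from; it assumes the Hugging Face datasets library and the record layout shown on the public cerebras/SlimPajama-627B dataset card (a text field plus a meta field carrying a redpajama_set_name provenance tag), and is an illustration rather than anyone’s actual audit.

```python
# Sketch: stream a sample of SlimPajama-627B and tally provenance tags.
# Assumes the record layout on the public dataset card: a "text" field
# plus a "meta" field whose "redpajama_set_name" names the upstream
# RedPajama source each document was drawn from.
from collections import Counter

from datasets import load_dataset

# Stream rather than download: the full corpus runs to hundreds of GB.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

sources = Counter()
for i, record in enumerate(ds):
    if i >= 1_000:  # small sample, purely for illustration
        break
    sources[record["meta"]["redpajama_set_name"]] += 1

for name, count in sources.most_common():
    print(f"{name}: {count}")
```

Records tagged as coming from RedPajama’s book subset are the ones the complaint traces back to Books3.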
A Growing Trend: AI and Copyright Litigation
Adobe isn’t alone in facing these accusations. The lawsuit against them is part of a broader wave of litigation targeting tech companies for allegedly using copyrighted material to train their AI models. This trend highlights the inherent challenges in building AI systems that require vast amounts of data.
Recent Cases Mirror Adobe's Situation
- Apple: In September, Apple was sued for allegedly using copyrighted material, including data from the RedPajama dataset, to train its Apple Intelligence model. The lawsuit accused Apple of copying protected works “without consent and without credit or compensation.”
- Salesforce: A similar lawsuit filed in October accused Salesforce of using RedPajama for training purposes, echoing the concerns raised against Adobe and Apple.
- Anthropic: In perhaps the most significant case to date, Anthropic agreed to a $1.5 billion settlement in September with a group of authors who claimed the company used pirated versions of their work to train its Claude chatbot. The settlement is widely considered a potential turning point in the legal landscape.
These cases demonstrate a clear pattern: AI algorithms are hungry for data, and that data often comes from sources with questionable copyright status. The sheer scale of these datasets makes it difficult to ensure that all materials are properly licensed or obtained legally.
The Books3 Dataset: A Legal Minefield
The Books3 dataset, at the heart of many of these lawsuits, represents a significant legal challenge. Assembled from a trove of pirated e-books that circulated on the shadow-library tracker Bibliotik, it was intended to provide a comprehensive, freely available resource for AI training. However, the legality of its creation and distribution has been fiercely contested.
The dataset’s origins are rooted in efforts to create an open-source alternative to proprietary datasets used by large tech companies. However, critics argue that the methods used to compile Books3 violated copyright laws, as many of the books were obtained through unauthorized means. The legal battles surrounding Books3 are likely to continue, setting precedents that will shape the future of AI development.
Why is This Happening? The Challenges of AI Training
The current situation stems from the fundamental requirements of modern AI, particularly Large Language Models (LLMs). These models require massive datasets – often measured in terabytes – to learn and perform effectively. The more data an AI model is trained on, the better it can understand and generate human-like text, images, and other content.
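That scale claim is easy to sanity-check with back-of-envelope arithmetic. Assuming roughly four bytes of UTF-8 English text per token (a common rule of thumb used here purely for illustration), SlimPajama-627B’s 627 billion tokens work out to a few terabytes of raw text:

```python
# Back-of-envelope size estimate for a 627-billion-token text corpus.
# The 4-bytes-per-token figure is an assumption for illustration only.
tokens = 627e9
bytes_per_token = 4

total_bytes = tokens * bytes_per_token
print(f"~{total_bytes / 1e12:.1f} TB of raw text")  # prints ~2.5 TB
```

Compression shrinks that footprint on disk, but the point stands: corpora at this scale are far too large to audit title by title, which is how contested collections like Books3 end up inside otherwise reputable datasets.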
However, acquiring and curating these datasets is a complex and expensive undertaking. Companies often turn to publicly available data sources, which may include copyrighted material, on the long-standing assumption that “fair use” principles permit copyrighted works to be used for transformative purposes such as AI training. Courts are increasingly scrutinizing that assumption, as the recent lawsuits show.
The Rise of Data Scraping and its Legal Implications
A common practice in AI training is data scraping – the automated extraction of data from websites and other online sources. While data scraping itself isn’t necessarily illegal, it can violate website terms of service and copyright laws if it involves copying protected content without permission. The legality of data scraping is a gray area, and it’s likely to be a key issue in future litigation.
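To make the mechanics concrete, here is a minimal scraping sketch using only Python’s standard library. It checks a site’s robots.txt before fetching, a widely observed courtesy; the URL and user-agent string are placeholders, and honoring robots.txt does not by itself settle the terms-of-service or copyright questions raised above.

```python
# Minimal sketch of a robots.txt-aware fetch using only the standard
# library. The user agent and URL below are illustrative placeholders.
from urllib import parse, request, robotparser


def fetch_if_allowed(url: str, user_agent: str = "example-research-bot") -> str | None:
    """Fetch a page only if the site's robots.txt permits this agent."""
    parts = parse.urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # download and parse the site's robots.txt

    if not rp.can_fetch(user_agent, url):
        return None  # disallowed; a polite crawler skips this page

    req = request.Request(url, headers={"User-Agent": user_agent})
    with request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = fetch_if_allowed("https://example.com/")
    print("fetched" if page is not None else "blocked by robots.txt")
```

Note that this only covers crawler etiquette: a scraper can be perfectly polite and still copy protected content, which is precisely the gray area the current litigation is probing.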
What Does This Mean for the Future of AI?
The lawsuits against Adobe, Apple, Salesforce, and Anthropic signal a significant shift in the legal landscape surrounding AI. These cases are forcing tech companies to re-evaluate their data sourcing practices and consider the potential legal risks associated with using copyrighted material. Here are some potential implications:
- Increased Scrutiny of Datasets: Tech companies will likely face increased scrutiny of the datasets they use to train their AI models. They may be required to demonstrate that they have obtained the necessary licenses or permissions to use copyrighted material.
- Shift Towards Licensed Data: There may be a shift towards using licensed datasets, even if they are more expensive. This could create a new market for data providers who can offer legally compliant datasets.
- Development of New AI Training Techniques: Researchers may explore new AI training techniques that require less data or rely on synthetic data generated without infringing on copyright.
- Greater Transparency: There may be increased pressure for tech companies to be more transparent about the data they use to train their AI models.
The legal battles over AI training data are far from over. As AI technology continues to evolve, we can expect to see more lawsuits and regulatory challenges. The outcome of these cases will have a profound impact on the future of AI development, shaping how AI models are built, deployed, and regulated. The industry needs to proactively address these concerns to ensure that AI is developed ethically and legally.
The case against Adobe serves as a stark reminder that the pursuit of AI innovation cannot come at the expense of copyright and intellectual property rights. As GearTech continues to monitor this evolving situation, it’s clear that the legal and ethical considerations surrounding AI training data will remain a critical focus for the foreseeable future.