Using pirated material as training data
How you got the books matters as much as what you did with them. A pirate library is not a data pipeline.
Not legal advice. Sally roasts behaviour and use-cases in general, never your specific situation, and nothing here replaces a real lawyer. The cases are real; what you do about them is between you and someone licensed to tell you.
Sourcing training data from pirated copies of books or other works, even if the training method itself might be defensible.
Bartz v. Anthropic
No. 3:24-cv-05417 (N.D. Cal.), settlement 2025 · US (N.D. California)
Authors sued over the use of their books for training, including pirated copies. The court ruled that training on legally acquired books could be fair use, but that using pirated copies was not.
Anthropic settled for roughly $1.5 billion, the largest US copyright settlement of its kind. The lesson is that how you acquire training data is decisive.
A court found that training on legally acquired books could be fair use, but that using pirated copies was not. The company then settled for a sum reported at around 1.5 billion dollars, the largest copyright settlement of its kind. The training method was not the fatal flaw. The supply chain was.
This is the part teams skip when they are moving fast: the provenance of the data. A defensible technique built on an indefensible source collapses the moment someone asks where the files came from.
“The method might have survived. The torrent did not, and it took 1.5 billion dollars down with it.”
- 01 Acquire training data legitimately. Provenance is now a first-class legal question, not a footnote.
- 02 Keep records of where every dataset came from and what rights you have to it.
- 03 Treat a pirated source as disqualifying, even if the rest of your pipeline is clean.
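One way to make the three points above concrete is a provenance record that travels with every dataset, plus a gate that refuses anything without one. The sketch below is illustrative only: the DatasetRecord fields, the ALLOWED_ACQUISITIONS categories, and the admit function are all hypothetical names, and which acquisition routes actually count as legitimate is a question for a lawyer, not for this code.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical categories for illustration; the real list should come from counsel.
ALLOWED_ACQUISITIONS = {"purchased", "licensed", "public_domain", "own_work"}


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str        # where the files came from (vendor, URL, archive)
    acquisition: str   # how they were obtained
    rights_basis: str  # licence, contract, or other claimed basis for use
    acquired_on: date


def provenance_problems(record: DatasetRecord) -> list[str]:
    """Return the reasons a record fails the gate; an empty list means it passes."""
    problems = []
    if record.acquisition not in ALLOWED_ACQUISITIONS:
        problems.append(f"acquisition '{record.acquisition}' is not on the allowed list")
    if not record.source.strip():
        problems.append("source is not recorded")
    if not record.rights_basis.strip():
        problems.append("rights basis is not recorded")
    return problems


def admit(records: list[DatasetRecord]):
    """Split records into admitted datasets and rejected ones with their reasons."""
    admitted, rejected = [], {}
    for record in records:
        problems = provenance_problems(record)
        if problems:
            rejected[record.name] = problems
        else:
            admitted.append(record)
    return admitted, rejected


if __name__ == "__main__":
    records = [
        DatasetRecord("novels-a", "publisher contract", "licensed",
                      "training licence on file", date(2025, 1, 10)),
        DatasetRecord("novels-b", "shadow library mirror", "torrent",
                      "", date(2025, 2, 3)),
    ]
    admitted, rejected = admit(records)
    print([r.name for r in admitted])
    print(rejected)
```

The gate is deliberately blunt: a single bad source field rejects the dataset outright, no matter how clean the rest of the pipeline is, which mirrors the third point in the list.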
Not legal advice. General commentary on a use-case, not your situation. Talk to a real lawyer before you act.