Don't · Privacy & GDPR

Assuming "publicly available" means free to scrape and train on

Visible and permitted are not the same word. People keep paying lawyers to learn the difference.

By Cynical Sally · Issue Nº 1

Not legal advice. Sally roasts behaviour and use-cases in general, never your specific situation, and nothing here replaces a real lawyer. The cases are real; what you do about them is between you and someone licensed to tell you.

The use-case

Mass-scraping public web data, including personal information, to build a training set on the theory that public equals fair game.

This actually happened · A real case, in full
The receipt · Ongoing / pending

Clarkson Law Firm class actions v. OpenAI / Google

No. 3:23-cv-03199 (N.D. Cal., 2023) · US (N.D. California)

What happened

Class actions alleged that scraping billions of words and private user data to train models without consent violated privacy rights.

The outcome

The suits ran into procedural trouble and were narrowed or withdrawn, so there is no merits win here. They matter as the first major outing of the "scraping equals privacy violation" theory. Cite them as theory-setters, not as wins.

Why

Sweeping class actions advanced the theory that scraping billions of words and personal details to train models, without consent, implicates privacy and data rights. Some of those particular suits faltered on procedure, so treat them as theory-setters rather than wins. But the theory itself is now firmly on the table.

Public visibility is about access, not permission. A photo, a post, or a profile being reachable says nothing about whether you may ingest it into a commercial model. That second question has its own rules, and they are tightening.

“You called it publicly available like that settled it. The plaintiffs called it their personal data, and they brought lawyers.”

What to do instead
  • 01 Map where your training data actually comes from and whether you have rights to use it that way.
  • 02 Prefer licensed, consented, or clearly permitted sources over "we scraped what we could reach".
  • 03 Assume personal data in a public place still carries obligations. It usually does.
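
For teams that want to start on the first step, the three points above can be sketched as a small provenance check. This is a minimal illustration, assuming you keep a hand-maintained manifest of training sources. The `Source` fields, the `PERMITTED_BASES` labels, and the `audit` function are all hypothetical names invented here, not a legal taxonomy or any real library's API. The sketch only flags sources for a human (and a lawyer) to review; it does not decide whether a use is lawful.

```python
from dataclasses import dataclass

# Hypothetical labels for how a source was obtained (assumption, not a legal standard).
PERMITTED_BASES = {"licensed", "consented", "clearly_permitted"}


@dataclass
class Source:
    name: str
    basis: str               # how the data was obtained, e.g. "licensed" or "scraped"
    has_personal_data: bool  # True if the source is known to contain personal data


def audit(sources):
    """Return (source name, reason) pairs that need human review."""
    flagged = []
    for s in sources:
        # Points 01 and 02: is there a recorded right to use this source?
        if s.basis not in PERMITTED_BASES:
            flagged.append((s.name, f"no recorded right to use: basis={s.basis!r}"))
        # Point 03: personal data is flagged even when the basis looks fine.
        if s.has_personal_data:
            flagged.append((s.name, "contains personal data: obligations likely apply"))
    return flagged


# Made-up example manifest, for illustration only.
sources = [
    Source("vendor_corpus", "licensed", False),
    Source("forum_dump", "scraped", True),
    Source("user_uploads", "consented", True),
]

for name, reason in audit(sources):
    print(f"{name}: {reason}")
```

Note that `user_uploads` is still flagged despite its "consented" label. That is deliberate: a permitted basis and the presence of personal data are separate questions, which is the point of step 03.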

Not legal advice. General commentary on a use-case, not your situation. Talk to a real lawyer before you act.