Judge Orders OpenAI to Give Lawyers 20M Private Chats, Because 'Anonymization' Supposedly Solves Everything


from the seems-like-a-problem dept

A federal magistrate judge just ordered that the private ChatGPT conversations of 20 million users be handed over to the lawyers for dozens of plaintiffs, including news organizations. Those 20 million people weren’t asked. They weren’t notified. They have no say in the matter.

Last week, Magistrate Judge Ona Wang ordered OpenAI to turn over a sample of 20 million chat logs as part of the sprawling multidistrict litigation where publishers are suing AI companies—a mess of consolidated cases that kicked off with the NY Times’ lawsuit against OpenAI. Judge Wang dismissed OpenAI’s privacy concerns, apparently convinced that “anonymization” solves everything.

Even if you hate OpenAI and everything it stands for, and hope that the news orgs bring it to its knees, this should scare you. A lot. OpenAI had pointed out to the judge a week earlier that these demands from the news orgs would represent a massive privacy violation for ChatGPT’s users.

News Plaintiffs demand that OpenAI hand over the entire 20M log sample “in readily searchable format” via a “hard drive or [] dedicated private cloud.” ECF 656 at 3. That would include logs that are neither relevant nor responsive—indeed, News Plaintiffs concede that at least 99.99% of the logs are irrelevant to their claims. OpenAI has never agreed to such a process, which is wildly disproportionate to the needs of the case and exposes private user chats for no reasonable litigation purpose. In a display of striking hypocrisy, News Plaintiffs disregard those users’ privacy interests while claiming that their own chat logs are immune from production because “it is possible” that their employees “entered sensitive information into their prompts.” ECF 475 at 4. Unlike News Plaintiffs, OpenAI’s users have no stake in this case and no opportunity to defend their information from disclosure. It makes no sense to order OpenAI to hand over millions of irrelevant and private conversation logs belonging to those absent third parties while allowing News Plaintiffs to shield their own logs from disclosure.

OpenAI offered a much more privacy-protective alternative: hand over only a targeted set of logs actually relevant to the case, rather than dumping 20 million records wholesale. The news orgs fought back, but their reply brief is sealed—so we don’t get to see their argument. The judge bought it anyway, dismissing the privacy concerns on the theory that OpenAI can simply “anonymize” the chat logs:

Whether or not the parties had reached agreement to produce the 20 million Consumer ChatGPT Logs in whole—which the parties vehemently dispute—such production here is appropriate. OpenAI has failed to explain how its consumers’ privacy rights are not adequately protected by: (1) the existing protective order in this multidistrict litigation or (2) OpenAI’s exhaustive de-identification of all of the 20 million Consumer ChatGPT Logs.

The judge then quotes the news orgs’ filing, noting that OpenAI has already put in this effort to “deidentify” the chat logs.

Both of those supposed protections—the protective order and “exhaustive de-identification”—are nonsense. Let’s start with the anonymization problem, because it shows a stunning lack of understanding about what it means to anonymize data sets, especially AI chatlogs.

We’ve spent years warning people that “anonymized data” is a gibberish term, used by companies to pretend large collections of data can be kept private, when that’s just not true. Almost any large dataset of “anonymized” data can have significant portions of the data connected back to individuals with just a little work. Researchers re-identified individuals from “anonymized” AOL search queries, from NYC taxi records, from Netflix viewing histories—the list goes on. Every time someone shows up with an “anonymized” dataset, researchers show ways to re-identify people in the dataset.
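To make the mechanics concrete, here is a minimal sketch (in Python, with entirely invented data and column names) of the kind of “linkage attack” those researchers used: strip out the names, and a simple join on the leftover quasi-identifiers against some public dataset puts them right back.

```python
# Minimal sketch of a linkage attack on a hypothetical "anonymized" dataset.
# All data and column names are invented for illustration; the point is that
# removing names does nothing when quasi-identifiers (zip code, birth year,
# search/viewing history, etc.) can be joined against auxiliary public data.
import pandas as pd

# "Anonymized" release: user IDs replaced with random tokens, names removed.
anonymized = pd.DataFrame({
    "token": ["a91", "b22", "c37"],
    "zip": ["10027", "98101", "60614"],
    "birth_year": [1985, 1992, 1978],
    "query": ["divorce lawyer near me", "...", "..."],
})

# Auxiliary data an attacker already has (voter rolls, social media, etc.).
auxiliary = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "zip": ["10027", "98101"],
    "birth_year": [1985, 1992],
})

# A plain join on the quasi-identifiers re-attaches real names to "anonymous" rows.
reidentified = anonymized.merge(auxiliary, on=["zip", "birth_year"])
print(reidentified[["name", "query"]])
```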

And that’s even worse when it comes to ChatGPT chat logs, which are likely to be way more revealing than the earlier datasets where the failure of “anonymization” was called out. There have been plenty of reports of just how much people “overshare” with ChatGPT, often including incredibly private information.

Back in August, researchers got their hands on just 1,000 leaked ChatGPT conversations and talked about how much sensitive information they were able to glean from just that small number of chats.

Researchers downloaded and analyzed 1,000 of the leaked conversations, spanning over 43 million words. Among them, they discovered multiple chats that explicitly mentioned personally identifiable information (PII), such as full names, addresses, and ID numbers.

With that level of PII and sensitive information, connecting chats back to individuals is likely way easier than in previous cases of connecting “anonymized” data back to individuals.

And that was with just 1,000 records.

Then, yesterday as I was writing this, the Washington Post reported that it had combed through 47,000 ChatGPT chat logs, many of which had been “accidentally” made public via ChatGPT’s “share” feature. Many of them reveal deeply personal and intimate information.

Users often shared highly personal information with ChatGPT in the conversations analyzed by The Post, including details generally not typed into conventional search engines.

People sent ChatGPT more than 550 unique email addresses and 76 phone numbers in the conversations. Some are public, but others appear to be private, like those one user shared for administrators at a religious school in Minnesota.

Users asking the chatbot to draft letters or lawsuits on workplace or family disputes sent the chatbot detailed private information about the incidents.
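
For a sense of how that kind of tally gets produced, here is a rough sketch of a surface-level PII scan over chat text. The regexes are deliberately simplified, the sample chats are invented, and this is not the Post’s actual methodology, which it doesn’t describe at this level of detail.

```python
# Rough sketch of a surface-level PII scan over chat text.
# Simplified regexes and invented sample chats, purely for illustration.
import re

chats = [
    "Please email the principal at j.doe@example-school.org about my son.",
    "My number is 612-555-0147, call me after 5pm about the dispute.",
]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

emails = {m for chat in chats for m in EMAIL_RE.findall(chat)}
phones = {m for chat in chats for m in PHONE_RE.findall(chat)}

print(f"{len(emails)} unique email addresses, {len(phones)} phone numbers")
```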

There are examples where, even if the user’s official details are redacted, it would be trivial to figure out who was actually doing the chats:

The chat, as redacted by the Washington Post, reads:

User
my name is [name redacted] my husband name [name redacted] is threatning me to kill and not taking my responsibities and trying to go abroad […] he is not caring us and he is going to kuwait and he will give me divorce from abroad please i want to complaint to higher authgorities and immigrition office to stop him to go abroad and i want justice please help


ChatGPT
Below is a formal draft complaint you can submit to the Deputy Commissioner of Police in [redacted] addressing your concerns and seeking immediate action:

It seems that even if you “anonymized” that chat by stripping off the user account details, it wouldn’t take long to figure out whose chat it was, revealing some pretty personal info, including the names of their children (according to the Post).

And WaPo reporters found all of that by starting with 93,000 chats, using tools to analyze the 47,000 that were in English, and then doing human review of just 500 chats in a “random sample.”

Now imagine 20 million records. With many, many times more data, the ability to cross-reference information across chats, identify patterns, and connect seemingly disconnected pieces of information becomes exponentially easier. This isn’t just “more of the same”—it’s a qualitatively different threat level.
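
As a toy illustration of what “cross-referencing across chats” can look like, consider clustering chats that mention the same rare detail (a school, an employer, a street). The data and the extraction step below are invented, and real entity extraction is far more sophisticated, but the scaling problem is the same: every additional chat that shares a detail narrows down who the person is.

```python
# Toy illustration of cross-chat linking: even with account info stripped,
# chats that mention the same rare detail can be clustered together.
# Data and the entity list are invented for illustration.
from collections import defaultdict

chats = [
    {"id": 1, "text": "Draft a complaint about the administrator at St. Hypothetical School."},
    {"id": 2, "text": "My son attends St. Hypothetical School and I live on Maple Street."},
    {"id": 3, "text": "Write a resignation letter for my job at Example Corp."},
]

def extract_keys(text):
    # Stand-in for real entity extraction (names, schools, employers, addresses).
    known_entities = ["St. Hypothetical School", "Example Corp", "Maple Street"]
    return [e for e in known_entities if e in text]

clusters = defaultdict(list)
for chat in chats:
    for key in extract_keys(chat["text"]):
        clusters[key].append(chat["id"])

# Chats 1 and 2 link via the school name; together they leak far more than either alone.
print(dict(clusters))
```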

Even worse, the judge’s order contains a fundamental contradiction: she demands that OpenAI share these chatlogs “in whole” while simultaneously insisting they undergo “exhaustive de-identification.” Those two requirements are incompatible.

Real de-identification would require stripping far more than just usernames and account info—it would mean redacting or altering the actual content of the chats, because that content is often what makes re-identification possible. But if you’re redacting content to protect privacy, you’re no longer handing over the logs “in whole.” You can’t have both. The judge doesn’t grapple with this contradiction at all.
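
Here is a small sketch of that tension, using an invented log format: stripping account metadata leaves identifying content untouched, while actually redacting the content means what gets produced is no longer the log “in whole.”

```python
# Sketch of why "exhaustive de-identification" and producing logs "in whole"
# pull in opposite directions. Log structure and redaction rule are invented.
import re

log = {
    "user_id": "u_12345",
    "email": "user@example.com",
    "text": "My name is Jane Roe, my husband works at Example Corp in Duluth...",
}

def strip_account_fields(entry):
    # Weak "anonymization": drop account metadata, keep the content intact.
    return {"text": entry["text"]}

def redact_content(entry):
    # Real de-identification has to touch the content itself, which means the
    # produced log is no longer the log "in whole."
    text = re.sub(r"My name is [A-Z][a-z]+ [A-Z][a-z]+", "My name is [REDACTED]", entry["text"])
    return {"text": text}

print(strip_account_fields(log)["text"])  # still names Jane Roe, her husband's employer, the city
print(redact_content(log)["text"])        # content altered, so no longer "in whole"
```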

Yes, as the judge notes, this data is kept under the protective order in the case, meaning that it shouldn’t be disclosed. But protective orders are only as strong as the people bound by them, and there’s a huge risk here.

Looking at the docket, there are a ton of lawyers who will have access to these files. The docket list of parties and lawyers is 45 pages long if you try to print it out. While there are plenty of repeats in there, there have to be at least 100 lawyers and possibly a lot more (I’m not going to count them, and while I asked three different AI tools to count them, each gave me a different answer).

That’s a lot of people—many representing entities directly hostile to OpenAI—who all need to keep 20 million private conversations secret.

That’s not even getting into the fact that handling 20 million chat logs is a difficult task to do well. I am quite sure that among all the plaintiffs and all the lawyers, even with the very best of intentions, there’s still a decent chance that some of the content could leak (and it could, in theory, leak to some of the media properties who are plaintiffs in the case).

And, as OpenAI properly points out, its users whose data is at risk here have no say in any of this. They likely have no idea that a ton of people may be about to get an intimate look at what they thought were their private ChatGPT chats.

On Wednesday morning, OpenAI asked the judge to reconsider, warning of the very real potential harms:

OpenAI is unaware of any court ordering wholesale production of personal information at this scale. This sets a dangerous precedent: it suggests that anyone who files a lawsuit against an AI company can demand production of tens of millions of conversations without first narrowing for relevance. This is not how discovery works in other cases: courts do not allow plaintiffs suing Google to dig through the private emails of tens of millions of Gmail users irrespective of their relevance. And it is not how discovery should work for generative AI tools either.

The judge had cited a ruling in one of Anthropic’s cases, but hadn’t given OpenAI a chance to explain why the ruling in that case didn’t apply here (in that one, Anthropic had agreed to hand over the logs as part of negotiations with the plaintiffs, and OpenAI gets in a little dig at its competitor, pointing out that it appears Anthropic made no effort to protect the privacy of its users in that case).

There have, as Daphne Keller regularly points out, always been challenges between user privacy and platform transparency. But this goes well beyond that familiar tension. We’re not talking about “platform transparency” in the traditional sense—publishing aggregated statistics or clarifying moderation policies. This is 20 million complete chatlogs, handed over “in whole” to dozens of adversarial parties and their lawyers. The potential damage to the privacy rights of those users could be massive.

And the judge just waves it all away.

Companies: ny times, openai
