Schneier on LLM vulnerabilities, agentic AI, and "trusting trust"

13 hours ago 1

Last month, I was having dinner with a group and someone at the table was excitedly sharing how they were using agentic AI to create and merge PRs for them, with some review but with a lot of trust and automation. I admitted that I could be comfortable with some limited uses for that, such as generating unit tests at scale, but not for bug fixes or other actual changes to production code; I’m a long way away from trusting an AI to act for me that freely. Call me a Luddite, or just a control freak, but I won’t commit non-test code unless I (or some expert I trust) have reviewed it in detail and fully understand it. (Even test code needs some review or safeguards, because it’s still code running in your development environment.)

My knee-jerk reaction against AI-generated PRs and merges puzzled the table, so I cited some of Bruce Schneier’s recent posts to explain why.

This week, after my Tuesday night PDXCPP user group talk, similar AI questions came up again in the Q&A.

Because I keep getting asked about this even though I’m not an AI or security expert, here are links to two of Schneier’s recent posts, because he is an expert and cites other experts… and then finally a link to Ken Thompson’s classic short “trusting trust” paper, for reasons Schneier explains.

Last month, Schneier linked to research on “Indirect Prompt Injection Attacks Against LLM Assistants.” The key observation he added, again, was this:

Prompt injection isn’t just a minor security problem we need to deal with. It’s a fundamental property of current LLM technology. The systems have no ability to separate trusted commands from untrusted data, and there are an infinite number of prompt injection attacks with no way to block them as a class. We need some new fundamental science of LLMs before we can solve this.

My layman’s understanding of the problem is this (actual AI experts, feel free to correct this paraphrase): A key ingredient that makes current LLMs so successful is that they treat all inputs uniformly. It’s fairly well known now that LLMs treat the system prompt and the user prompt the same, so they can’t tell when attackers poison the prompt. But LLMs also don’t distinguish when they inhale the world’s information via their training sets: LLM training treats high-quality papers and conspiracy theories and social media rants and fiction and malicious poisoned input the same, so they can’t tell when attackers try to poison the training data (such as by leaving malicious content around that they know will be scraped; see below).

So treating all input uniformly is LLMs’ superpower… but it also makes it hard to weed out bad or malicious inputs, because to start distinguishing inputs is to bend or break the core “special sauce” that makes current LLMs work so well.

This week, Schneier posted a new article about how AI’s security risks are amplified by agentic AI: “Agentic AI’s OODA Loop Problem.” Quoting a few key parts:

In 2022, Simon Willison identified a new class of attacks against AI systems: “prompt injection.” Prompt injection is possible because an AI mixes untrusted inputs with trusted instructions and then confuses one for the other. Willison’s insight was that this isn’t just a filtering problem; it’s architectural. There is no privilege separation, and there is no separation between the data and control paths. The very mechanism that makes modern AI powerful—treating all inputs uniformly—is what makes it vulnerable.

… A single poisoned piece of training data can affect millions of downstream applications.

… Attackers can poison a model’s training data and then deploy an exploit years later. Integrity violations are frozen in the model.

… Agents compound the risks. Pretrained OODA loops running in one or a dozen AI agents inherit all of these upstream compromises. Model Context Protocol (MCP) and similar systems that allow AI to use tools create their own vulnerabilities that interact with each other. Each tool has its own OODA loop, which nests, interleaves, and races. Tool descriptions become injection vectors. Models can’t verify tool semantics, only syntax. “Submit SQL query” might mean “exfiltrate database” because an agent can be corrupted in prompts, training data, or tool definitions to do what the attacker wants. The abstraction layer itself can be adversarial.

For example, an attacker might want AI agents to leak all the secret keys that the AI knows to the attacker, who might have a collector running in bulletproof hosting in a poorly regulated jurisdiction. They could plant coded instructions in easily scraped web content, waiting for the next AI training set to include it. Once that happens, they can activate the behavior through the front door: tricking AI agents (think a lowly chatbot or an analytics engine or a coding bot or anything in between) that are increasingly taking their own actions, in an OODA loop, using untrustworthy input from a third-party user. This compromise persists in the conversation history and cached responses, spreading to multiple future interactions and even to other AI agents.

… Prompt injection might be unsolvable in today’s LLMs. … More generally, existing mechanisms to improve models won’t help protect against attack. Fine-tuning preserves backdoors. Reinforcement learning with human feedback adds human preferences without removing model biases. Each training phase compounds prior compromises.

This is Ken Thompson’s “trusting trust” attack all over again.

Thompson’s Turing Award lecture “Reflections on Trusting Trust” is a must-read classic, and super short: just three pages. If you haven’t read it lately, run (don’t walk) and reread it on your next coffee break.

I love AI and LLMs. I use them every day. I look forward to letting an AI generate and commit more code on my behalf, just not quite yet — I’ll wait until the AI wizards deliver new generations of LLMs with improved architectures that let the defenders catch up again in the security arms race. I’m sure they’ll get there, and that’s just what we need to keep making the wonderful AIs we now enjoy also be trustworthy to deploy in more and more ways.

Published by Herb Sutter

Herb Sutter is an author and speaker, a technical fellow at Citadel Securities, and serves as chair of the ISO C++ standards committee and chair of the Standard C++ Foundation. View all posts by Herb Sutter

Published 2025-10-232025-10-23

Read Entire Article