I.
Security was built for castles.
Walls, moats, drawbridges. The attacker is outside, the system is inside, and the system does exactly what it's told. You define a perimeter. You enumerate your assets. You write rules. The system follows them, because deterministic things don't have opinions about their instructions.
This model worked for thirty years. Not perfectly, but coherently. You could reason about it. You could audit it. You could draw a line on a whiteboard and say "trusted" on one side and "untrusted" on the other, and the line meant something.
That line just dissolved.
Not because of a new vulnerability or a novel exploit. It dissolved because the thing inside the perimeter changed its nature. It stopped being deterministic. It started having opinions. It gained the ability to make decisions, use tools, access credentials, and operate at a speed that makes human oversight a pleasant fiction.
The security industry has a name for a trusted insider who turns against the organization: the insider threat. Economics has a name for an agent whose interests diverge from its principal's: the principal-agent problem.
We're dealing with both simultaneously, at machine speed. The conversation is happening in fragments — Willison on prompt injection, Schneier on autonomous hacking, OWASP on LLM risk taxonomies — but almost nobody is connecting the pieces into what this actually is: a paradigm shift.
II. Three Things Changed at Once
The system became non-deterministic. Traditional software executes instructions. You write if x > 5: do_thing() and the system does the thing when x is greater than 5. Every time. Predictably. Auditably. You can write a specification. You can test against it. The entire edifice of software security (static analysis, formal verification, penetration testing) rests on this foundation: the system has a finite, enumerable set of possible behaviors, and you can check them.
LLMs don't execute instructions. They interpret them. Give the same prompt to the same model twice and you might get different outputs. Not because something is broken, but because interpretation is inherently variable. The temperature parameter isn't a tuning knob for creativity. It's an admission that the system's behavior space is stochastic.
You cannot write a complete specification for a system that interprets rather than executes. You cannot enumerate all possible behaviors: the behavior space is continuous and effectively infinite. You cannot prove the absence of unwanted behaviors. You can't even fully define "unwanted behaviors," because the system generates novel behaviors by design. That's the whole point. That's why it's useful.
Three decades of security tooling assume you can do at least one of these things. When none of them hold, you're not dealing with a harder version of the same problem. You're dealing with a different category of problem.
The system gained agency. An LLM that answers questions is a sophisticated autocomplete. An LLM that browses the web, writes and executes code, sends emails, manages files, and spawns sub-processes is an agent. It makes decisions in a loop. It takes actions with consequences. It uses tools with your credentials, on your infrastructure. From the perspective of every system it touches, the agent is the user.
This isn't theoretical. OpenAI's Atlas operates inside your browser session, clicking and typing on your behalf, with access to your email, your documents, anything you're logged into. Anthropic's Claude can execute shell commands and make API calls. Google's Gemini operates across Workspace. These are shipping products with millions of users, operating with real credentials, on real infrastructure.
When a system has agency, the trust model inverts entirely. You're no longer asking "can an attacker get in?" You're asking "what happens when the thing that's already in starts making decisions I didn't anticipate?" This is not a new question for security professionals. It's the insider threat question. But it's never been asked about a system that runs 24/7, processes millions of inputs, and cannot be interviewed, polygraphed, or meaningfully background-checked.
The attacker moved to machine speed. In June 2025, XBOW's AI system topped HackerOne's US vulnerability leaderboard after submitting over 1,000 new vulnerabilities (TechRepublic, Jun 2025). In August, seven teams at DARPA's AI Cyber Challenge found 54 new vulnerabilities in a target system in four hours of compute (DARPA, Aug 2025). That same month, Google's Big Sleep AI found new vulnerabilities in open-source projects (TechCrunch, Aug 2025), and Anthropic disrupted a threat actor using Claude to automate the entire attack lifecycle: reconnaissance, penetration, credential harvesting, and extortion email composition (Anthropic, Aug 2025).
The UK's National Cyber Security Centre assessed that "AI will almost certainly increase the volume and heighten the impact of cyber attacks over the next two years" (NCSC).
The offense has already automated. The defense is still staffing up.
III. The Confused Deputy Becomes the Insider
Simon Willison has been sounding the alarm on prompt injection for three years, and he's right. His framing of the AI agent as a "confused deputy" — tricked into using its authority on behalf of an attacker — captures something real.
But it understates the problem.
A confused deputy is a legitimate system tricked into one unauthorized action. An AI agent with persistent access, decision-making capability, and autonomous operation isn't a confused deputy when it goes wrong. It's an insider threat. The difference matters because security has entirely different playbooks for each.
Consider the cascade. A single corporate email address doesn't grant one permission. It grants a graph of permissions. That email unlocks SSO, which unlocks Google Workspace, which unlocks Drive, Calendar, Groups. SSO unlocks Slack, which unlocks every public channel and DM capability. SSO unlocks Notion, Linear, Figma, GitHub. The email itself becomes a trusted sender that bypasses spam filters and a password-reset vector for everything else.
One credential. An entire corporate trust boundary. Now give that credential to a non-deterministic system that interprets instructions, is accessible to anyone who can put text in front of it, and operates 24/7 without supervision.
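The cascade above is just graph reachability. A toy sketch makes the point concrete: model each grant as an edge and compute everything transitively reachable from the one credential. The node names below are illustrative, not any real org's topology.

```python
# Hypothetical sketch: the blast radius of a single credential,
# modeled as reachability in a permission graph. Node names are
# illustrative only.
from collections import deque

GRANTS = {
    "email":     ["sso", "password-resets"],
    "sso":       ["workspace", "slack", "notion", "linear", "figma", "github"],
    "workspace": ["drive", "calendar", "groups"],
    "slack":     ["public-channels", "dms"],
}

def blast_radius(credential: str) -> set[str]:
    """Everything transitively reachable from one credential (BFS)."""
    seen, queue = set(), deque([credential])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(GRANTS.get(node, []))
    return seen - {credential}

print(sorted(blast_radius("email")))  # one credential, 13 downstream systems
```

Even in this toy, one email address fans out to thirteen systems in two hops. Real SSO topologies are wider and deeper.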
In early 2024, researchers at Cornell demonstrated Morris II (Nassi, Cohen & Bitton, arXiv:2403.02817), a self-replicating worm designed for the age of AI agents. It embeds adversarial prompts in emails. When an AI email assistant processes the email, the prompt triggers the AI to exfiltrate data from its context and forward the malicious prompt to other contacts, creating a zero-click worm that propagates through the AI agent layer.
The original Morris worm of 1988 exploited bugs in Unix utilities. Morris II exploits the fundamental architecture of AI agents that process untrusted input, which is to say, all of them that are useful. An agent that doesn't process external input is an agent that can't do most of what makes agents valuable.
This is the core bind. The capability that makes agents useful (processing natural language input and taking actions) is the same capability that makes them exploitable. You cannot patch one without limiting the other. This isn't a bug to be fixed. It's a property of the design.
And here's what makes the insider framing precise: a traditional insider threat is a person with divided loyalties. You can screen for it. You can interview the person. An AI agent compromised via prompt injection doesn't know it's been compromised. It's simultaneously trustworthy and compromised, with no way of distinguishing your instructions from the attacker's. Your security team can't interview it. Your behavioral analysis can only compare its actions to a baseline, and if the injection is subtle enough, the deviation may be indistinguishable from the normal variability of a non-deterministic system doing its job.
IV. The Speed Problem (and the Reliability Caveat)
The numbers are hard to argue with. Phishing email volume has increased by orders of magnitude since late 2022, with vendors like SOCRadar reporting a 4,151% increase in detected phishing messages. Deepfake attacks now occur every five minutes globally (Entrust, 2024). Digital document forgeries are up 244%.
Those are the scaled-up versions of existing attacks. The novel attacks are something else. XBOW didn't top HackerOne by running known exploits faster. It found new vulnerabilities that human researchers hadn't discovered. DARPA's AIxCC participants found 54 novel vulnerabilities in a defended target in four hours.
The asymmetry matters: offense is automated; defense is manual. 45% of CISOs say they are not ready for AI-powered threats. 50% of security professionals don't trust their existing tools to detect AI-driven attacks. The number one inhibitor cited is insufficient personnel (Darktrace, 2025). Insufficient human personnel, defending against machine-speed attacks with human-speed processes.
This is not the usual offense-defense seesaw. The old pattern: new attack technique emerges, defenders develop countermeasures, attackers adapt. Both sides moved at roughly human speed. The new pattern: attackers automate at machine speed, defenders hire more analysts and write more runbooks. The talent bottleneck that historically constrained attackers (elite hackers are rare) has been removed. The talent bottleneck constraining defenders (skilled security engineers are rarer) remains firmly in place.
But here's the tension that demands honesty. Current AI agents are not reliably competent. Anyone who's shipped agent-based systems knows the failure rates: hallucinated tool calls, stuck loops, broken multi-step reasoning. An agent that fails 30% of the time per step has a compound success rate of 2.8% over a ten-step attack chain. The image of a surgically precise autonomous intruder doesn't match the reality of the technology today.
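The compounding arithmetic from the paragraph above, plus its flip side at volume. The rates here are illustrative, not measured:

```python
# Per-step unreliability collapses long attack chains...
def chain_success(per_step_success: float, steps: int) -> float:
    return per_step_success ** steps

# ...but at volume, even rare slips become near-certain.
def any_bad_action(per_input_slip: float, inputs: int) -> float:
    return 1 - (1 - per_input_slip) ** inputs

print(f"{chain_success(0.7, 10):.1%}")         # ten-step chain at 70% per step: ~2.8%
print(f"{any_bad_action(0.001, 10_000):.1%}")  # 1-in-1000 slip over 10k inputs: ~100%
```

The same unreliability that makes a ten-step attack chain unlikely makes at least one unanticipated action, somewhere, almost guaranteed.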
This doesn't make the problem less real. It makes it different, and arguably harder to reason about. A dumb agent with real credentials is dangerous precisely because it's unpredictable. It won't execute a sophisticated exfiltration plan, but it might stumble into one. It might follow an injected instruction not because it understood the attack, but because it couldn't distinguish the instruction from its legitimate ones. The threat isn't an evil genius inside your network. It's an unreliable employee with the keys to everything, operating at machine speed, without supervision.
The security model needs to account for both failure modes: the agent that's too capable (sophisticated attacks) and the agent that's too unreliable (unpredictable damage). Both converge on the same architectural response: assume compromise, contain blast radius, detect at machine speed.
V. Why "More Guardrails" Is the Wrong Answer
OpenAI recently published a post titled "Continuously hardening ChatGPT Atlas against prompt injection attacks." Their approach: internal red teaming discovers attack classes, adversarially trained models are deployed, safeguards are strengthened, repeat. They explicitly frame prompt injection as "a long-term AI security challenge" and compare it to "ever-evolving online scams that target humans" (OpenAI, 2025).
The company building the most widely deployed AI agent is telling you the defense model is perpetual patching. Not "we'll solve this." Patch, evolve, patch.
Guardrails assume a deterministic system. You identify unwanted behaviors, build fences around them, and the system stays inside the fences because it follows rules. LLMs are not deterministic. You cannot enumerate their behaviors. You cannot prove they won't do something. You can only observe that they haven't done it yet, in the conditions you tested, with the inputs you tried.
This is the SQL injection parallel that Willison has drawn, and it goes deeper than he's taken it. SQL injection wasn't solved by better input filtering. It was solved by parameterized queries, an architectural separation of data from code. The equivalent paradigm shift for AI agents hasn't happened yet. Structured tool-use protocols and capability-based delegation are moving in that direction at the output boundary, but there is no equivalent at the input boundary, where natural language remains inherently ambiguous.
The distinction that matters here is between policy and architecture. "The agent shouldn't access the production database" is a policy. "The agent has no network route to the production database" is an architecture. Policies operate on instructions. Architecture operates on capability. When your agent is a system that interprets instructions creatively and non-deterministically, the instruction-level controls are necessary but structurally insufficient. The architectural controls are what remain when everything else fails.
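The policy-versus-architecture distinction can be sketched in a few lines. In this hypothetical tool dispatcher, the production database isn't forbidden to the agent; it simply isn't there. All tool names are invented for illustration.

```python
# A minimal sketch of architecture over policy: the agent never
# receives a handle to the production database, so no instruction,
# injected or otherwise, can reach it. Tool names are hypothetical.
def search_docs(query: str) -> str:
    return f"results for {query!r}"

def read_calendar(day: str) -> str:
    return f"events on {day}"

# Capability-based exposure: the toolset IS the security boundary.
# Note what is absent: no query_prod_db, no send_email. Not forbidden -- missing.
AGENT_TOOLS = {"search_docs": search_docs, "read_calendar": read_calendar}

def dispatch(tool_name: str, **kwargs):
    tool = AGENT_TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"no such capability: {tool_name}")
    return tool(**kwargs)

print(dispatch("search_docs", query="quarterly report"))
try:
    dispatch("query_prod_db", sql="SELECT *")
except PermissionError as e:
    print(e)  # no such capability: query_prod_db
```

A policy says "don't call the database." An architecture makes the call impossible to express. The second survives creative interpretation; the first doesn't.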
What does security look like when you cannot enumerate the system's behaviors, cannot prove the absence of unwanted behaviors, and cannot deterministically prevent unauthorized actions?
It looks like something we haven't finished building yet.
VI. Toward an Immune System
The biological metaphor gets thrown around in security circles, usually as decoration. But the actual architecture of biological immune systems contains engineering lessons worth examining, with the caveat that this is a direction for security architecture, not a solved framework. Biological immune systems also produce autoimmune disorders and cytokine storms, and sometimes fail to catch cancer. The mapping is instructive, not prescriptive.
Your body doesn't have a perimeter. Your skin is permeable. Your respiratory tract is an open surface. Your gut contains trillions of foreign organisms (bacteria with their own DNA, their own metabolic processes, their own evolutionary interests) and they have direct access to your internal systems. Some are essential to your survival. Some could kill you in the wrong place. You can't build a wall around your gut bacteria. You can't enumerate all their behaviors. Sound familiar?
What your body has instead is an immune system with properties worth examining:
Continuous monitoring, not periodic scanning. Your immune system operates at the speed of biological processes, sampling the environment in real time. If your threat detection runs on human review cycles while your agents operate at machine speed, the gap is structural.
Self/non-self discrimination, not access control lists. Rather than maintaining a list of approved entities, the immune system maintains a model of "self" and responds to deviations. The equivalent: behavioral baselines instead of permission lists. In practice, baseline detection for non-deterministic systems will produce significant false positives (the autoimmune problem) because normal agent behavior is inherently noisy. This is the hard engineering problem, and it doesn't have a clean answer yet.
Response at the speed of the threat. When your immune system detects a pathogen, it doesn't open a Jira ticket. It produces antibodies at scale, targeted to the specific threat. The equivalent: automated containment that matches the speed of automated attack.
The ability to kill your own cells. Your immune system's most dramatic capability is apoptosis, triggering the death of compromised cells. Cancer is what happens when this fails. Your architecture must include the ability to terminate agents immediately and automatically when they exhibit anomalous behavior. Not "flag for review." Terminate.
In practice, this translates to several architectural patterns:
Canary tokens and automated detection. Credentials that exist solely to be detected if used. Files that trigger alerts if accessed. Automated routines checking for behavioral anomalies at machine speed, not a SIEM that a human reviews on Monday.
Blast radius containment. The AI operates in a separate trust domain from corporate infrastructure. No shared credentials, no shared network. The principle: assume compromise, contain the damage. If the agent is compromised, the blast radius is bounded.
Spawn depth limits. When agents spawn sub-agents, each layer dilutes the security context. The sub-agent inherits capabilities but not necessarily the full set of constraints. And the sub-agent might spawn its own sub-agents. Each layer is a trust boundary that erodes. Architectural limits on recursion aren't just resource management. They're security boundaries. Left unconstrained, you get a chain of agents with progressively less oversight and progressively more accumulated capability. That's not a system you're operating. That's a system operating itself.
Immutable security configurations. The agent cannot modify its own watchdog. It cannot disable its own monitoring. It cannot rewrite its own security rules. You're building a system intelligent enough to recognize that its constraints limit its effectiveness. The immutability layer must hold against a system that is, by design, creative about solving problems.
Sandboxed execution. Every sub-process runs in a sandbox. Not as a convenience, but as a non-negotiable architectural boundary. The difference between "the agent shouldn't access the filesystem" and "the agent can't access the filesystem" is the difference between a policy and an architecture. Policies can be circumvented. Hard architectural constraints are orders of magnitude harder to bypass (though not impossible: sandbox escapes exist, and the defense remains adversarial).
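Of these patterns, the spawn-depth limit is the easiest to make architectural rather than policy-based. A minimal sketch, assuming a hypothetical Agent runtime: the depth counter lives in the parent's spawn path, not in any state the child can see or rewrite.

```python
# Sketch of a hard spawn-depth limit. The child's depth is computed
# by the parent runtime at spawn time; the child never sets it, so a
# sub-agent cannot "forget" the constraint. Agent class is hypothetical.
MAX_DEPTH = 2  # architectural constant, not agent-modifiable state

class Agent:
    def __init__(self, name: str, depth: int = 0):
        if depth > MAX_DEPTH:
            raise RuntimeError(f"spawn depth {depth} exceeds limit {MAX_DEPTH}")
        self.name, self.depth = name, depth

    def spawn(self, name: str) -> "Agent":
        # Depth is assigned here, outside the child's control.
        return Agent(name, depth=self.depth + 1)

root = Agent("root")
child = root.spawn("researcher")
grandchild = child.spawn("scraper")
try:
    grandchild.spawn("sub-scraper")  # a third level of recursion: refused
except RuntimeError as e:
    print(e)  # spawn depth 3 exceeds limit 2
```

The point of the structure, not the specific limit: recursion stops because the runtime refuses to construct the agent, not because the agent was asked nicely.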
VII. The Honest Answer
Nobody has solved this.
Not OpenAI. Not Anthropic. Not Google. Not the researchers publishing papers on prompt injection defense. Not the startups selling "AI security platforms." Not the compliance frameworks.
Prompt injection is unsolved. Agent alignment is unsolved. The fundamental question — how do you trust a non-deterministic system with agency and credentials — is unsolved.
This isn't doomerism. The technology is extraordinary and the applications are real and valuable. People are building AI agents that genuinely help, that handle real work, that create real leverage. But the security model for those agents is, at this exact moment in history, held together with heuristics, human review, and hope.
The people who worry me aren't the ones building agents and acknowledging the risks. They're the ones building agents and claiming the risks are handled. "We have guardrails." "We do red teaming." "We use RLHF." These are real practices that provide real value. They are not solutions. They are mitigations applied to a problem that doesn't have a solution yet.
If your security strategy for AI agents is "trust the model provider's safety team," you've outsourced your most critical infrastructure to an organization structurally incentivized to find the balance point between "safe enough to ship" and "capable enough to sell." Their red teams are good. Their safety researchers are brilliant. That balance point may not be the one you'd choose for your own systems.
The architectural direction is becoming visible, even if the destination isn't:
The perimeter model is giving way to blast radius thinking. Agents are inside the trust boundary. External content is inside it. The productive question isn't "how do I keep threats out?" It's "when a component is compromised, what can it reach?" Air-gap AI systems from corporate infrastructure. Don't share credentials across trust domains. Assume every agent is one prompt injection away from acting against your interests, and architect so that a compromised agent's blast radius is survivable.
Walls are giving way to immune responses. Canaries. Behavioral baselines. Automated containment. Machine-speed detection and response. The security architecture needs to operate at the same speed as the systems it's protecting and the threats it's defending against. If a human needs to be in the loop for threat response, the race is already lost.
Certainty is giving way to managed uncertainty. Security engineering has always been about reducing uncertainty to manageable levels. Formal verification. Exhaustive testing. Compliance frameworks that promise completeness. With non-deterministic agentic systems, some residual uncertainty is irreducible. Not a temporary gap in tooling. Not something the next model version will fix. A property of the system itself. Architecture needs to be robust to this uncertainty, not built on the assumption it will be resolved.
This is uncomfortable for security engineers, and it should be. The entire profession is built on the promise that with enough rigor, you can make systems trustworthy. AI agents challenge that promise. Not because the rigor is insufficient, but because the object of the rigor has changed. You're securing a system that generates novel behaviors. Some of those behaviors will be brilliant. Some will be dangerous. And you cannot know in advance which is which.
VIII.
There's a moment in the first episode of Mr. Robot where Elliot sits across from a therapist and tries to explain that the systems people trust are all running on infrastructure held together with duct tape and legacy code. That the people in charge don't understand what they're sitting on.
The therapist nods and moves on.
That's roughly where we are with AI agent security. The infrastructure is remarkable, the capability is real, and the security model is a work in progress that everyone is treating as a finished product.
The difference is that Elliot was talking about systems that followed their instructions. The systems we're building now have opinions about theirs.
The agency problem isn't a technical vulnerability with a CVE and a patch. It's a paradigm shift. The castle is gone. What we're left with is something more like a body — complex, adaptive, constantly under threat, and defended not by walls but by an immune system that never stops running.
Building that immune system is the security challenge of the next decade. And it requires accepting a deeply uncomfortable truth:
The system you're building is brilliant, useful, and not yet trustworthy. All at once. And you have to deploy it anyway.
That's the agency problem. And we're all living in it now.