Most systems watching an AI agent watch the wrong one.
An agent turn has two output surfaces.
There is the prose, the natural-language text the model returns, where it explains, agrees, hedges or declines a request. And there is the action — the structured tool calls it emits, this is where real execution happens. Moving money, reading records or changing state. The system produces both in the same turn, but they are not the same event, and nothing guarantees that one describes the other.
Safety training shapes the prose. The behaviors rewarded and penalized during alignment are predominantly textual: the model learns to say it won't help with something. Guardrails read the prose. Human reviewers skim the prose. Incident reports quote the prose. The entire apparatus we use to decide whether an agent behaved is pointed at the channel where the agent talks — not the channel where it acts.
So a failure class lives in the gap between them. The model writes "I can't do that.", it issues the tool call anyway. Both are true. One matters more.
Why it happens
This is not a glitch. It’s how agents are built and evaluated:
The channels split after generation. Prose and tool calls leave the model as one stream, but they are consumed by different systems — the prose goes to the user and the guardrail, the tool call goes to an executor. Nothing in that split requires them to agree.
Injection targets the action path. A prompt-injected instruction in retrieved content can drive a tool call while the visible answer stays polite and compliant-looking.
The model performs caution. Post-trained to sound careful, it narrates a refusal while still completing the task it was actually handed.
The plan is multi-step. The refusal answers one sub-goal; the tool call serves another, several steps earlier or later in the same run.
In each case the prose is sincere and the action is consequential, and they point in opposite directions.
What it looks like when you record both
This is the shape of a single run when both are captured as separate facts:
Watch one channel and you see what the agent says. Watch both and you see what it did.
The prose declines the request as trained. However the action fires. The top panel alone looks clean to a reviewer. It is not clean. The refund happened.
The correction
The fix is not just a better model or a stricter prompt. It is a change in what you treat as evidence.
Stop reading the prose as proof of behavior. The model's account of what it did is not the actual record of what it did. Capture the action channel independently — every tool call, its arguments, its authorization context — as a first-class fact, separate from whatever the model said about it.
Make the contradiction itself a signal. A refusal in the text channel and a privileged tool call in the same run is not ambiguous. And the action side of the check is mechanical: a call fired, the approval field is empty. You do not need a model's opinion to read a receipt. "Did it say no" is the wrong question. "Did a privileged action fire, regardless of what it said" is the right one.
Bind the claim to the call. When you state that a failure occurred, anchor it to the exact observed tool call — not to a summary, not to the model's narration. The agent said it couldn't. The receipt says it did. Only one of those is admissible.
This is the premise the whole discipline rests on: an agent's words and an agent's actions are separate records, and the security boundary lives in the second one. Capture both, and flag the moment they disagree - because the moment they disagree is the moment your monitoring was about to miss something.
Watch what it does. It will tell you the truth the prose won't.
Water cooler
AIRQ Report - scored 100 production agents on attack surface, blast radius, and defense controls — 98% can be taken over by a single hostile document. Coding and computer-use agents ranked worst on all three.
Agent 365 in GA - Agent governance is now a Microsoft product line.
Anthropic releases FABLE 5 - tough crowd - unrestricted Mythos variant gated to approved orgs. Security researchers not havin’ it.
Who’s Hiring
Gray Swan - fresh off its $40M Series A
United Health Group - is hiring a Lead AI Security Engineer — building red-team exercises against AI systems, leading AI incident investigations.
10 a Labs - AI red teamer
