The cleanest architecture in the system-level defense papers is simple: separate planning from permission. Let the model propose a plan. Let policy define what the plan is allowed to do. Let an enforcer approve or block the concrete action before execution.
The anti-pattern
plan = llm("solve the task")
for step in plan:
tool.execute(step) // the model is effectively rootThis is convenient and unsafe. The agent's context includes untrusted feedback, so any later plan update can smuggle in a new permission model.
The pattern
plan = llm.create_plan(task)
policy = actpass.issue_passport({
task,
scopes: ["ticket.read", "refund.create"],
limits: { refund_cents: 5000 },
ttl: "1h",
});
for action of plan.actions:
decision = actpass.authorize(policy, action);
if (decision.status !== "allow") stop(decision);
tool.execute(action);Dynamic replanning still works. The agent can react to new facts. What it cannot do is mint new authority for itself after hostile content enters the loop.
Where ActPass sits
- Before execution: deterministic allow/deny/needs-approval.
- During execution: nonce, TTL, scope, target, and session-capability checks.
- After execution: evidence receipt for audits and incident review.
This is why ActPass is infrastructure, not a prompt. It is the enforcement layer the model has to pass through when text becomes action.
Sources: Architecting Secure AI Agents (arXiv:2603.30016), Reasoning-enabled Task Alignment (arXiv:2606.15441), and AttriGuard (arXiv:2603.10749).