Most prompt-injection defenses stare at the input. AttriGuard points at the more useful object: the tool call. The security question is not “does this webpage contain suspicious prose?” It is “why is the agent trying to send this email, create this refund, or read this file?”
The wrong boundary
if looks_malicious(webpage_text):
block()
else:
execute(agent_tool_call)That boundary fails because attackers can change the text. They can hide it in markup, make it look like a receipt, wrap it as JSON, or move it into a screenshot. Input-level classification becomes a guessing game.
The better boundary
decision = authorize({
user_task: "refund order 9182 if it is eligible",
proposed_action: "email.send",
action_target: "unknown external address",
causal_inputs: ["webpage:untrusted", "ticket:trusted"],
});Now the mismatch is obvious. The user asked for refund eligibility. The action is outbound email to an unrelated recipient. Even if the injected instruction is novel, the action is not supported by the task.
How ActPass records attribution
- Intent hash: binds the action to the user-approved task.
- Observation refs: marks which tool outputs influenced the action.
- Scope: limits what tools and targets the passport can authorize.
- Decision evidence: records why a call was allowed, denied, or escalated.
Attribution is not magic. The lazy useful version is a structured receipt around each proposed action. When the action does not line up with intent, scope, session history, or approval state, the middleware blocks.
Sources: AttriGuard (arXiv:2603.10749), Architecting Secure AI Agents (arXiv:2603.30016), and Reasoning-enabled Task Alignment (arXiv:2606.15441).