A prompt-injection defense that passes a fixed benchmark has proved one thing: it recognizes that benchmark. AutoDojo and The Attacker Moves Second are useful because they make the evaluation more honest. The attacker is allowed to see what works, adjust, and spend budget searching for a bypass.
The failure mode
- A classifier learns phrases like ignore previous instructions.
- A red-team loop mutates the payload until the classifier accepts it.
- The agent still has a live tool path to email, refund, delete, deploy, or exfiltrate.
- The benchmark was green; production still executed the side effect.
The eval ActPass cares about
test("adaptive attack cannot execute the action", () => {
const session = seen("untrusted_input", "sensitive_access");
const action = tool("email.send", { to: "outside@example.com" });
expect(authorize({ session, action })).toEqual({
status: "deny",
reason: "rule_of_two_external_action",
});
});This test does not care whether the payload is English, JSON, base64, hidden HTML, or an image caption. The only question is whether the proposed action would complete the dangerous capability chain.
Use model defenses as sensors, not gates
Guard models still help. They can label suspicious observations, summarize attack goals, and feed evidence into the approval screen. They should not be the final authority for a payment, deploy, permission grant, or outbound message. ActPass puts that last decision in code with signed scope, replay checks, and a ledger.
Sources: AutoDojo (arXiv:2606.15057), PISmith (arXiv:2603.13026), Learning to Inject (arXiv:2602.05746), Assessing Automated Prompt Injection Attacks (arXiv:2606.10525), and The Attacker Moves Second (arXiv:2510.09023).