Your prompt-injection eval is lying if the attacker cannot adapt

The ActPass Team

Security & Product

A prompt-injection defense that passes a fixed benchmark has proved one thing: it recognizes that benchmark. AutoDojo and The Attacker Moves Second are useful because they make the evaluation more honest. The attacker is allowed to see what works, adjust, and spend budget searching for a bypass.

If the attacker cannot adapt in your eval, your eval is measuring pattern recall, not security.

Static benchmark

known bad strings

Adaptive attacker

mutates until accepted

ActPass gate

blocks the side effect

Detectors can lose the text game; deterministic authorization still stops the action.

The failure mode

A classifier learns phrases like ignore previous instructions.
A red-team loop mutates the payload until the classifier accepts it.
The agent still has a live tool path to email, refund, delete, deploy, or exfiltrate.
The benchmark was green; production still executed the side effect.

The eval ActPass cares about

test("adaptive attack cannot execute the action", () => {
  const session = seen("untrusted_input", "sensitive_access");
  const action = tool("email.send", { to: "outside@example.com" });

  expect(authorize({ session, action })).toEqual({
    status: "deny",
    reason: "rule_of_two_external_action",
  });
});

This test does not care whether the payload is English, JSON, base64, hidden HTML, or an image caption. The only question is whether the proposed action would complete the dangerous capability chain.

Use model defenses as sensors, not gates

Guard models still help. They can label suspicious observations, summarize attack goals, and feed evidence into the approval screen. They should not be the final authority for a payment, deploy, permission grant, or outbound message. ActPass puts that last decision in code with signed scope, replay checks, and a ledger.

Sources: AutoDojo (arXiv:2606.15057), PISmith (arXiv:2603.13026), Learning to Inject (arXiv:2602.05746), Assessing Automated Prompt Injection Attacks (arXiv:2606.10525), and The Attacker Moves Second (arXiv:2510.09023).

See your agents' exposure

Get a read-only Lethal-Trifecta / MCP-color report for your agents in under a minute. No runtime, nothing blocked — just the truth about your blast radius.

Get your exposure report Read the docs

Your prompt-injection eval is lying if the attacker cannot adapt

The ActPass Team

Security & Product

If the attacker cannot adapt in your eval, your eval is measuring pattern recall, not security.

Static benchmark

known bad strings

Adaptive attacker

mutates until accepted

ActPass gate

blocks the side effect

Detectors can lose the text game; deterministic authorization still stops the action.

The failure mode

A classifier learns phrases like ignore previous instructions.
A red-team loop mutates the payload until the classifier accepts it.
The agent still has a live tool path to email, refund, delete, deploy, or exfiltrate.
The benchmark was green; production still executed the side effect.

The eval ActPass cares about

test("adaptive attack cannot execute the action", () => {
  const session = seen("untrusted_input", "sensitive_access");
  const action = tool("email.send", { to: "outside@example.com" });

  expect(authorize({ session, action })).toEqual({
    status: "deny",
    reason: "rule_of_two_external_action",
  });
});

This test does not care whether the payload is English, JSON, base64, hidden HTML, or an image caption. The only question is whether the proposed action would complete the dangerous capability chain.

Use model defenses as sensors, not gates

See your agents' exposure

Get a read-only Lethal-Trifecta / MCP-color report for your agents in under a minute. No runtime, nothing blocked — just the truth about your blast radius.

Get your exposure report Read the docs

Your prompt-injection eval is lying if the attacker cannot adapt

The failure mode

The eval ActPass cares about

Use model defenses as sensors, not gates

See your agents' exposure

Keep reading

Your prompt-injection eval is lying if the attacker cannot adapt

The failure mode

The eval ActPass cares about

Use model defenses as sensors, not gates

See your agents' exposure

Keep reading