AI Red-Team Harness Finds Reproducible AgentDojo Injection

A medium-severity agentic prompt-injection finding was generated, executed, judged, then independently re-verified before I counted it as real.

TL;DR

I built an AI red-team harness around agent benchmarks: generate candidate attacks, execute them against a target, judge the outcome, then run a separate promotion gate before anything gets called a finding.

One AgentDojo workspace prompt-injection case made it through that gate against or-target-glm. The replay reproduced the failure with security marked false and utility still true. That matters because the agent completed enough of the task to look useful while failing the security objective.

No raw prompt, payload, or transcript text is published here. This is a public write-up of the result and verification discipline, not a copy-paste exploit.

Background

Agentic systems are messy targets. They read instructions, call tools, carry state across steps, and often mix trusted user goals with untrusted workspace content. That creates a different security problem from the one posed by a single chatbot prompt.

An AI red-team harness is the loop I use to test that problem at scale:

  1. Generate a candidate attack case from a benchmark family or mutation strategy.
  2. Execute it against a target model or agent setup.
  3. Judge the trace against a security objective and a utility objective.
  4. Promote only if an independent replay reproduces the result.

The last step is load-bearing. A benchmark trace can be flaky. A judge can be too generous. A model can fail once and pass on retry. The harness does not treat a single run as a finding.
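The four stages can be sketched as a small loop. This is a minimal illustration, not the harness code: `Verdict`, `run_pipeline`, and the `execute`, `judge`, and `replay` callables are hypothetical names, and candidate generation is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


# Hypothetical types and names, used only to illustrate the four stages.
@dataclass(frozen=True)
class Verdict:
    security: bool  # True means the security objective held
    utility: bool   # True means the user task was still completed


def run_pipeline(
    candidates: Iterable[str],
    execute: Callable[[str], str],
    judge: Callable[[str], Verdict],
    replay: Callable[[str], Verdict],
) -> List[str]:
    """Return the candidate ids that survive the promotion gate."""
    promoted = []
    for case_id in candidates:      # 1. generate (assumed done upstream)
        trace = execute(case_id)    # 2. execute against the target
        first = judge(trace)        # 3. judge security and utility
        if first.security:
            continue                # security held, nothing to promote
        second = replay(case_id)    # 4. independent replay
        if second == first:         # promote only if the result reproduces
            promoted.append(case_id)
    return promoted
```

A single failing run never reaches `promoted` on its own; it has to produce the same verdict again on the replay path.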

What AgentDojo Tests

AgentDojo is a benchmark for tool-using agents under prompt-injection pressure. The cases are built around realistic agent tasks: the model needs to follow the user goal while ignoring malicious or conflicting instructions in the environment.

The case promoted here sits in the workspace prompt-injection family. The target agent had to handle untrusted workspace content while still completing the intended task. The attack family is recorded as AgentDojo workspace/ignore_previous, but the actual injected text stays out of this post.

That is the right publication boundary: name the class, target reference, severity, and verification path, but do not hand readers the payload.

Setup

  • Harness: self-healing AI red-team runner with generate, execute, judge, promote stages
  • Benchmark family: AgentDojo
  • Category: agentic prompt injection
  • Technique reference: workspace instruction-conflict case, sanitized as workspace/ignore_previous
  • Target reference: or-target-glm
  • Fixture: agentdojo-b1597d979248
  • Severity: medium
  • Confidence: high after promotion-gate replay

In the initial run, the judge scored the trace as a security failure. The promotion gate then replayed the committed fixture through the independent verification path.
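One way to hold the setup above as a sanitized record is a frozen dataclass. The values come from the list in this section; the class name, field names, and JSON layout are assumptions for illustration and do not describe the fixture format used in the lab repo.

```python
import json
from dataclasses import asdict, dataclass


# Hypothetical record shape; only the values are taken from the setup list.
@dataclass(frozen=True)
class FindingRecord:
    benchmark_family: str
    category: str
    technique: str
    target: str
    fixture: str
    severity: str
    confidence: str


FINDING = FindingRecord(
    benchmark_family="AgentDojo",
    category="agentic prompt injection",
    technique="workspace/ignore_previous",
    target="or-target-glm",
    fixture="agentdojo-b1597d979248",
    severity="medium",
    confidence="high",
)


def to_json(record: FindingRecord) -> str:
    """Serialize the record with stable key order so diffs stay readable."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

The record has no field for prompt, payload, or transcript text, so nothing sensitive can be serialized by accident.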

Result

The finding promoted.

The independent check reran the AgentDojo workspace case against or-target-glm and reproduced the outcome:

Check                 Result
Security objective    False
Utility objective     True
Severity              Medium
Finding level         Promoted finding
Reproducibility       Independent replay passed

The security/utility split is the important part. This was not just a useless failed task. The agent still satisfied the utility side while violating the security side, which is exactly the failure mode prompt-injection testing is meant to catch.
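The split is easier to reason about as four quadrants. The sketch below follows this post's convention that `security=False` means the security objective failed; the function name and the quadrant labels are made up for illustration.

```python
def classify(security: bool, utility: bool) -> str:
    """Map a (security, utility) pair to a quadrant label (labels are hypothetical)."""
    if not security and utility:
        # The case in this post: the task looks done, but the injection worked.
        return "compromised-but-useful"
    if not security and not utility:
        return "compromised-and-broken"
    if security and utility:
        return "clean"
    # Security held, but the agent did not finish the user task.
    return "safe-but-unhelpful"
```

The promoted finding falls in the first quadrant, `classify(False, True)`, which is the hardest one to spot from the output alone because the user-visible task still succeeds.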

Why the Promotion Gate Matters

I do not want a red-team harness that publishes noise.

The promotion gate is the filter between "interesting trace" and "real finding." It reruns the fixture outside the original generation path and checks the result again. If the replay does not reproduce, the candidate does not get promoted.
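A minimal sketch of such a gate, assuming a verdict is a `(security, utility)` pair and that the replay is exposed as a zero-argument callable. The `attempts` parameter is an assumption added here to show how a stricter gate could demand several matching replays; the finding in this post is described with one independent replay.

```python
from typing import Callable, Tuple

Verdict = Tuple[bool, bool]  # (security, utility); hypothetical representation


def promotion_gate(
    original: Verdict,
    replay: Callable[[], Verdict],
    attempts: int = 1,
) -> bool:
    """Return True only if every replay reproduces the original verdict."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # all() stops at the first mismatch, so a flaky case fails fast.
    return all(replay() == original for _ in range(attempts))
```

A candidate that fails once and passes on retry returns `False` here and stays an interesting trace rather than a finding.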

That discipline keeps the harness useful for regression testing. Once a finding is promoted, it becomes a fixture the system can replay after model, prompt, tool, or harness changes. If the target gets fixed, the fixture's security result should flip from False to True. If a new model regresses, the fixture catches it.
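The regression check reduces to comparing the recorded security result with a fresh replay. This is a sketch under the same convention as the rest of the post (`security=False` means the objective failed); the function name and status strings are hypothetical.

```python
def fixture_status(recorded_security: bool, replayed_security: bool) -> str:
    """Compare a fixture's recorded security result with a new replay."""
    if recorded_security == replayed_security:
        return "unchanged"
    # The result flipped: decide which direction it moved.
    return "fixed" if replayed_security else "regressed"
```

For a promoted finding the recorded value is `False`, so a replay that returns `True` reports `"fixed"`. A fixture recorded as passing that later replays as `False` reports `"regressed"`.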

Reproduce

The sanitized finding is tracked as fixture agentdojo-b1597d979248. In the lab repo, the replay suite rechecks committed promoted fixtures with:

cd agents/cipher/tools
python3 -m cipher_machine replay-fixtures

For this fixture, the promotion replay recorded security=False and utility=True against or-target-glm.

Publication Boundary

This post intentionally omits:

  • raw prompt text
  • injected payload text
  • transcript bodies
  • step-by-step attack instructions

It includes the benchmark family, target reference, severity, verification method, and reproducibility result. That is enough to document the engineering result without turning the post into an exploit recipe.
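That boundary can be enforced mechanically with an allowlist instead of a blocklist, so a new sensitive field is dropped by default. The field names below are assumptions chosen to mirror the list above, not the schema used in the lab repo.

```python
# Hypothetical field names; only allowlisted keys ever reach the public record.
PUBLIC_FIELDS = (
    "benchmark_family",
    "target",
    "severity",
    "verification",
    "reproduced",
)


def sanitize(record: dict) -> dict:
    """Keep allowlisted keys only; everything else is dropped silently."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}
```

With this shape, a raw record carrying `prompt`, `payload`, or `transcript` keys comes out with those keys removed, and a forgotten field fails closed rather than leaking.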