A year ago, Claude could barely string together a few bash commands without getting tangled in its own escapes. Today, almost all of Claude Code is written by Claude Code, and the agent can run productively for days at a stretch. The model got better, yes. But the bigger unlock is the harness — the scaffolding around the model that lets it survive context rot, lazy self-judgment, and the slow drift that ruins long runs.
Ash Prabaker and Andrew Wilson from Anthropic's Applied AI team walked through how that harness actually works. The summary is short and worth memorising.
Three things break long-running agents. Context windows are finite, so models develop "context anxiety" near the end and rush. They are mediocre at planning, so they half-build features and stop. And — most damaging — they are terrible at judging their own output. A model will look at a button with no backend behind it and confidently declare the feature done.
Throwing one model at all three problems is what most people do. It is the wrong shape.
The fix is adversarial. Split the work into three roles, each with its own context window and its own system prompt.
- Planner — takes a one-line ask ("build a retro game maker") and converts it into a high-level spec. Not granular technical details — those cascade errors. Just the outer lines of what the product should be.
- Generator — writes the code.
- Evaluator — opens the live app in Playwright, clicks around, screenshots, grades against an explicit rubric, and hands back a critique.
The clever bit is the contract. Before the generator writes a line, it negotiates with the evaluator over a markdown file on disk: I will build X, and you will verify it by testing Y. The evaluator pushes back — too vague, missing edge cases, scope too big — until they agree. Then build starts. Then the evaluator grades against the contract the two of them wrote, not the original spec.
This sounds like a PM/IC/QA team. That is exactly the point. Anthropic did not invent the pattern. They just gave each role its own clean context.
But if the evaluator is also an LLM, why doesn't it rubber-stamp everything? Because tuning a standalone critic to be harsh is tractable. Tuning a generator to be self-critical is not. Same gap as humans — it is easy to critique a meal, much harder to cook one. The harness exploits that gap.
The proof is a "build a retro game maker" prompt run two ways. Solo Claude Code produced something that looked fine on the opening screen — palette, canvas, frame timeline — but the moment you pressed an arrow key in play mode, nothing happened. No physics loop. No collisions. The model had no idea what playing a game actually meant.
Run the same prompt through the planner/generator/evaluator harness for six hours and $200. It named itself RetroForge. It generated a 54-colour palette. It built an AI level-assistant — a feature the prompt never mentioned, that the planner decided products like this should have. Play mode actually worked: physics, collisions, a debug HUD in the corner with live numbers because the evaluator needed those numbers to test the game. The evaluator caught FastAPI route-ordering bugs that pass unit tests but break in prod. Twenty-seven contract criteria, every one of them granular enough to be actionable.
The harness was the entire difference.
A few rules that fall out of all this:
- You can grade taste. Most people say you can't, so they don't try. Anthropic writes a rubric — design, originality, craft, functionality, weighted by which model is in play — and shows the evaluator reference images of "this is good" and "this is AI slop." Taste converges on whoever sets the rubric. Stop pretending subjective quality is ungradable.
- Compaction is not coherence. Lossy summaries drift. Structured handoffs through files on disk beat shoving everything into one context window.
- The pivot is the magic. A single Claude Code session keeps patching the same broken thing. A generator/evaluator pair will throw the whole codebase away after ten failed passes and start clean. That is something the generator alone almost never does about its own work.
- The harness evolves as the model does. Sonnet 3.7 needed aggressive context resetting and forced sprint decomposition. Opus 4.6 holds two-hour continuous builds coherently and didn't need either. The right question is not "is harness design dead?" — it is "which scaffolding can I delete this model generation?"
- Read the traces by hand. Not summaries. Not dashboards. The actual logs. Empathising with what the model saw is the only way to know where its judgment diverged from yours. Then tune the prompt for exactly that gap.
The unsexy meta-lesson is that the frontier of agent design is not the model. It is the willingness to write down what "done" means in granular, harsh, opinionated detail — and to let two instances of the same model fight each other over whether you got there.
Self-evaluation is a trap. Use an adversary.
Source: Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
No comments:
Post a Comment