The capstone, and the mission’s success criterion: take one lever, apply it to an agent you actually run, and measure that it got more reliable.
You now hold the whole map: the six-move turn, the workflow⇄agent dial and its five patterns (L2), and the four levers we went deep on (control flow in L2, context in L3, stop & budget in L4, recovery & state in L5). Knowledge without a shipped change is just fluency. This lesson turns it into storage strength the only way that sticks: by doing the real thing on TradingAgents (or whichever of your agents is shakiest).
You can’t improve what you don’t measure, and you can’t measure without an eval set. The good news from Anthropic: you start small.
Begin with the manual checks you run during development — the behaviors you verify before each release and common tasks end users try. … Start with 20–50 simple tasks drawn from real failures. — Anthropic, Demystifying Evals for AI Agents
Turn your existing manual spot-checks and past incidents into eval tasks. A good task is one where “two domain experts would independently reach the same pass/fail verdict”: unambiguous and provably solvable.[1] Grade outcomes, not the path: checking that the agent took an exact sequence of tool calls is too rigid and penalises valid solutions.[1]
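As a minimal sketch, a past incident could be captured as an eval task like this. The `EvalTask` class, its fields, the `valid_order` check, and the sample incident are all hypothetical illustrations, not part of TradingAgents or of Anthropic's guide. The point is that the check inspects only the final output, never the sequence of tool calls.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EvalTask:
    """One eval task drawn from a real failure or spot-check (hypothetical shape)."""
    task_id: str
    prompt: str                    # what the agent is asked to do
    check: Callable[[Any], bool]   # outcome-only pass/fail verdict


def valid_order(output: Any) -> bool:
    """Pass if the final output is a well-formed order, however the agent got there."""
    return (
        isinstance(output, dict)
        and output.get("action") in {"buy", "sell", "hold"}
        and isinstance(output.get("quantity"), int)
        and output["quantity"] >= 0
    )


# Hypothetical incident turned into a task.
task = EvalTask(
    task_id="incident-001",
    prompt="Decide today's position for the sample ticker.",
    check=valid_order,
)
```

Two reviewers reading `valid_order` should agree on the verdict for any given output, which is the property the paragraph above asks for.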
| Grader | Good for | Watch out |
|---|---|---|
| Code-based | Objective checks: a number, a JSON shape, a P&L sign | Brittle to valid variation |
| Model-as-judge | Open-ended output (a rationale, a summary) | Non-deterministic; calibrate against humans |
| Human | Gold-standard judgement calls | Slow and expensive |
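Calibrating a model-as-judge against humans can start with something as simple as an agreement rate over tasks that both have graded. The sketch below assumes you already have parallel lists of pass/fail verdicts; the function name and the sample verdicts are made up for illustration.

```python
def agreement_rate(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of tasks where the model judge and the human gave the same verdict."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not judge_verdicts:
        raise ValueError("need at least one graded task")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


# Hypothetical verdicts on five tasks graded by both the judge and a human.
judge = [True, True, False, True, False]
human = [True, False, False, True, False]
rate = agreement_rate(judge, human)  # they disagree on one of the five tasks
```

If agreement is low, treat the judge's scores as suspect and read the disagreements before trusting the metric.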
```python
from statistics import mean

def grade(task, output):            # grade the OUTCOME, not the tool path
    return task.check(output)       # code / model-judge / human

def pass_hat_k(task, k=5):          # pass^k: ALL k runs must succeed
    runs = [grade(task, agent(task, env=fresh_env())) for _ in range(k)]
    return all(runs)                # consistency, not one lucky run

# agent, fresh_env and eval_set come from your own harness
score = mean(pass_hat_k(t) for t in eval_set)  # <- the number to beat
```
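To see why pass^k is the stricter number, it helps to work the arithmetic. The sketch below makes a simplifying assumption that is not from the source: every trial of a task succeeds independently with the same probability `p`. The function names and the values of `p` and `k` are illustrative only.

```python
def expected_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent trials succeed."""
    return p ** k


p, k = 0.8, 5  # hypothetical per-trial success rate and trial count
print(f"pass@{k} = {expected_pass_at_k(p, k):.3f}")
print(f"pass^{k} = {expected_pass_hat_k(p, k):.3f}")
```

Under that assumption, an agent that succeeds 80% of the time looks nearly perfect on pass@5 yet clears pass^5 only about a third of the time, which is the gap between "can do it" and "does it reliably".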
Read the failing transcripts: you won’t know what’s wrong until you do.[1] Then map each failure to the lever that owns it:
| What you see in the transcript | Reach for |
|---|---|
| Wrong route / over-autonomous where code would do | Lever 01 — control flow (L2) |
| Confused by noise; misses key facts late in long runs | Lever 02 — context (L3) |
| Spins, repeats, never stops; cost spikes | Lever 05 — stop & budget (L4) |
| One tool failure derails the whole run | Lever 06 — error recovery (L5) |
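Once each failing transcript has been labelled by hand, a quick tally shows which lever owns the most failures. This is a rough sketch: the symptom labels and the `LEVER_FOR_SYMPTOM` mapping are hypothetical shorthand for the rows of the table above, and the sample labels are invented.

```python
from collections import Counter

# Hypothetical shorthand labels for the four rows of the table above.
LEVER_FOR_SYMPTOM = {
    "wrong_route": "control flow",
    "lost_in_context": "context",
    "never_stops": "stop & budget",
    "tool_failure_derails": "error recovery",
}


def lever_to_pull(symptoms: list[str]) -> tuple[str, int]:
    """Return the lever that owns the most failures, with its failure count."""
    if not symptoms:
        raise ValueError("no failing transcripts were labelled")
    counts = Counter(LEVER_FOR_SYMPTOM[s] for s in symptoms)
    return counts.most_common(1)[0]


# Invented labels from reading four failing transcripts.
labels = ["never_stops", "lost_in_context", "never_stops", "tool_failure_derails"]
lever, count = lever_to_pull(labels)
```

The tally is only a starting point; a lever with fewer but costlier failures may still deserve to go first.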
Apply one lever’s fix — not five at once, or you won’t know what moved the number. Re-run the eval. If pass^k improved with no regressions, you’ve shipped a measurable reliability win. Then lock it in:
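The "improved with no regressions" test can be made mechanical by comparing per-task pass^k verdicts before and after the change. The sketch below assumes you have stored both runs as dictionaries from task id to verdict; the function names and sample data are hypothetical.

```python
def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List tasks that flipped between two eval runs, using only tasks present in both."""
    shared = baseline.keys() & candidate.keys()
    return {
        "fixed": sorted(t for t in shared if not baseline[t] and candidate[t]),
        "regressed": sorted(t for t in shared if baseline[t] and not candidate[t]),
    }


def is_win(diff: dict[str, list[str]]) -> bool:
    """A reliability win: at least one task fixed and none regressed."""
    return bool(diff["fixed"]) and not diff["regressed"]


# Invented per-task pass^k verdicts before and after applying one lever.
before = {"t1": True, "t2": False, "t3": False}
after = {"t1": True, "t2": True, "t3": False}
diff = compare(before, after)
```

Comparing per task, not only the mean score, keeps a fix on one task from hiding a break on another.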
After an agent is launched and optimized, capability evals with high pass rates can “graduate” to become a regression suite that is run continuously to catch any drift. — Anthropic, Demystifying Evals for AI Agents
That graduated regression suite is the quiet payoff of this whole course: it’s the feedback loop that keeps every future change to the loop honest.
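Graduation can be sketched as a simple filter over per-task pass rates. The source says only "high pass rates", so the `0.95` threshold below is an assumption to tune, and the function name and sample rates are hypothetical.

```python
def graduate(pass_rates: dict[str, float], threshold: float = 0.95) -> list[str]:
    """Pick the tasks stable enough to move into the regression suite.

    The default threshold is an assumed value, not one given by the source.
    """
    return sorted(t for t, rate in pass_rates.items() if rate >= threshold)


# Invented pass rates measured over repeated trials.
rates = {"t1": 1.0, "t2": 0.6, "t3": 0.97}
regression_suite = graduate(rates)
```

Tasks below the threshold stay in the capability set, where a failure signals work still to do and not drift.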
This is the lesson’s feedback loop — not a quiz, but the actual moves. Tick each as you do it on your own agent.
TradingAgents failures & pre-release checks.
Complete that checklist and you’ve hit MISSION.md’s headline goal: “ships a measurable reliability improvement to one of their own agents.” You didn’t just learn loop engineering; you practised it, on your own loop, with a number to prove it.
The whole course, interleaved. Retrieve from memory.
Anthropic — Demystifying Evals for AI Agents. The practical playbook: building the eval set, grader types, pass@k vs pass^k, isolated trials, and graduating capability evals into regression suites. Read it before you run Step 1. ~20 minutes.
TradingAgents. We’ll draft its first 20 eval tasks, pick graders, read the failing transcripts together, and decide which single lever to turn. This is the exercise the whole course was built for — start it in the chat whenever you’re ready.