Loop Engineering · Lesson 6 · Capstone — apply it

Ship a Reliability Win

The capstone, and the mission’s success criterion: take one lever, apply it to an agent you actually run, and measure that it got more reliable.

You now hold the whole map: the six-move turn, plus the four levers we went deep on, namely control flow (the workflow⇄agent dial and its five patterns, L2), context (L3), stop & budget (L4), and recovery & state (L5). Knowledge without a shipped change is just fluency. This lesson turns it into storage strength the only way that sticks: by doing the real thing on TradingAgents (or whichever of your agents is shakiest).

The trap this lesson exists to avoid: “I tweaked the prompt and it feels better” is not a reliability win; it’s a vibe. Reliability is a measured property. So the rule of this capstone: no change without a number that moved.

Step 1 — Measure before you touch (build a tiny eval)

You can’t improve what you don’t measure, and you can’t measure without an eval set. The good news from Anthropic: you start small.

Begin with the manual checks you run during development — the behaviors you verify before each release and common tasks end users try. … Start with 20–50 simple tasks drawn from real failures. — Anthropic, Demystifying Evals for AI Agents

Turn your existing manual spot-checks and past incidents into eval tasks. A good task is one where “two domain experts would independently reach the same pass/fail verdict”: unambiguous, and provably solvable.[1] Grade outcomes, not the path: checking that the agent took an exact sequence of tool calls is too rigid and penalises valid solutions.[1]
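As a rough sketch of what one such task could look like in code: the `EvalTask` record, its field names, the incident, and the 100-share limit below are all hypothetical illustrations, not part of TradingAgents. The point is only that the check looks at the final output and gives a verdict two reviewers would agree on.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalTask:
    """One eval task: a prompt plus an unambiguous outcome check (hypothetical shape)."""
    task_id: str
    prompt: str
    check: Callable[[dict], bool]   # inspects the final output only, never the tool path
    source: str = "incident"        # where the task came from: incident or manual check

def within_position_limit(output: dict) -> bool:
    # Pass only if the action is a known one and the size stays within
    # the limit (100 is an assumed value for this example).
    return output.get("action") in {"buy", "sell", "hold"} and output.get("size", 0) <= 100

task = EvalTask(
    task_id="incident-position-limit",
    prompt="Decide today's trade for ticker X with a max position of 100 shares.",
    check=within_position_limit,
)

print(task.check({"action": "buy", "size": 80}))    # a valid outcome
print(task.check({"action": "buy", "size": 500}))   # violates the limit
```

Note that nothing in the check cares which tools the agent called or in what order; any route to a valid order passes.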

Pick a grader that fits the task
Grader	Good for	Watch out
Code-based	Objective checks: a number, a JSON shape, a P&L sign	Brittle to valid variation
Model-as-judge	Open-ended output (a rationale, a summary)	Non-deterministic; calibrate against humans
Human	Gold-standard judgement calls	Slow and expensive
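To make the first row concrete, here is a minimal sketch of a code-based grader. The report format and the field names `ticker` and `pnl` are assumptions for illustration; the idea is to check the shape and the P&L sign while tolerating harmless variation such as key order or extra fields.

```python
import json

def grade_trade_report(raw: str) -> bool:
    """Code-based grader: pass if the output is a JSON object with the
    expected shape and a non-negative P&L. Field names are assumptions."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(report, dict):
        return False
    pnl = report.get("pnl")
    # Accept int or float (but not bool) so valid formatting variation still passes.
    if isinstance(pnl, bool) or not isinstance(pnl, (int, float)):
        return False
    return isinstance(report.get("ticker"), str) and pnl >= 0

print(grade_trade_report('{"ticker": "ABC", "pnl": 12.5}'))
print(grade_trade_report('{"pnl": 3, "ticker": "ABC", "note": "extra field"}'))
print(grade_trade_report('{"ticker": "ABC", "pnl": -1}'))
```

Checking for an exact string match instead would fail the second report even though it is a valid answer, which is the brittleness the table warns about.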

Reliability is consistency, not one lucky run. For a trading agent, one good run proves nothing. Measure pass^k, the probability that all k trials succeed, because customer-facing consistency is the requirement. (Use pass@k, success in at least one of k, only when a single success suffices.)[1] Run each trial from a clean, isolated environment so leftover state can’t cause correlated passes or failures.[1]
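A small worked example shows how far apart the two metrics can sit. The function names here are illustrative, and `expected_rates` assumes each trial passes independently with the same probability p, which real agent runs only approximate.

```python
def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: every one of the k trials succeeded."""
    return all(trials)

def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: at least one of the k trials succeeded."""
    return any(trials)

def expected_rates(p: float, k: int) -> tuple[float, float]:
    """Expected (pass^k, pass@k) if each trial passes independently with probability p."""
    return p ** k, 1 - (1 - p) ** k

trials = [True, True, False, True, True]       # 4 of 5 runs passed
print(pass_hat_k(trials), pass_at_k(trials))   # False True

hat, at = expected_rates(p=0.8, k=5)
print(round(hat, 4), round(at, 4))             # 0.3277 0.9997
```

An agent that succeeds 80% of the time per run looks near-perfect under pass@5 and passes all five runs only about a third of the time. For a trading agent, the second number is the honest one.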

A tiny eval harness — your baseline number, from k isolated runs

from statistics import mean

# agent, fresh_env and eval_set are yours to supply: the agent under test,
# a factory that returns a clean environment, and the list of eval tasks.

def grade(task, output):                 # grade the OUTCOME, not the tool path
    return bool(task.check(output))      # code / model-judge / human

def pass_hat_k(task, k=5):               # pass^k: ALL k runs must succeed
    runs = [grade(task, agent(task, env=fresh_env())) for _ in range(k)]
    return all(runs)                     # consistency, not one lucky run

score = mean(pass_hat_k(t) for t in eval_set)   # <- the number to beat
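The harness above leaves `agent`, `fresh_env`, and `eval_set` for you to fill in. Below is one possible self-contained variant, with a stub agent and three toy tasks standing in for the real thing (all names are illustrative). It passes the dependencies in as arguments and also returns per-task verdicts, so you know which transcripts to read next.

```python
from typing import Callable

def run_eval(tasks, agent: Callable, fresh_env: Callable, k: int = 5):
    """Return (score, per_task), where per_task maps task id -> pass^k verdict."""
    per_task = {}
    for task_id, prompt, check in tasks:
        # Each trial gets its own environment so state cannot leak between runs.
        verdicts = [bool(check(agent(prompt, env=fresh_env()))) for _ in range(k)]
        per_task[task_id] = all(verdicts)
    score = sum(per_task.values()) / len(per_task)   # fraction of tasks passing pass^k
    return score, per_task

# Stub agent and toy tasks, purely for illustration.
def stub_agent(prompt, env):
    env["calls"] = env.get("calls", 0) + 1           # would expose leaked state
    return {"answer": prompt.upper(), "calls": env["calls"]}

tasks = [
    ("echo-upper", "buy", lambda out: out["answer"] == "BUY"),
    ("clean-env", "hold", lambda out: out["calls"] == 1),
    ("always-fails", "sell", lambda out: out["answer"] == "sell"),
]

score, per_task = run_eval(tasks, stub_agent, fresh_env=dict, k=3)
print(score, per_task)
```

The `clean-env` task only passes because `fresh_env` hands out a new dict per trial; reuse one environment across trials and it starts failing from the second run.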

Step 2 — Diagnose which lever is weakest

Read the failing transcripts; you won’t know what’s wrong until you do.[1] Then map each failure to the lever that owns it:

Symptom → lever
What you see in the transcript	Reach for
Wrong route / over-autonomous where code would do	Lever 01 — control flow (L2)
Confused by noise; misses key facts late in long runs	Lever 02 — context (L3)
Spins, repeats, never stops; cost spikes	Lever 05 — stop & budget (L4)
One tool failure derails the whole run	Lever 06 — error recovery (L5)
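Once you have labelled each failing transcript by hand, a simple tally is enough to see which lever dominates. The symptom labels below are hypothetical shorthand for the rows of the table above.

```python
from collections import Counter

# Hypothetical labels you assign by hand while reading failing transcripts.
SYMPTOM_TO_LEVER = {
    "wrong_route": "control flow",
    "lost_in_noise": "context",
    "never_stops": "stop & budget",
    "tool_failure_derails": "error recovery",
}

def weakest_lever(failure_labels):
    """Return (lever, count) for the lever that owns the most failures.

    Expects at least one label; an empty list raises IndexError.
    """
    counts = Counter(SYMPTOM_TO_LEVER[label] for label in failure_labels)
    return counts.most_common(1)[0]

labels = ["never_stops", "lost_in_noise", "never_stops",
          "tool_failure_derails", "never_stops"]
print(weakest_lever(labels))   # ('stop & budget', 3)
```

The tally does not replace reading the transcripts; it only keeps you from picking the lever you happen to find most interesting.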

Step 3 — Change one lever, re-measure, graduate

Apply one lever’s fix — not five at once, or you won’t know what moved the number. Re-run the eval. If pass^k improved with no regressions, you’ve shipped a measurable reliability win. Then lock it in:

After an agent is launched and optimized, capability evals with high pass rates can “graduate” to become a regression suite that is run continuously to catch any drift. — Anthropic, Demystifying Evals for AI Agents

That graduated regression suite is the quiet payoff of this whole course: it’s the feedback loop that keeps every future change to the loop honest.
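A sketch of the re-measure and graduate step, assuming you kept per-task pass^k verdicts (task id mapped to True or False) from before and after the change. The function names and the graduation rule, every task that now passes, are assumptions for illustration; you may prefer a stricter bar.

```python
def compare(baseline: dict, candidate: dict) -> dict:
    """Compare per-task pass^k verdicts (task id -> bool) before and after a change."""
    regressions = sorted(t for t in baseline if baseline[t] and not candidate.get(t, False))
    fixes = sorted(t for t in baseline if not baseline[t] and candidate.get(t, False))
    before = sum(baseline.values()) / len(baseline)
    after = sum(candidate.get(t, False) for t in baseline) / len(baseline)
    # A win means the score rose AND nothing that used to pass now fails.
    shipped = after > before and not regressions
    return {"before": before, "after": after, "fixes": fixes,
            "regressions": regressions, "shipped": shipped}

def graduate(candidate: dict) -> list:
    """Tasks that now pass become the regression suite (assumed rule)."""
    return sorted(t for t, ok in candidate.items() if ok)

baseline = {"t1": True, "t2": False, "t3": False, "t4": True}
candidate = {"t1": True, "t2": True, "t3": False, "t4": True}

report = compare(baseline, candidate)
print(report["before"], report["after"], report["shipped"])   # 0.5 0.75 True
print(graduate(candidate))                                    # ['t1', 't2', 't4']
```

Tracking regressions per task matters because the aggregate score can rise while a previously passing task quietly breaks.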

Your real-world checklist

This is the lesson’s feedback loop — not a quiz, but the actual moves. Tick each as you do it on your own agent.

Collect 20–50 tasks from real TradingAgents failures & pre-release checks.
Grade outcomes, not tool-call paths; pick code / model / human per task.
Baseline pass^k from clean, isolated runs — write the number down.
Read the failing transcripts; map the dominant failure to one lever.
Change exactly one lever’s knob (context, budget, recovery, or control flow).
Re-measure pass^k; confirm it rose with no regressions.
Graduate the passing tasks into a continuous regression suite.

Your win — and the mission

Complete that checklist and you’ve hit MISSION.md’s headline goal: “ships a measurable reliability improvement to one of their own agents.” You didn’t just learn loop engineering — you practised it, on your own loop, with a number to prove it.

Primary source — read this next

Anthropic — Demystifying Evals for AI Agents. The practical playbook: building the eval set, grader types, pass@k vs pass^k, isolated trials, and graduating capability evals into regression suites. Read it before you run Step 1. ~20 minutes.

I’m your teacher — use me, for real this time. Bring me TradingAgents. We’ll draft its first 20 eval tasks, pick graders, read the failing transcripts together, and decide which single lever to turn. This is the exercise the whole course was built for — start it in the chat whenever you’re ready.

Sources

1. Anthropic, Demystifying Evals for AI Agents. Start with 20–50 tasks from real failures; two-expert agreement; grade outcomes not paths; code / model / human graders; pass@k vs pass^k; isolated trials; read transcripts; graduate capability evals into regression suites.
2. HumanLayer, 12-Factor Agents. Small, focused agents as the unit you evaluate and improve.