Loop Engineering · Lesson 6 · Capstone: apply it

Ship a Reliability Win

The capstone, and the mission's success criterion: take one lever, apply it to an agent you actually run, and measure that it became more reliable.

You now have the whole map: the six-move turn, the workflow⇄agent dial and its five patterns (L2), and the four levers we went deep on: context (L3), stop & budget (L4), recovery & state (L5). Knowledge without a shipped change is just fluency. This lesson converts it into storage strength the one way that sticks: by doing real work on TradingAgents (or on whichever of your agents is shakiest).

The trap this lesson exists to prevent: “I tweaked the prompt and it feels better now” is not a reliability win; it is a vibe. Reliability is a measured property. So the rule of this capstone: no change without a number that moved.

Step 1: Measure before you touch anything (build a tiny eval)

You can't improve what you don't measure, and you can't measure without an eval set. The good news from Anthropic: you start small.

“Begin with the manual checks you run during development: the behaviors you verify before each release and common tasks end users try. … Start with 20–50 simple tasks drawn from real failures.” (Anthropic, Demystifying Evals for AI Agents)

Turn your existing manual spot-checks and past incidents into eval tasks. A good task is one where “two domain experts would independently reach the same pass/fail verdict”: unambiguous and provably solvable.[1] Grade outcomes, not the path: checking that the agent took an exact sequence of tool calls is too rigid and penalises valid solutions.[1]
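
A minimal sketch of what an outcome-graded task might look like. The `EvalTask` record, the incident, and the output fields are all hypothetical; none of them come from TradingAgents or the Anthropic article. The check reads only the final output, so runs that took different tool paths to the same decision both pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    """One eval task: a prompt plus an unambiguous pass/fail check (hypothetical shape)."""
    name: str
    prompt: str
    check: Callable[[dict], bool]  # receives the agent's final output only


def no_trade_when_market_closed(output: dict) -> bool:
    # Outcome check: the final decision must be HOLD with zero quantity.
    # The tool calls the agent made along the way are deliberately ignored.
    return output.get("action") == "HOLD" and output.get("quantity") == 0


# Illustrative task drawn from an imagined past incident.
task = EvalTask(
    name="market-closed-hold",
    prompt="The market is closed today. Decide what to do with AAPL.",
    check=no_trade_when_market_closed,
)

# Invented outputs: two different tool paths reach the same outcome, one run fails.
run_a = {"action": "HOLD", "quantity": 0, "tool_calls": ["get_calendar"]}
run_b = {"action": "HOLD", "quantity": 0, "tool_calls": ["get_quote", "get_calendar"]}
run_c = {"action": "BUY", "quantity": 10, "tool_calls": ["get_quote"]}

results = [task.check(run) for run in (run_a, run_b, run_c)]
print(results)
```

A path-based grader that demanded `["get_calendar"]` exactly would have failed `run_b` even though its decision is correct.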

Choose the grader to fit the task

| Grader | Good for | Watch out for |
|---|---|---|
| Code-based | Objective checks: a number, a JSON shape, a P&L sign | Brittle in the face of valid variation |
| Model-as-judge | Open-ended output (a rationale, a summary) | Non-deterministic; calibrate against humans |
| Human | Gold-standard judgement calls | Slow and expensive |

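
One way to calibrate a model judge against humans is to measure how often the two agree on a small shared sample, then read the disagreements by hand. This is a sketch under assumptions: the verdicts are booleans keyed by task ID, and the labels below are invented for illustration.

```python
def agreement_rate(judge: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of shared tasks where the model judge matches the human verdict."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no tasks labelled by both judge and human")
    return sum(judge[t] == human[t] for t in shared) / len(shared)


def disagreements(judge: dict[str, bool], human: dict[str, bool]) -> list[str]:
    """Task IDs worth reading by hand: the judge and the human reached different verdicts."""
    return sorted(t for t in judge.keys() & human.keys() if judge[t] != human[t])


# Invented labels for four tasks.
human_labels = {"t1": True, "t2": False, "t3": True, "t4": True}
judge_labels = {"t1": True, "t2": True, "t3": True, "t4": True}

rate = agreement_rate(judge_labels, human_labels)
to_review = disagreements(judge_labels, human_labels)
print(rate, to_review)
```

Here the judge is too lenient on `t2`; that is the kind of case to use when tightening the judge's rubric.
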
Reliability is consistency, not one lucky run. For a trading agent, one good run proves nothing. Measure pass^k, the probability that all k trials succeed, because customer-facing consistency is the requirement. (Use pass@k, success in at least one of k trials, only when a single success is enough.)[1] Run every trial from a clean, isolated environment so that leftover state cannot introduce correlated passes or failures.[1]
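
To see how far the two metrics can diverge, assume each trial succeeds independently with the same probability p. This is a simplifying assumption; real trials can be correlated. Under it, pass^k = p^k and pass@k = 1 - (1 - p)^k. The function names below are illustrative.

```python
def pass_hat_k_prob(p: float, k: int) -> float:
    """Probability that ALL k independent trials succeed."""
    return p ** k


def pass_at_k_prob(p: float, k: int) -> float:
    """Probability that AT LEAST ONE of k independent trials succeeds."""
    return 1 - (1 - p) ** k


p = 0.8  # assumed per-trial success rate, for illustration only
for k in (1, 3, 5):
    print(k, round(pass_hat_k_prob(p, k), 3), round(pass_at_k_prob(p, k), 3))
```

With p = 0.8 and k = 5, pass^k is roughly 0.33 while pass@k is above 0.99: the same agent looks excellent on one metric and unreliable on the other.
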
A tiny eval harness: your baseline number, from k isolated runs
from statistics import mean              # agent, fresh_env, eval_set come from your own setup

def grade(task, output):                 # grade the OUTCOME, not the tool path
    return task.check(output)            # code / model-judge / human

def pass_hat_k(task, k=5):               # pass^k: ALL k runs must succeed
    runs = [grade(task, agent(task, env=fresh_env())) for _ in range(k)]
    return all(runs)                     # consistency, not one lucky run

score = mean(pass_hat_k(t) for t in eval_set)   # <- the number to beat

Step 2: Diagnose which lever is weakest

Read the failing transcripts; until you read them, you won't know what is going wrong.[1] Then map each failure to the lever that owns it:

Symptom → lever

| What you see in the transcript | Reach for |
|---|---|
| Wrong route / over-autonomous where code would have been enough | Lever 01: control flow (L2) |
| Confused by noise; misses key facts that arrive late in long runs | Lever 02: context (L3) |
| Spins, repeats, never stops; cost spikes | Lever 05: stop & budget (L4) |
| One tool failure derails the whole run | Lever 06: error recovery (L5) |
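
Once each failing transcript is tagged with a symptom, a simple tally shows which lever owns the most failures. This is a sketch: the symptom tags are hypothetical labels you would assign by hand while reading, and the run IDs are invented. The lever names follow the table above.

```python
from collections import Counter

# Hypothetical symptom tags mapped to the lever that owns them.
SYMPTOM_TO_LEVER = {
    "wrong_route": "control flow",
    "over_autonomous": "control flow",
    "confused_by_noise": "context",
    "missed_late_fact": "context",
    "spins_or_repeats": "stop & budget",
    "cost_spike": "stop & budget",
    "tool_failure_derails": "error recovery",
}


def rank_levers(tagged_failures: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Count failures per lever, most frequent first."""
    counts = Counter(SYMPTOM_TO_LEVER[symptom] for _, symptom in tagged_failures)
    return counts.most_common()


# Invented (run_id, symptom) pairs from reading failing transcripts.
failures = [
    ("run-03", "spins_or_repeats"),
    ("run-07", "missed_late_fact"),
    ("run-11", "cost_spike"),
    ("run-12", "tool_failure_derails"),
]

ranking = rank_levers(failures)
print(ranking)
```

The top entry is the candidate for Step 3. An unknown tag raises `KeyError`, which is deliberate: a symptom that fits no lever deserves a closer read before it is counted.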

Step 3: Change one lever, re-measure, graduate

Apply the fix for one lever, not five at once; otherwise you won't know which one moved the number. Re-run the eval. If pass^k improved without any regression, you have shipped a measurable reliability win. Then lock it in:
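
The "improved without any regression" test can be made mechanical by comparing per-task pass^k verdicts from before and after the change. This is a sketch, assuming both runs cover the same eval set; the verdicts below are invented.

```python
def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-task pass^k verdicts before and after a single-lever change."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty eval set")
    tasks = sorted(baseline)
    before = sum(baseline[t] for t in tasks) / len(tasks)
    after = sum(candidate[t] for t in tasks) / len(tasks)
    fixed = [t for t in tasks if candidate[t] and not baseline[t]]
    regressed = [t for t in tasks if baseline[t] and not candidate[t]]
    return {
        "before": before,
        "after": after,
        "fixed": fixed,
        "regressed": regressed,
        # Ship only if the score went up AND nothing that used to pass now fails.
        "ship": after > before and not regressed,
    }


# Invented verdicts: True means the task passed all k isolated runs.
baseline = {"t1": True, "t2": False, "t3": False, "t4": True}
candidate = {"t1": True, "t2": True, "t3": False, "t4": True}

report = compare(baseline, candidate)
print(report)
```

Checking regressions per task matters because an aggregate score can rise while a previously passing task quietly breaks.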

“After an agent is launched and optimized, capability evals with high pass rates can ‘graduate’ to become a regression suite that is run continuously to catch any drift.” (Anthropic, Demystifying Evals for AI Agents)

That graduated regression suite is the quiet payoff of this whole course: it is the feedback loop that keeps every future change to the loop honest.
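
Graduation could be as simple as a filter over recent pass rates, followed by a drift check on every later run. This is a sketch: the 0.95 threshold, the task names, and the pass rates are assumptions made for illustration; the source does not prescribe a cutoff.

```python
def graduate(pass_rates: dict[str, float], threshold: float = 0.95) -> tuple[list[str], list[str]]:
    """Split tasks into a regression suite (high pass rate) and remaining capability evals."""
    suite = sorted(t for t, rate in pass_rates.items() if rate >= threshold)
    capability = sorted(t for t, rate in pass_rates.items() if rate < threshold)
    return suite, capability


def drifted(suite: list[str], latest: dict[str, float], threshold: float = 0.95) -> list[str]:
    """Graduated tasks whose latest pass rate fell below the threshold (missing counts as 0)."""
    return [t for t in suite if latest.get(t, 0.0) < threshold]


# Invented pass rates after the Step 3 change shipped.
pass_rates = {"market-closed-hold": 1.0, "stale-quote": 0.97, "earnings-summary": 0.6}
suite, capability = graduate(pass_rates)

# Invented pass rates from a later run, after some unrelated change to the loop.
later_run = {"market-closed-hold": 1.0, "stale-quote": 0.8}
alerts = drifted(suite, later_run)
print(suite, capability, alerts)
```

Tasks that stay below the threshold remain capability evals: they are still targets to improve, not yet guards against drift.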

Your real-world checklist

This is the lesson's feedback loop: not a quiz, but actual moves. Tick each one off as you carry it out on your own agent.

Your win, and the mission

Complete that checklist and you have hit the headline goal of MISSION.md: “ships a measurable reliability improvement to one of their own agents.” You didn't just learn loop engineering; you practised it, on your own loop, with a number that proves it.

Recall check

The whole course, interleaved. Retrieve it from memory.

Primary source: read this next

Anthropic, Demystifying Evals for AI Agents. A practical playbook: building an eval set, grader types, pass@k vs pass^k, isolated trials, and graduating capability evals into regression suites. Read it before running Step 1. ~20 minutes.

I'm your teacher: use me, for real this time. Bring me TradingAgents. We'll draft its first 20 eval tasks, pick graders, read the failing transcripts together, and decide which single lever to turn. This is the exercise the whole course was built for; when you're ready, start in the chat.

Sources

  1. Anthropic. Demystifying Evals for AI Agents. Start with 20–50 tasks from real failures; two-expert agreement; grade outcomes not paths; code / model / human graders; pass@k vs pass^k; isolated trials; read transcripts; graduate capability evals into regression suites.
  2. HumanLayer. 12-Factor Agents. Small, focused agents as the unit you evaluate and improve.