The capstone, and the mission's success criterion: take one lever, apply it to an agent you actually run, and measure that it became more reliable.
You now have the whole map: the six-move turn, and the four levers we went deep on: control flow, via the workflow-to-agent dial and its five patterns (L2); context (L3); stop & budget (L4); and recovery & state (L5). Knowledge without a shipped change is only fluency. This lesson turns it into storage strength in the one way that sticks: by doing real work on TradingAgents (or on whichever of your agents is shakiest).
You can't improve what you don't measure, and you can't measure without an eval set. The good news from Anthropic: you start small.
"Begin with the manual checks you run during development: the behaviors you verify before each release and common tasks end users try. … Start with 20-50 simple tasks drawn from real failures." (Anthropic, Demystifying Evals for AI Agents)
Turn your existing manual spot-checks and past incidents into eval tasks. A good task is one where "two domain experts would independently reach the same pass/fail verdict": unambiguous, and provably solvable.[1] Grade outcomes, not the path: checking that the agent took one exact sequence of tool calls is too rigid and penalises valid solutions.[1]
| Grader | Good for | Watch out for |
|---|---|---|
| Code-based | Objective checks: a number, a JSON shape, the sign of a P&L | Brittle in the face of valid variation |
| Model-as-judge | Open-ended output (a rationale, a summary) | Non-deterministic; calibrate against humans |
| Human | Gold-standard judgement calls | Slow and expensive |
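To make the first row of the table concrete, here is a minimal sketch of a code-based grader. It assumes a hypothetical output format (a JSON object with `action`, `rationale`, and `pnl` fields); these names are illustrative and are not taken from TradingAgents itself.

```python
import json

# Hypothetical schema for the agent's final answer (an assumption for illustration).
REQUIRED_FIELDS = {"action": str, "rationale": str, "pnl": (int, float)}


def check_trade_output(raw_output: str, expect_positive_pnl: bool) -> bool:
    """Code-based outcome check: valid JSON, expected shape, expected P&L sign."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        # bool is a subclass of int in Python, so reject booleans outright.
        if field not in data or isinstance(data[field], bool):
            return False
        if not isinstance(data[field], expected_type):
            return False
    # Only the sign of the P&L is graded, not how the agent got there.
    return (data["pnl"] > 0) == expect_positive_pnl
```

The check looks only at the final output, which keeps it an outcome grade. It is also an example of the brittleness the table warns about: an agent that returns a correct answer under a different field name would fail.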

    from statistics import mean

    def grade(task, output):  # grade the OUTCOME, not the tool path
        return task.check(output)  # code-based / model-judge / human

    def pass_hat_k(task, k=5):  # pass^k: ALL k runs must succeed
        # agent() and fresh_env() stand in for your own harness
        runs = [grade(task, agent(task, env=fresh_env())) for _ in range(k)]
        return all(runs)  # consistency, not one lucky run

    score = mean(pass_hat_k(t) for t in eval_set)  # <- the number to beat
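To see why pass^k is a stricter number than pass@k, the sketch below runs both against a simulated agent. The flaky agent, its 80% per-run success rate, and the task names are all assumptions made up for the demonstration; nothing here calls a real agent.

```python
import random
from statistics import mean


def make_flaky_agent(success_rate: float, rng: random.Random):
    """Return a stand-in agent that succeeds with the given probability per run."""
    def agent(task: str) -> bool:
        return rng.random() < success_rate
    return agent


def pass_at_k(agent, task: str, k: int) -> bool:
    """pass@k: at least one of k independent runs succeeds."""
    # Build the full list first so all k runs actually execute.
    return any([agent(task) for _ in range(k)])


def pass_hat_k(agent, task: str, k: int) -> bool:
    """pass^k: every one of k independent runs succeeds."""
    return all([agent(task) for _ in range(k)])


rng = random.Random(0)  # fixed seed so the simulation is repeatable
agent = make_flaky_agent(0.8, rng)
tasks = [f"task-{i}" for i in range(200)]  # hypothetical task names

at_k = mean(pass_at_k(agent, t, k=5) for t in tasks)
hat_k = mean(pass_hat_k(agent, t, k=5) for t in tasks)
print(f"pass@5 = {at_k:.2f}, pass^5 = {hat_k:.2f}")
```

If runs are independent with per-run success rate p, pass^k tends toward p^k and pass@k toward 1 - (1 - p)^k. An agent that succeeds 80% of the time therefore looks nearly perfect on pass@5 but clears pass^5 on only about a third of tasks, which is the gap between one lucky run and consistency.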
Read the failing transcripts; until you read them, you won't know what is going wrong.[1] Then map each failure to the lever that owns it:
| What you see in the transcript | Reach for |
|---|---|
| Wrong route, or over-autonomous where plain code would have been enough | Lever 01: control flow (L2) |
| Confused by noise; misses key facts that arrive late in long runs | Lever 02: context (L3) |
| Spins, repeats itself, never stops; cost spikes | Lever 05: stop & budget (L4) |
| One tool failure derails the whole run | Lever 06: error recovery (L5) |
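One way to turn the table above into a decision is to tag each failing transcript with a symptom while you read it, then count failures per lever. The sketch below assumes hypothetical symptom tags and task IDs; only the lever names come from the table.

```python
from collections import Counter

# Hypothetical symptom tags (assigned by hand while reading transcripts),
# mapped to the levers named in the table above.
SYMPTOM_TO_LEVER = {
    "wrong_route": "control flow (L2)",
    "over_autonomous": "control flow (L2)",
    "confused_by_noise": "context (L3)",
    "missed_late_fact": "context (L3)",
    "looping": "stop & budget (L4)",
    "cost_spike": "stop & budget (L4)",
    "tool_failure_derailed_run": "error recovery (L5)",
}


def tally_levers(tagged_failures):
    """Count failures per lever; unknown tags are grouped as 'untriaged'."""
    return Counter(
        SYMPTOM_TO_LEVER.get(tag, "untriaged") for _task_id, tag in tagged_failures
    )


# Made-up sample data: (task id, symptom tag) pairs.
failures = [
    ("task-03", "looping"),
    ("task-07", "cost_spike"),
    ("task-11", "missed_late_fact"),
    ("task-14", "looping"),
]
counts = tally_levers(failures)
lever, n = counts.most_common(1)[0]
print(f"Start with: {lever} ({n} of {len(failures)} failures)")
```

In this sample, three of the four failures point at stop & budget, so that is the single lever to turn first. A large "untriaged" count is a signal to read more transcripts before choosing.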
Apply the fix for one lever, not five at once; otherwise you won't know which change moved the number. Re-run the eval. If pass^k improved with no regressions, you have shipped a measurable reliability win. Then lock it in:
"After an agent is launched and optimized, capability evals with high pass rates can 'graduate' to become a regression suite that is run continuously to catch any drift." (Anthropic, Demystifying Evals for AI Agents)
That graduated regression suite is the quiet payoff of this whole course: it is the feedback loop that keeps every future change to the loop honest.
This is the lesson's feedback loop: not a quiz, but actual moves. Tick each one off as you carry it out on your own agent.
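The "improved with no regressions" check and the graduation step can both be written as small comparisons over per-task results. This is a sketch under assumptions: the task IDs are made up, and the 0.95 graduation threshold is an illustrative choice, since the source only says "high pass rates".

```python
def find_regressions(before: dict, after: dict) -> list:
    """Tasks that passed pass^k before the change but fail it afterwards."""
    return sorted(t for t, ok in before.items() if ok and not after.get(t, False))


def find_fixes(before: dict, after: dict) -> list:
    """Tasks that failed pass^k before the change but pass it afterwards."""
    return sorted(t for t, ok in after.items() if ok and not before.get(t, False))


def graduates(pass_rates: dict, threshold: float = 0.95) -> list:
    """Tasks whose pass rate is high enough to join the regression suite.

    The threshold is an assumed value, not one given by the source.
    """
    return sorted(t for t, rate in pass_rates.items() if rate >= threshold)


# Made-up per-task pass^k results before and after a single-lever change.
before = {"task-01": True, "task-02": False, "task-03": True}
after = {"task-01": True, "task-02": True, "task-03": False}

regressions = find_regressions(before, after)
fixes = find_fixes(before, after)
ship = bool(fixes) and not regressions
print(f"fixed={fixes} regressed={regressions} ship={ship}")
```

Here the change fixes one task and breaks another, so the aggregate score is unchanged and the per-task comparison is what reveals the regression. Comparing task by task, instead of comparing only the mean, is the design choice that makes "no regressions" checkable.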
… from TradingAgents failures & pre-release checks.
Complete that checklist and you have hit the headline goal in MISSION.md: "ships a measurable reliability improvement to one of their own agents." You did not just learn loop engineering; you practised it, on your own loop, with a number that proves it.
The whole course, interleaved. Retrieve from memory.
[1] Anthropic, Demystifying Evals for AI Agents. A practical playbook: building an eval set, grader types, pass@k vs pass^k, isolated trials, and graduating capability evals into regression suites. Read it before you run Step 1. ~20 minutes.
Bring TradingAgents. We will draft its first 20 eval tasks, choose graders, read failing transcripts together, and decide which single lever to turn. This is the exercise the whole course was built for; when you are ready, start in the chat.