Loop Engineering · Lesson 5 · Levers 06 & 07: Recovery & state

Error Recovery & Compaction

Levers 06 and 07: failures are normal. The question is whether your loop reads them, learns from them, and resumes, or drowns in them.

By now you can shape the loop (Lever 01), feed it well (Lever 02), and stop it safely (Lever 05). This lesson is about what happens when a tool call fails, which in any real agent happens constantly. Error recovery is the skill that separates an agent that corrects itself from one that spirals.

Errors are feedback: fold them back in

Move 4 of the turn captured results "and errors". That "and" is the whole game. A failed tool call is ground truth, just like a successful one: it tells the model what did not work so the next turn can adjust. The reliability move is to append the error to the context as an observation and let the loop take another swing.[1] Swallow the error and the model goes blind; the loop falls back to guessing.
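A minimal sketch of folding a failure back in, assuming the context is a plain list of message dicts. The names `run_tool` and `flaky_price_lookup` and the message shape are hypothetical, chosen for illustration rather than taken from any particular framework.

```python
def run_tool(tool, args, context):
    """Call a tool and append the outcome to context, whether it succeeded or failed."""
    try:
        result = tool(**args)
        context.append({"role": "tool", "ok": True, "content": str(result)})
    except Exception as e:
        # The failure is an observation too: record it instead of swallowing it.
        context.append({"role": "tool", "ok": False,
                        "content": f"{type(e).__name__}: {e}"})
    return context


def flaky_price_lookup(ticker):
    # Hypothetical tool that always fails, to exercise the failure path.
    raise TimeoutError(f"pricing API timed out for {ticker}")


context = run_tool(flaky_price_lookup, {"ticker": "ACME"}, [])
print(context[-1]["content"])
```

The point of the design is that success and failure travel the same path: both end up as one entry the next turn can read.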

But compact it, don't dump it

Here is the trap. The naive fix, pasting the whole stack trace back in, poisons the very lever you just learned. Raw errors are enormous and low-signal, and they accumulate: every retry piles on another wall of text until the window rots (Lesson 3).

Anti-pattern: the error dump. Feeding complete stack traces and raw logs back in on every failure fills the context with low-signal tokens. The loop spends its attention budget reading noise instead of reasoning, and it often loops even more as it grows more confused.
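To make the cost concrete, the sketch below measures how many characters three retries add to the context when the raw traceback is pasted each time, compared with a one-line summary. The failing function is a toy stand-in (an assumption for illustration); tracebacks from a real agent stack are usually far longer, so the gap here is an understatement.

```python
import traceback


def failing_call():
    # Hypothetical tool call that always times out.
    raise ConnectionError("timeout after 30s calling pricing API")


def raw_dump():
    """Return the full formatted traceback, as a naive loop would paste it."""
    try:
        failing_call()
    except ConnectionError:
        return traceback.format_exc()


def one_line(e):
    """Return only the exception type and message."""
    return f"{type(e).__name__}: {e}"


retries = 3
dump_chars = sum(len(raw_dump()) for _ in range(retries))
line_chars = retries * len(one_line(ConnectionError("timeout after 30s calling pricing API")))
print(f"dumped: {dump_chars} chars, one-line: {line_chars} chars")
```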

The discipline is error compaction: distill the failure down to its essential, actionable detail before it re-enters the window.[1] "ConnectionError: timeout after 30s calling pricing API; retry 2/3" carries signal; an 80-line traceback does not.

Rather than passing full stack traces, errors are distilled into essential details, keeping context window usage efficient while preserving debugging capability. (HumanLayer, 12-Factor Agents, Factor 9)
compact_error: distill the failure down to the one line the loop can act on
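# ✗ the dump: 80 lines of traceback, re-pasted on every retry
# Traceback (most recent call last): File "agent.py", line 412 ...

# ✓ the compaction: signal only
def compact_error(e, attempt, max_attempts=3):
    # Keep the first line of the message, capped, so one failure stays one short observation
    lines = str(e).splitlines()
    summary = lines[0][:120] if lines else ""
    return f"{type(e).__name__}: {summary} | retry {attempt}/{max_attempts}"
# compact_error(ConnectionError("pricing API timeout 30s"), 2)
# -> "ConnectionError: pricing API timeout 30s | retry 2/3"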
The pattern in one line: catch → compact → append → retry, all inside the step budget. Errors become small observations the loop can act on; Lever 05's guard-stop (e.g. "same error 3×") keeps a doomed retry from turning into a runaway.
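The four steps can be sketched as a single bounded function. This is an illustration under stated assumptions, not a reference implementation: `call_with_recovery`, `pricing_api`, the string-based context, and both limits are hypothetical. A real loop would also let the model choose its next action after each appended observation; here the retry simply calls the tool again.

```python
def compact_error(e, attempt, max_attempts):
    # Keep only the exception type, the first line of its message, and the retry count.
    lines = str(e).splitlines()
    first_line = lines[0] if lines else ""
    return f"{type(e).__name__}: {first_line} | retry {attempt}/{max_attempts}"


def call_with_recovery(tool, context, max_attempts=3, same_error_limit=3):
    """Catch -> compact -> append -> retry, bounded by a step budget and a guard-stop."""
    last_error, repeats = None, 0
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
            context.append(f"OK: {result}")
            return result
        except Exception as e:
            signature = f"{type(e).__name__}: {e}"
            # Count consecutive identical failures for the guard-stop.
            repeats = repeats + 1 if signature == last_error else 1
            last_error = signature
            context.append(compact_error(e, attempt, max_attempts))
            if repeats >= same_error_limit:
                context.append("STOP: same error repeated, giving up")
                return None
    context.append("STOP: step budget exhausted")
    return None


calls = {"n": 0}


def pricing_api():
    # Hypothetical tool: times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pricing API timeout 30s")
    return 101.5


context = []
price = call_with_recovery(pricing_api, context)
print(context)
```

Each failure costs the context one short line, and the loop cannot run past either limit.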

State & ownership: make resume cheap (Lever 07)

Recovery does not only happen inside the loop. Sometimes the right move is to pause (for a human, an approval, a fix) and resume later without losing the thread. That requires owning your state. The 12-Factor recipe:

FACTOR 12

Stateless reducer

Treat a turn as a pure function: (state, input) → new state. No hidden side state means retries and replays are deterministic and observable.[1]

FACTOR 6

Launch / pause / resume

If the state lives in a serialisable object, you can checkpoint, pause for a human or a failure, and resume from that exact point with no data loss.[1]

FACTOR 5

Unify execution & business state

Keep the agent's step history and the real-world outcome in one place so they cannot drift apart. That is the basis of trustworthy recovery.[1]

The turn as a stateless reducer: pure, serialisable, replayable
def turn(state, inp):             # (state, input) -> new state; no hidden side state
    ...
    return new_state             # serialisable -> checkpoint, pause, resume, replay

# resume after human approval: nothing lost
state = load(checkpoint_id)
state = turn(state, human_decision)
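A runnable version of the same idea, as a sketch under stated assumptions: state is a plain dict, checkpoints are JSON files, and `save`, `load`, the `order_status` field, and the input strings are all hypothetical. Step history and business outcome sit in the same object, so a checkpoint captures both at once.

```python
import json
import os
import tempfile


def turn(state, inp):
    """Pure reducer: builds a new state dict and never mutates the old one."""
    new_state = {
        "history": state["history"] + [inp],    # execution state: what the agent did
        "order_status": state["order_status"],  # business state: the real-world outcome
    }
    if inp == "approve":
        new_state["order_status"] = "approved"
    return new_state


def save(state, path):
    with open(path, "w") as f:
        json.dump(state, f)


def load(path):
    with open(path) as f:
        return json.load(f)


state = {"history": [], "order_status": "pending"}
state = turn(state, "draft_order")

# Pause for a human: everything needed to resume is in one JSON file.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save(state, path)

# Later, possibly in another process: load and continue from the same point.
resumed = load(path)
resumed = turn(resumed, "approve")
print(resumed)
```

Because `turn` returns a fresh dict, the checkpointed state is untouched by the resumed run and can be reloaded again if the approval step itself fails.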

Putting it together: because every turn is a pure reduction over owned, serialisable state, a failure is never fatal. You can retry the turn, hand off and resume, or replay the whole run in an eval. This is what "the loop corrects itself" actually rests on.
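Replay falls out of the reducer shape: folding the logged inputs over the initial state reproduces the final state. The sketch below uses a toy reducer and made-up inputs as assumptions. In a real agent the model's outputs are not deterministic, so replay only works this way if those outputs are logged and fed back in as inputs.

```python
from functools import reduce


def turn(state, inp):
    # Hypothetical pure reducer: a running total plus the list of inputs seen.
    return {"total": state["total"] + inp, "seen": state["seen"] + [inp]}


initial = {"total": 0, "seen": []}
logged_inputs = [5, -2, 10]

# The live run: one turn at a time.
live = initial
for inp in logged_inputs:
    live = turn(live, inp)

# The replay: the same reducer folded over the same log.
replayed = reduce(turn, logged_inputs, initial)
print(replayed == live)
```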

Your win today

You can now turn failures into fuel: append errors as ground truth, but compacted, not dumped; bound retries with a guard-stop; and own your state as a stateless reducer so any run can be paused, resumed, or replayed. Levers 06 and 07, working together.

Recall check

Retrieve from memory. (The questions interleave Lessons 3 and 4.)

Primary source: read this next

HumanLayer, 12-Factor Agents. Read Factor 9 (compact errors), then Factors 5, 6, and 12 on state. It is the clearest production-grade account of recovery and ownership. Pair it with the context-engineering piece [2] to see how compaction protects the window.

I am your teacher: use me. Bring me a real failure from your logs and we will write its compact form together. Want to check whether your TradingAgents loop can pause and resume cleanly, or whether its execution and business state can drift apart? Ask in the chat.

Sources

  1. HumanLayer, 12-Factor Agents. Factor 9: compact errors into the context window; Factor 12: stateless reducer; Factor 6: launch/pause/resume; Factor 5: unify execution & business state.
  2. Anthropic, Effective Context Engineering. Compaction and tool-result clearing to protect the window.