Loop Engineering · Lesson 5 · Levers 06 & 07 — Recovery & state

Error Recovery & Compaction

Levers 06 and 07: failures are normal. The question is whether your loop reads them, learns, and resumes — or drowns in them.

By now you can shape the loop (Lever 01), feed it well (Lever 02), and stop it safely (Lever 05). This lesson is about what happens when a tool call fails, which in any real agent happens constantly. The skill of error recovery is what separates an agent that self-corrects from one that spirals.

Errors are feedback — fold them back in

Move 4 of the turn captured results “and errors.” That “and” is the whole game. A failed tool call is ground truth, same as a successful one: it tells the model what didn’t work so the next turn can adjust. The reliability move is to append the error to context as an observation and let the loop take another swing [1]. Swallow the error and the model is blind; the loop is back to guessing.
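A minimal sketch of folding an error back in. The message-list context, the `observe` helper and the `lookup_price` tool are all hypothetical, invented here for illustration. The point is only that a failure lands in context as an observation next to the successes instead of being swallowed.

```python
# Hypothetical tool and context shape, for illustration only.
def lookup_price(symbol):
    if symbol != "ACME":
        raise ValueError(f"unknown symbol {symbol}")
    return 42.0

def observe(context, tool, *args):
    """Run a tool and append its result OR its error to the context."""
    try:
        result = tool(*args)
        context.append({"role": "tool", "ok": True, "content": str(result)})
    except Exception as e:  # the failure is ground truth too
        context.append({"role": "tool", "ok": False,
                        "content": f"{type(e).__name__}: {e}"})
    return context

context = []
observe(context, lookup_price, "ACME")   # success is recorded
observe(context, lookup_price, "ZZZZ")   # failure is recorded, not raised
```

After both calls the context holds two observations, and the second tells the next turn exactly what did not work.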

But compact it — don’t dump it

Here’s the trap. The naïve fix — paste the full stack trace back in — poisons the very lever you just learned. Raw errors are enormous, low-signal, and they accumulate: every retry stacks another wall of text until the window rots (Lesson 3).

Anti-pattern: the error dump. Feeding back complete stack traces and raw logs on every failure floods the context with low-signal tokens. The loop spends its attention budget reading noise instead of reasoning, and often loops harder as it gets more confused.

The discipline is error compaction: distil the failure to its essential, actionable detail before it re-enters the window [1]. “ConnectionError: pricing API timeout 30s | retry 2/3” carries the signal; the 80-line traceback does not.

Rather than passing full stack traces, errors are distilled into essential details, keeping context window usage efficient while preserving debugging capability. — HumanLayer, 12-Factor Agents, Factor 9

compact_error — distil the failure to the one line the loop can act on

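# ✗ the dump: 80 lines of traceback, re-pasted on every retry
# Traceback (most recent call last): File "agent.py", line 412 ...

# ✓ the compaction: signal only
def compact_error(e, attempt, max_attempts=3):
    # keep only the first line of the message, truncated to 80 characters
    first_line = (str(e).splitlines() or [""])[0][:80]
    return f"{type(e).__name__}: {first_line} | retry {attempt}/{max_attempts}"
# compact_error(ConnectionError("pricing API timeout 30s"), 2)
# -> "ConnectionError: pricing API timeout 30s | retry 2/3"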

The pattern in one line: catch → compact → append → retry, all under the step budget. Errors become short observations the loop can act on; Lever 05’s guard-stop (e.g. “same error 3×”) keeps a doomed retry from becoming a runaway.
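The catch → compact → append → retry pattern could be sketched roughly as follows. This is an illustration under stated assumptions, not the lesson's implementation: `run_with_recovery`, its parameters and the two demo tools are hypothetical, the context is a plain list of strings, and the model call that would normally pick the next action between attempts is left out.

```python
def compact_error(e, attempt, max_attempts=3):
    # keep only the first line of the message, truncated to 80 characters
    first_line = (str(e).splitlines() or [""])[0][:80]
    return f"{type(e).__name__}: {first_line} | retry {attempt}/{max_attempts}"

def run_with_recovery(tool, context, max_attempts=3, same_error_limit=3):
    """Catch -> compact -> append -> retry, bounded by a budget and a guard-stop."""
    last_sig, repeats = None, 0
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
            context.append(f"ok: {result}")
            return result
        except Exception as e:
            context.append(compact_error(e, attempt, max_attempts))
            sig = (type(e).__name__, str(e))
            repeats = repeats + 1 if sig == last_sig else 1
            last_sig = sig
            if repeats >= same_error_limit:  # guard-stop: same error N times
                context.append("guard-stop: same error repeated, giving up")
                return None
    context.append("budget exhausted")
    return None

# Hypothetical tool that fails twice, then succeeds.
calls = {"n": 0}
def flaky_pricing():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pricing API timeout 30s")
    return 101.5

# Hypothetical tool that never succeeds.
def doomed():
    raise ConnectionError("pricing API timeout 30s")

context = []
price = run_with_recovery(flaky_pricing, context)
doomed_context = []
run_with_recovery(doomed, doomed_context, max_attempts=5)
```

In the flaky run the context ends up with two one-line errors followed by the successful result. In the doomed run the guard-stop fires after the third identical error, before the five-attempt budget is spent.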

State & ownership — make resume cheap (Lever 07)

Recovery isn’t only in-loop. Sometimes the right move is to pause — for a human, an approval, a fix — and resume later without losing the thread. That requires owning your state. 12-Factor’s prescription:

FACTOR 12

Stateless reducer

Treat a turn as a pure function: (state, input) → new state. No hidden side state means a retry or replay starts from exactly the recorded state, which keeps runs reproducible and observable [1].

FACTOR 6

Launch / pause / resume

If state lives in a serialisable object, you can checkpoint, pause for a human or a failure, and resume from the exact point, with no data loss [1].

FACTOR 5

Unify execution & business state

Keep the agent’s step history and the real-world outcome in one place so they can’t drift apart. This is the basis for trustworthy recovery [1].
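One way to picture unified execution and business state is a single event list from which the business outcome is derived rather than stored separately. The event shapes, the `place_order` tool name and the `order_status` helper below are hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical event shapes; the single `events` list is the only state kept.
events = [
    {"type": "tool_call", "name": "place_order",
     "args": {"symbol": "ACME", "qty": 10}},
    {"type": "tool_error",
     "content": "ConnectionError: broker timeout 30s | retry 1/3"},
    {"type": "tool_call", "name": "place_order",
     "args": {"symbol": "ACME", "qty": 10}},
    {"type": "tool_result",
     "content": {"order_id": "A-1", "status": "filled"}},
]

def order_status(events):
    """Derive the business outcome from the step history itself."""
    for event in reversed(events):
        if event["type"] == "tool_result":
            return event["content"]["status"]
    return "pending"  # no result recorded yet

status_now = order_status(events)
status_after_first_failure = order_status(events[:2])
```

Because the status is computed from the same history the loop reads, there is no second record that could disagree with it.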

The turn as a stateless reducer — pure, serialisable, replayable

def turn(state, inp):             # (state, input) -> new state; no hidden side state
    ...
    return new_state             # serialisable -> checkpoint, pause, resume, replay

# resume after a human approval — nothing lost
state = load(checkpoint_id)
state = turn(state, human_decision)

Put together: because each turn is a pure reduction over owned, serialisable state, a failure need not be fatal. You can retry the turn, hand off and resume, or replay the whole run in an eval. That is what “the loop self-corrects” actually rests on.
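A runnable sketch of pause, resume and replay over a reducer, under some loud assumptions: the state is a plain dict, `save` and `load` are stand-ins that serialise to a JSON string rather than a real checkpoint store, and the model call is omitted so that each input is simply recorded. None of these names come from 12-Factor Agents.

```python
import json

def turn(state, inp):
    """Pure reducer: builds a new state and never mutates the old one."""
    return {"step": state["step"] + 1, "history": state["history"] + [inp]}

def save(state):
    return json.dumps(state)   # stand-in for writing a checkpoint

def load(blob):
    return json.loads(blob)    # stand-in for reading it back

# Run until a human decision is needed, then pause.
state = {"step": 0, "history": []}
state = turn(state, "tool: quote ACME -> 42.0")
checkpoint = save(state)       # the process could exit here

# Resume later from the checkpoint; nothing is lost.
resumed = turn(load(checkpoint), "human: approved")

# Replay: the same inputs over the same initial state reproduce the run.
inputs = ["tool: quote ACME -> 42.0", "human: approved"]
replayed = {"step": 0, "history": []}
for inp in inputs:
    replayed = turn(replayed, inp)
```

The resumed state and the replayed state come out identical, which is the property that makes retries, handoffs and eval replays interchangeable. In a real loop the model's output would probably be recorded as one of the inputs, so that a replay does not depend on re-sampling it.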

Your win today

You can now turn failures into fuel: append errors as ground truth, but compacted not dumped; bound retries with the guard-stop; and own state as a stateless reducer so any run can pause, resume, or replay. Levers 06 and 07, working together.

Recall check

Retrieve from memory. (Questions interleave Lessons 3 and 4.)

Primary source — read this next

HumanLayer — 12-Factor Agents. Read Factor 9 (compact errors), then Factors 5, 6 and 12 on state. This is the clearest production-grade account of recovery and ownership. Pair with the context-engineering piece for how compaction protects the window.

I’m your teacher — use me. Bring me a real failure from your logs and we’ll write its compact form together. Want to check whether your TradingAgents loop could pause-and-resume cleanly, or whether its execution and business state can drift? Ask in the chat.


Sources

[1] HumanLayer, 12-Factor Agents. Factor 9: compact errors into the context window; Factor 12: stateless reducer; Factor 6: launch/pause/resume; Factor 5: unify execution & business state.
[2] Anthropic, Effective Context Engineering. Compaction and tool-result clearing to protect the window.