Loop Engineering: Designing Systems That Prompt Your Agents For You

"I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." — Boris Cherny, Head of Claude Code at Anthropic

That quote launched a new term into the developer vocabulary: loop engineering. It captures a shift that's been building for two years — from crafting individual prompts to designing autonomous systems that run your AI agents while you sleep.

If you've been using AI coding agents, you've felt the ceiling: no matter how good your prompts are, you are the bottleneck. You type, you wait, you read, you type the next thing. The model is fast. You're not. The constraint has moved from "can the model do it?" to "can I keep up with feeding it work?"

Loop engineering removes you from that per-turn loop. You design autonomous systems that prompt your AI agents on a schedule or trigger — no human typing each instruction by hand. Your role shifts from operator to architect.

The term was coined in June 2026 by Addy Osmani (Google) and Boris Cherny (Anthropic). Peter Steinberger captured it well: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

In this post, we'll trace the evolution from prompt engineering to loop engineering, break down what a loop actually is, walk through production patterns, and show a real example — then cover the risks that make this harder than it sounds.

The Four Paradigms: How We Got Here

Working with AI has evolved through four distinct paradigms, each moving the developer further from the per-turn loop and closer to designing autonomous systems.

[Diagram: The Four Paradigms of Working with AI]

Prompt Engineering (2022–2024)

The era of the perfect instruction. You craft few-shot examples, chain-of-thought triggers, careful wording. Every interaction requires your direct input. You optimize one call at a time — no persistence, no tools, no memory.

Your leverage point is the quality of what you say to the model.

The analogy: writing the perfect email to a smart but amnesiac assistant — every single time.

Context Engineering (2024–2025)

The realization that what the model sees matters more than what you say. You build RAG pipelines, structure system prompts, manage retrieval strategies, design memory systems. A mediocre prompt with excellent context beats a perfect prompt with no context.

Your leverage point is the quality of information flow to the model — background knowledge, relevant files, conversation history.

The analogy: setting up a well-organized workspace for your assistant — filing cabinets, reference materials, tools within reach. But you still hand them each task personally.

Harness Engineering (2025–2026)

The focus shifts to the environment the agent operates within. Tool definitions, permission boundaries, verification gates, CLAUDE.md files, agent personas, sandboxed execution. You shape how the agent behaves, what it can access, and how it validates its own work — all within a single run.

A well-harnessed agent with mediocre prompts outperforms a free-roaming agent with brilliant prompts. The harness IS the product.

Your leverage point is the structure and constraints around the agent — its operational boundaries.

The analogy: building a workbench with specialized jigs, guides, and safety stops. Your assistant does better work because the environment prevents mistakes. But you still bring them each job.

Loop Engineering (2026+)

You design autonomous systems that find work, prompt agents, verify results, and persist state — without human input at each step. Scheduling, work discovery, state management, maker-checker separation, escalation paths, trust levels, cost control.

You design the loop once. Then the loop runs the agents. You review outcomes, not individual steps.

Your leverage point is the design of the autonomous cycle itself — how work is discovered, dispatched, verified, and completed without you in the per-turn loop.

The analogy: building a factory floor — conveyor belts, quality inspection stations, routing logic — where work flows through automatically. You designed it, you monitor it, but you're not on the line.

Comparison

| Dimension | Prompt Eng. | Context Eng. | Harness Eng. | Loop Eng. |
| --- | --- | --- | --- | --- |
| Human role | Operator | Curator | Environment designer | Systems architect |
| Per-turn input? | Yes, every turn | Yes, session start | Yes, session start | No (system initiates) |
| Scope | Single call | Session / conversation | Single agent run | Continuous autonomous cycles |
| What you optimize | The prompt text | The information flow | The agent's constraints | The autonomous system |
| Key artifact | Prompt template | RAG pipeline, memory | Rules files, tool config | Scheduled loops, state files |
| Failure mode | Bad output | Wrong context | Wrong behavior | Runaway automation |

These Are Layers, Not Replacements

Each paradigm builds on the previous ones. You still need good prompts inside good context inside a well-designed harness inside your loops. Loop engineering doesn't eliminate the others — it's the outermost layer. A loop with a bad harness is just fast bad output at scale.

What Is a Loop, Exactly?

Now that we see where loop engineering sits in the evolution, let's define what a loop actually is.

A loop is a goal-driven cycle: you define a purpose, and the system iterates (using agents, verification, and external state) until the goal is complete or it hands off to a human. The key distinction: the loop prompts the agent, not you.

Loop vs. Agent Run

An agent run is a single session: you give it a task, it works, it finishes. A loop is the system above: it decides when to start agent runs, what to feed them, how to verify their output, and what to do next.

The agent is the worker. The loop is the manager.

Harness vs. Loop

These two get conflated, but they sit at different levels:

| | Harness | Loop |
| --- | --- | --- |
| What it is | The environment a single agent runs inside | The system one floor above that orchestrates many runs |
| Scope | One agent session | Continuous cycles across time |
| Contains | Tools, rules, permissions, verification gates | Scheduling, discovery, state, maker-checker agents |
| Without the other? | Works fine; you just kick off runs manually | Inherits whatever quality (or lack thereof) the harness provides |

You can have a great harness without a loop. You can't have a good loop without a great harness. The harness shapes one run. The loop orchestrates many.

The Anatomy of a Loop

Every production loop shares six building blocks:

[Diagram: Anatomy of a Loop]

1. Trigger — What starts the loop? A cron schedule, a webhook from CI, a file change event, a PR being opened.

2. Discovery — What needs doing? The loop scans for work: failed CI runs, open issues, stale PRs, new commits to review.

3. State Check — What's already been handled? A state file tracks what's done, what's in progress, what failed. Without this, the loop rediscovers the same work every cycle.

4. Execution: The agent does the work. Critically, it runs in an isolated environment (a git worktree or sandbox), so it can't corrupt your working tree or conflict with other agents running in parallel.

5. Verification — Did it work? Run tests, type-check, lint. Or better: a separate "checker" agent reviews the output independently. This is the gate that makes unattended loops viable.

6. Action — Deliver the result. Open a PR, update a ticket, post to Slack, deploy to staging.

Then: update the state file and wait for the next trigger.
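The six stages above can be sketched as a single function. This is a minimal illustration, not a reference implementation: `LoopState`, `run_cycle`, and the four callables are hypothetical names, and the "agent" is a stand-in lambda rather than a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class LoopState:
    """Durable record of work the loop has delivered or failed on."""
    handled: set[str] = field(default_factory=set)
    failed: dict[str, int] = field(default_factory=dict)


def run_cycle(
    discover: Callable[[], Iterable[str]],   # stage 2: find candidate work items
    execute: Callable[[str], str],           # stage 4: the agent produces a result
    verify: Callable[[str, str], bool],      # stage 5: independent pass/fail check
    deliver: Callable[[str, str], None],     # stage 6: open a PR, post a message, ...
    state: LoopState,
) -> LoopState:
    """Run one cycle; the trigger (stage 1) is whatever calls this function."""
    for item in discover():
        if item in state.handled:            # stage 3: skip work already done
            continue
        result = execute(item)
        if verify(item, result):
            deliver(item, result)
            state.handled.add(item)
        else:
            state.failed[item] = state.failed.get(item, 0) + 1
    return state


# Toy run: two work items, and a verifier that rejects empty results.
delivered: list[tuple[str, str]] = []
final_state = run_cycle(
    discover=lambda: ["ci-101", "ci-102"],
    execute=lambda item: "patch" if item == "ci-101" else "",
    verify=lambda item, result: bool(result),
    deliver=lambda item, result: delivered.append((item, result)),
    state=LoopState(),
)
print(delivered)            # [('ci-101', 'patch')]
print(final_state.failed)   # {'ci-102': 1}
```

Passing each stage in as a callable keeps the cycle itself small and lets you replace the verifier or the delivery step without touching the control flow.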

The Maker-Checker Split

This is the quality mechanism that separates loops that work from loops that produce garbage.

The problem: "The model that wrote the code is way too nice grading its own homework." A single agent asked to both generate and verify tends to approve its own work.

The solution: separate the agent that does the work (maker) from the agent that judges it (checker). The maker drafts. The checker reviews against tests, specs, or acceptance criteria. If the checker rejects, the loop retries or escalates.

[Diagram: Maker-Checker Pattern]

This is not optional for production loops. Without it, you're shipping unchecked work at machine speed.
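As a rough sketch of the retry-or-escalate cycle described above: the maker and checker are passed in as plain functions, and the checker returns `None` to approve or a rejection reason that is fed back to the maker. The function name, the feedback convention, and the toy maker and checker are all assumptions made for illustration.

```python
from typing import Callable, Optional


def maker_checker(
    task: str,
    maker: Callable[[str, Optional[str]], str],     # (task, feedback) -> draft
    checker: Callable[[str, str], Optional[str]],   # (task, draft) -> None if OK, else reason
    max_attempts: int = 3,
) -> tuple[str, Optional[str]]:
    """Return ("approved", draft) or ("escalated", last rejection reason)."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        draft = maker(task, feedback)
        feedback = checker(task, draft)
        if feedback is None:
            return "approved", draft
    return "escalated", feedback


# Stand-ins: the maker only adds a test once the checker has asked for one.
def toy_maker(task: str, feedback: Optional[str]) -> str:
    return "fix + test" if feedback else "fix"


def toy_checker(task: str, draft: str) -> Optional[str]:
    return None if "test" in draft else "missing regression test"


print(maker_checker("repair parser", toy_maker, toy_checker))
# ('approved', 'fix + test')
print(maker_checker("repair parser", toy_maker, toy_checker, max_attempts=1))
# ('escalated', 'missing regression test')
```

In a real loop the checker would be a separate agent or a deterministic test run, not a function that shares context with the maker.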

State: The Durable Spine

"The agent forgets, the repo doesn't."

Every loop needs persistent state that lives outside any single conversation. A state file (JSON, YAML, or just a markdown checklist in your repo) tracks:

  • What's been handled
  • What's currently in progress
  • What failed and how many times
  • What needs human review

Without persistent state, loops are amnesiac — they rediscover the same problems every cycle and retry failures indefinitely.
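One possible shape for such a state file, assuming JSON and the four categories listed above. The field names, the file name, and the three-attempt threshold are illustrative choices, not a standard format.

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or start empty if the loop has never run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"handled": [], "in_progress": [], "failed": {}, "needs_review": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file and rename, so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True))
    tmp.replace(path)


def record_failure(state: dict, item: str, max_attempts: int = 3) -> None:
    """Count a failed attempt; after max_attempts, queue the item for a human."""
    state["failed"][item] = state["failed"].get(item, 0) + 1
    if state["failed"][item] >= max_attempts and item not in state["needs_review"]:
        state["needs_review"].append(item)


# Three simulated cycles, each reloading state from disk as a fresh agent would.
state_path = Path(tempfile.mkdtemp()) / "loop-state.json"
for _ in range(3):
    current = load_state(state_path)
    record_failure(current, "ci-run-42")
    save_state(state_path, current)

print(load_state(state_path)["failed"])        # {'ci-run-42': 3}
print(load_state(state_path)["needs_review"])  # ['ci-run-42']
```

Because every cycle reloads the file, the failure count survives even though no single run remembers the previous one.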

Trust Levels: The Phased Rollout Model

You understand the anatomy. But not every loop should be autonomous from day one. The phased rollout model gives you a framework for progressively granting autonomy.

[Diagram: Trust Level Progression]

| Level | What the Loop Does | What You Do | Examples | When to Promote |
| --- | --- | --- | --- | --- |
| L1: Report Only | Observes and summarizes | Decides and acts | Daily standup summaries, PR status reports, issue triage suggestions | Always start here |
| L2: Assisted Fixes | Proposes changes (opens PRs, drafts patches) | Reviews and approves before merge | Auto-fix CI failures as PR, dependency update PRs, lint corrections | After L1 runs reliably for days with consistent quality |
| L3: Unattended | Acts autonomously within guardrails | Audits after the fact | Auto-merge passing dep updates, auto-fix known flaky tests, auto-deploy to staging | Only after L2 is proven reliable AND you have verification gates + rollback |

The golden rule: never skip levels. Every loop starts at L1. Trust is earned through demonstrated reliability, not assumed through clever design. A loop that skips straight to L3 is how you get surprise production deployments at 3 AM.
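One way to make the levels enforceable rather than aspirational is to gate every action on the loop's current level. The sketch below assumes a hypothetical set of action names and a simple "minimum level required" table; it is an illustration of the idea, not a feature of any particular tool.

```python
from enum import IntEnum


class Trust(IntEnum):
    REPORT_ONLY = 1   # L1: observe and summarize
    ASSISTED = 2      # L2: propose changes, a human merges
    UNATTENDED = 3    # L3: act within guardrails, audited afterwards


# Minimum trust level each action requires (hypothetical action names).
REQUIRED = {
    "post_summary": Trust.REPORT_ONLY,
    "open_pr": Trust.ASSISTED,
    "merge_pr": Trust.UNATTENDED,
    "deploy_staging": Trust.UNATTENDED,
}


def allowed(action: str, level: Trust) -> bool:
    """Unknown actions are denied, so new capabilities must be added on purpose."""
    required = REQUIRED.get(action)
    return required is not None and level >= required


def promote(level: Trust) -> Trust:
    """Move up exactly one level; there is no way to jump from L1 to L3."""
    return Trust(min(level + 1, Trust.UNATTENDED))


level = Trust.REPORT_ONLY
print(allowed("open_pr", level))                 # False
level = promote(level)
print(allowed("open_pr", level))                 # True
print(allowed("merge_pr", level))                # False
print(allowed("force_push", Trust.UNATTENDED))   # False
```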

Seven Production Patterns

Theory is useful, but what do loops look like in practice? These are concrete loops people are running in production today (patterns adapted from the loop-engineering reference repo by Cobus Greyling). Each starts at a safe trust level and can be promoted over time.

| # | Pattern | Cadence | Trust Level | What It Does |
| --- | --- | --- | --- | --- |
| 1 | Daily Triage | 2h–1d | L1 | Scans issues/PRs, summarizes status, flags blockers |
| 2 | PR Babysitter | 5–15min | L1 | Monitors PRs for CI failures, stale reviews, merge conflicts |
| 3 | CI Sweeper | 5–15min | L2 | Detects test/lint failures, attempts auto-fix, opens PR |
| 4 | Dependency Updater | 6h–1d | L2 | Checks outdated deps, creates update PRs with tests |
| 5 | Changelog Drafter | On tag / 1d | L1 | Generates release notes from merged PRs |
| 6 | Post-Merge Cleanup | 6h–1d | L1 | Finds dead code, unused imports, stale TODOs |
| 7 | Issue Triage | 2h–1d | L1 | Labels, categorizes, proposes assignees for new issues |

Pattern Deep Dive: PR Babysitter (L1)

A pure monitoring loop: it never modifies code. This makes it an ideal first loop, because the risk to your codebase is close to zero and the value is immediate visibility into stalled work.

| Stage | What Happens |
| --- | --- |
| Trigger | Every 5 minutes (cron) |
| Discovery | Scan all open PRs for: CI failures older than 30 min, reviews requested > 24h ago, merge conflicts |
| State Check | Skip PRs that were already reported and whose status has not changed since |
| Execution | None: this loop only observes |
| Verification | N/A (no changes to verify) |
| Action | Post summary to Slack or tracking issue; tag relevant people; optionally comment on the PR |
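The discovery rules in this table are simple enough to express as plain code. The sketch below assumes a hypothetical `PullRequest` record that you would fill from your code host's API; only the three thresholds come from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PullRequest:
    number: int
    ci_failed_at: Optional[datetime] = None          # None means CI is not failing
    review_requested_at: Optional[datetime] = None   # None means no pending review
    has_conflicts: bool = False


def find_blockers(pr: PullRequest, now: datetime) -> list[str]:
    """Apply the three discovery rules to one PR."""
    blockers = []
    if pr.ci_failed_at and now - pr.ci_failed_at > timedelta(minutes=30):
        blockers.append("ci_failing")
    if pr.review_requested_at and now - pr.review_requested_at > timedelta(hours=24):
        blockers.append("review_stale")
    if pr.has_conflicts:
        blockers.append("merge_conflict")
    return blockers


now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
prs = [
    PullRequest(1, ci_failed_at=now - timedelta(minutes=45)),
    PullRequest(2, ci_failed_at=now - timedelta(minutes=10)),
    PullRequest(3, review_requested_at=now - timedelta(hours=30), has_conflicts=True),
]
report = {pr.number: find_blockers(pr, now) for pr in prs}
print(report)
# {1: ['ci_failing'], 2: [], 3: ['review_stale', 'merge_conflict']}
```

Keeping the rules deterministic means the agent only has to summarize and phrase the report, not decide what counts as stalled.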

Pattern Deep Dive: CI Sweeper (L2)

The most popular first active loop — it has a clear verification signal (the test suite) and produces tangible value immediately.

| Stage | What Happens |
| --- | --- |
| Trigger | Every 15 minutes, or on CI failure webhook |
| Discovery | Find CI runs that failed in the last hour; filter out already-handled failures, known flaky tests, WIP branches |
| State Check | Read state file; skip runs already attempted or in progress |
| Execution | Spawn agent in a git worktree; feed it the error output; ask it to fix the issue and run tests |
| Verification | Run full test suite in the worktree and confirm there are no regressions; checker agent reviews the diff for sanity |
| Action (pass) | Open PR with explanation of what failed and how it was fixed |
| Action (fail) | Log the attempt, update state file, notify human via Slack/issue |
| Guardrails | Max 3 retries per failure; 50K token budget per attempt; 10-minute time cap; kill switch file |
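The guardrails row can be checked before every agent step. The sketch below uses the limits from the table; the `Guardrails` class, the `stop_reason` function, and the `STOP_LOOPS` file name are hypothetical.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Guardrails:
    max_retries: int = 3
    token_budget: int = 50_000
    time_cap_seconds: float = 600.0
    kill_switch: Path = Path("STOP_LOOPS")   # hypothetical file name


def stop_reason(
    g: Guardrails,
    attempts: int,
    tokens_used: int,
    started_at: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return why the attempt must stop, or None if it may continue."""
    now = time.monotonic() if now is None else now
    if g.kill_switch.exists():
        return "kill switch present"
    if attempts >= g.max_retries:
        return "retry limit reached"
    if tokens_used >= g.token_budget:
        return "token budget exhausted"
    if now - started_at >= g.time_cap_seconds:
        return "time cap exceeded"
    return None


g = Guardrails(kill_switch=Path("no-such-dir") / "STOP_LOOPS")
print(stop_reason(g, attempts=1, tokens_used=12_000, started_at=0.0, now=60.0))   # None
print(stop_reason(g, attempts=3, tokens_used=12_000, started_at=0.0, now=60.0))   # retry limit reached
print(stop_reason(g, attempts=1, tokens_used=50_000, started_at=0.0, now=60.0))   # token budget exhausted
print(stop_reason(g, attempts=1, tokens_used=12_000, started_at=0.0, now=601.0))  # time cap exceeded
```

The kill switch is checked first so that creating one file halts every loop that shares it, regardless of how much budget remains.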

Getting Started: Real Examples

Let's see what all of this looks like when you actually run it. We'll walk through two loops — one L1 (report only) and one L2 (assisted fixes) — using Claude Code.

PR Babysitter in Practice (L1)

The safest first loop: it observes and reports, never touches code. Near-zero risk to the codebase, immediate visibility into stalled work.

/loop 5m For each open PR I care about: triage CI and reviews.
Summarize status, flag blockers, identify stale reviews.
Update pr-babysitter-state.md. Never modify code or push.

Here's how it maps to the six-stage anatomy:

| Loop Stage | What This Example Does |
| --- | --- |
| Trigger | /loop 5m: re-runs every 5 minutes |
| Discovery | Scans open PRs, triages CI status and review state for each |
| State | Reads and updates pr-babysitter-state.md, which tracks what's been reported |
| Execution | None: this loop only observes, never modifies code |
| Verification | N/A (no changes to verify) |
| Action | Reports a status summary; flags blockers and stale reviews |
| Guardrails | Read-only: no code changes, no pushes, no merges |

Why start here? Because it teaches you the loop lifecycle with zero blast radius. You learn how state files work, how often the loop fires, what the agent discovers — all before giving it permission to change anything. Once you trust its observations, promote it: let it suggest fixes (L2), then auto-merge when CI is green and approvals are in (L3).

CI Sweeper in Practice (L2)

Once you're comfortable with L1, step up to an active loop. Here's a CI Sweeper — taken from the loop-engineering examples:

/loop 15m $ci-triage — update ci-sweeper-state.md. Classify failures first.
Fix only clear regressions in a worktree with verifier. Max 3 attempts.
Escalate infra and security test failures.

That single prompt launches an autonomous loop. Here's how it maps to the six-stage anatomy:

| Loop Stage | What This Example Does |
| --- | --- |
| Trigger | /loop 15m: re-runs every 15 minutes automatically |
| Discovery | $ci-triage skill scans CI for failures, classifies them (regression vs. infra vs. security) |
| State | Reads and updates ci-sweeper-state.md, which tracks what's handled and what's pending |
| Execution | Fixes clear regressions in an isolated worktree (won't touch your working tree) |
| Verification | Verifier sub-agent runs tests independently before approving the fix |
| Action | Opens PR for approved fixes; escalates infra/security failures to human |
| Guardrails | Max 3 attempts per failure, then stops retrying |

What is $ci-triage? It's a Claude Code skill: a markdown file (a SKILL.md in its own folder under .claude/skills/) that gives the agent persistent instructions for a specific job. When the loop invokes $ci-triage, Claude reads that skill file and knows how to find CI failures, classify them (regression, infra, security, or flaky), and decide which ones are safe to auto-fix. You write the discovery logic once; the loop reuses it every cycle without re-explaining the context.

For an event-driven version using GitHub Actions (triggers on workflow failure instead of polling), see the ci-sweeper.yml example — it listens to workflow_run completion events on main, records failure context, updates state, and invokes the fix agent.

The loop-engineering repo has ready-to-use examples for all seven patterns across Claude Code, Codex, Grok, and GitHub Actions — clone and adapt to your project.

Tools and Platforms

Once you understand the pattern, the question is where to run it. Here's what's available today.

Claude Code

Claude Code has first-class loop engineering support:

  • /loop — Re-runs a prompt on a configurable cadence (the simplest possible loop)
  • /goal — Runs until a condition evaluates true (goal-oriented loop)
  • Hooks — Shell commands that fire before/after tool calls, acting as verification gates
  • Worktrees — Isolated parallel execution built into the agent framework
  • .claude/agents/ — Persistent agent definitions for reusable maker/checker agents
  • Skills — Project-specific knowledge that prevents re-explaining context each cycle
  • Memory — State that persists across sessions without cluttering the conversation

OpenAI Codex

  • Automations tab — Scheduled agent runs with event-driven triggers
  • .codex/agents/ — TOML-based agent definitions for specialized loop roles
  • Built-in worktree support — Isolated execution per task

Build Your Own

You don't need a platform. The primitives are simple:

cron + CLI agent + git worktrees + state file = loop from scratch

GitHub Actions works well as loop infrastructure: use a schedule trigger, install your agent CLI, read/write state to a file in the repo. You get logging, notifications, and visibility into runner usage for free; token spend on the agent side still needs its own tracking.

Use platform features when they save meaningful complexity; roll your own when you need full control over the execution model.
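As a rough sketch of the worktree part of that recipe: the function below creates a worktree on a fresh branch, runs an agent command inside it, and removes the worktree afterwards. The `git worktree add -b` and `git worktree remove --force` commands are real; `my-agent fix-ci`, the branch naming, and the directory layout are placeholders for whatever CLI and conventions you use. Scheduling (cron) and the state file are left out.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# A runner takes (command, working directory) and returns the exit code.
Runner = Callable[[Sequence[str], Path], int]


def real_runner(cmd: Sequence[str], cwd: Path) -> int:
    return subprocess.run(list(cmd), cwd=cwd).returncode


def run_in_worktree(repo: Path, task_id: str, agent_cmd: Sequence[str],
                    run: Runner = real_runner) -> bool:
    """Create a worktree on a fresh branch, run the agent inside it, clean up."""
    worktree = repo.parent / f"{repo.name}-loop-{task_id}"
    branch = f"loop/{task_id}"
    if run(["git", "worktree", "add", "-b", branch, str(worktree)], repo) != 0:
        return False
    try:
        return run(list(agent_cmd), worktree) == 0
    finally:
        # The branch survives removal, so committed work can still become a PR.
        run(["git", "worktree", "remove", "--force", str(worktree)], repo)


# Dry run with a recording runner, so nothing touches git or a real agent.
calls: list[tuple[list[str], Path]] = []


def recording_runner(cmd: Sequence[str], cwd: Path) -> int:
    calls.append((list(cmd), cwd))
    return 0


ok = run_in_worktree(Path("/srv/app"), "ci-42", ["my-agent", "fix-ci"], run=recording_runner)
for cmd, cwd in calls:
    print(cwd, cmd)
```

Injecting the runner keeps the sketch testable without a repository. Note that `--force` discards uncommitted changes, so the agent is assumed to commit its work before it exits.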

The Risks: What Loop Engineering Gets Wrong

The patterns above make loops sound like pure upside. They're not. The failure modes are real and different from anything you've dealt with in manual prompting.

Comprehension Debt

"The faster the loop ships code you didn't write, the bigger the gap between what exists and what you understand."

A smooth loop accelerates comprehension debt unless you actively read what it produces. The loop saves your time authoring. It doesn't save your time understanding.

Mitigation: Read the PRs. Review the diffs. Schedule weekly sessions to understand what the loop shipped. If you can't explain a change, you don't understand your codebase anymore.

Intent Debt

Agents fill any gap in your instructions with confident guesses. In a loop, those guesses compound across cycles without you noticing until the accumulated drift is substantial.

Mitigation: Narrow scope per loop. Explicit acceptance criteria. Skills and rules files that encode your intent precisely. The tighter the spec, the less room for confident wrong guesses.

Cognitive Surrender

"Designing the loop is the cure when done with judgment and the accelerant when done to avoid thinking — same action, opposite result."

The temptation to stop having opinions when loops run themselves is real. But you're still the engineer. The loop is your tool, not your replacement. The moment you stop caring what it produces is the moment you lose control of your codebase.

Runaway Costs

A 15-minute loop with sub-agents can burn hundreds of dollars per day if uncapped. Token costs are invisible until the bill arrives.

Mitigation: Budget caps per cycle from day one. Start at L1 (cheap — just reporting). Monitor costs before promoting to L2/L3. Alert thresholds for daily spend.
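To see how a 15-minute cadence reaches hundreds of dollars, it helps to run the arithmetic. The token count and price below are assumed figures chosen for illustration, not measurements or published rates; substitute your own.

```python
def cycles_per_day(interval_minutes: int) -> int:
    return (24 * 60) // interval_minutes


def daily_cost_usd(interval_minutes: int, tokens_per_cycle: int,
                   usd_per_million_tokens: float) -> float:
    """Worst case: every cycle spends its full token allowance."""
    tokens = cycles_per_day(interval_minutes) * tokens_per_cycle
    return tokens * usd_per_million_tokens / 1_000_000


def within_budget(spent_today_usd: float, next_cycle_usd: float,
                  daily_cap_usd: float) -> bool:
    """Check before starting a cycle, not after the bill arrives."""
    return spent_today_usd + next_cycle_usd <= daily_cap_usd


# Assumed figures: 400K tokens per cycle across the maker and its sub-agents,
# at a blended 10 USD per million tokens.
print(cycles_per_day(15))                  # 96
print(daily_cost_usd(15, 400_000, 10.0))   # 384.0
print(within_budget(48.0, 4.0, 50.0))      # False
```

Under these assumptions the loop fires 96 times a day, so even 4 USD per cycle adds up to 384 USD daily. A pre-cycle check like `within_budget` turns the daily cap into a hard stop instead of an alert.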

Non-Determinism

Same loop + same code does not equal same results. Two people can build identical loops and get opposite outcomes. The agent's generation is inherently stochastic.

Mitigation: Deterministic verification gates. The non-determinism in generation is acceptable if verification is reliable. This is why the verification signal is the most important design decision.
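A deterministic gate can be as plain as "run these commands and require every exit code to be zero". The sketch below uses stand-in commands so it runs anywhere; `run_gate` is a hypothetical helper, and in practice the list would hold your project's own test, type-check, and lint commands.

```python
import subprocess
import sys
from typing import Sequence


def run_gate(checks: Sequence[Sequence[str]], cwd: str = ".",
             timeout: float = 600.0) -> tuple[bool, list[str]]:
    """Run each check command; pass only if every one exits with status 0."""
    failures = []
    for cmd in checks:
        try:
            proc = subprocess.run(list(cmd), cwd=cwd, capture_output=True,
                                  text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            failures.append(f"{' '.join(cmd)}: timed out")
            continue
        if proc.returncode != 0:
            failures.append(f"{' '.join(cmd)}: exit {proc.returncode}")
    return (not failures, failures)


# Stand-in checks: one that succeeds and one that exits with status 3.
passing = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]

print(run_gate([passing]))               # (True, [])
gate_ok, gate_failures = run_gate([passing, failing])
print(gate_ok, len(gate_failures))       # False 1
```

The gate gives the same verdict for the same code no matter which agent produced it, which is what lets a stochastic generator sit behind it safely.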

The "Done" Illusion

"Done" is a claim, not a proof — even with verifier agents. Tests can pass while behavior is subtly wrong. A checker agent can miss edge cases. Responsibility doesn't transfer to the loop.

Mitigation: Treat loop output like junior dev output — trust but verify. L1 and L2 exist for this reason. Don't skip to L3 until you've built genuine confidence.

Best Practices

With the risks in mind, here are distilled principles from practitioners running loops in production:

| # | Principle | Why It Matters |
| --- | --- | --- |
| 1 | Start at L1, earn L3 | Report before you fix. Fix before you deploy. Trust is earned. |
| 2 | Verification is everything | No deterministic check = no reliable loop. Non-negotiable. |
| 3 | Separate maker from checker | Never let the agent grade its own work. |
| 4 | State lives in the repo | The agent forgets. The state file remembers. Git tracks both. |
| 5 | One loop, one job | Narrow scope = reliable outcomes. Add more loops, not more scope. |
| 6 | Budget every cycle | Token limits, iteration caps, time bounds. No exceptions. |
| 7 | Fail loud, not long | Three attempts, then escalate. Don't retry forever. |
| 8 | Read what ships | The loop saves authoring time, not review time. |
| 9 | Design for interruption | Every loop needs a kill switch and a human triage queue. |
| 10 | Iterate the loop itself | Your loop design is a product. Monitor, measure, improve. |

When NOT to Use a Loop

Finally, a reality check. Not everything benefits from autonomous iteration. A loop is overkill — or actively harmful — when:

  • There's no verification signal. If you can't define pass/fail programmatically, you can't close the loop. Subjective tasks (design decisions, naming, architecture choices) need human judgment, not retry logic.
  • The task is one-off. Setting up a loop for a task you'll do once is over-engineering. Just prompt the agent directly.
  • You're exploring, not executing. Research, prototyping, and "what if" work are interactive by nature. A loop assumes you know what "done" looks like before you start.
  • The cost of being wrong is high and irreversible. Database migrations, production deployments to critical systems, anything where a wrong move has blast radius you can't undo with a revert.
  • You don't understand the domain well enough to write verification. If you can't specify what "correct" means, you can't build a reliable checker — and a loop without a reliable checker is a machine for shipping mistakes faster.

The right question isn't "can I build a loop for this?" — it's "do I have a verification signal that makes autonomy safe?"

Closing: The Architect's Mindset

The paradigm shift is real — from "use the agent" to "design the system the agent runs in." But strip away the buzzword and loop engineering is just engineering: scheduling, state management, error handling, observability — applied to AI agents instead of microservices.

The biggest risk isn't the loop failing. It's the loop succeeding at things you don't understand. Comprehension debt is the silent killer of loop-driven teams.

Build the loop. Design it with judgment. Monitor what it ships. Stay curious about the code it writes.

"Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go." — Addy Osmani

References