Loop Engineering: Designing Systems That Prompt Your Agents For You
"I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." — Boris Cherny, Head of Claude Code at Anthropic
That quote launched a new term into the developer vocabulary: loop engineering. It captures a shift that's been building for two years — from crafting individual prompts to designing autonomous systems that run your AI agents while you sleep.
If you've been using AI coding agents, you've felt the ceiling: no matter how good your prompts are, you are the bottleneck. You type, you wait, you read, you type the next thing. The model is fast. You're not. The constraint has moved from "can the model do it?" to "can I keep up with feeding it work?"
Loop engineering removes you from that per-turn loop. You design autonomous systems that prompt your AI agents on a schedule or trigger — no human typing each instruction by hand. Your role shifts from operator to architect.
The term was coined in June 2026 by Addy Osmani (Google) and Boris Cherny (Anthropic). Peter Steinberger captured it well: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."
In this post, we'll trace the evolution from prompt engineering to loop engineering, break down what a loop actually is, walk through production patterns, and show a real example — then cover the risks that make this harder than it sounds.
The Four Paradigms: How We Got Here
Working with AI has evolved through four distinct paradigms, each moving the developer further from the per-turn loop and closer to designing autonomous systems.
Prompt Engineering (2022–2024)
The era of the perfect instruction. You craft few-shot examples, chain-of-thought triggers, careful wording. Every interaction requires your direct input. You optimize one call at a time — no persistence, no tools, no memory.
Your leverage point is the quality of what you say to the model.
The analogy: writing the perfect email to a smart but amnesiac assistant — every single time.
Context Engineering (2024–2025)
The realization that what the model sees matters more than what you say. You build RAG pipelines, structure system prompts, manage retrieval strategies, design memory systems. A mediocre prompt with excellent context beats a perfect prompt with no context.
Your leverage point is the quality of information flow to the model — background knowledge, relevant files, conversation history.
The analogy: setting up a well-organized workspace for your assistant — filing cabinets, reference materials, tools within reach. But you still hand them each task personally.
Harness Engineering (2025–2026)
The focus shifts to the environment the agent operates within. Tool definitions, permission boundaries, verification gates, CLAUDE.md files, agent personas, sandboxed execution. You shape how the agent behaves, what it can access, and how it validates its own work — all within a single run.
A well-harnessed agent with mediocre prompts outperforms a free-roaming agent with brilliant prompts. The harness IS the product.
Your leverage point is the structure and constraints around the agent — its operational boundaries.
The analogy: building a workbench with specialized jigs, guides, and safety stops. Your assistant does better work because the environment prevents mistakes. But you still bring them each job.
Loop Engineering (2026+)
You design autonomous systems that find work, prompt agents, verify results, and persist state — without human input at each step. Scheduling, work discovery, state management, maker-checker separation, escalation paths, trust levels, cost control.
You design the loop once. Then the loop runs the agents. You review outcomes, not individual steps.
Your leverage point is the design of the autonomous cycle itself — how work is discovered, dispatched, verified, and completed without you in the per-turn loop.
The analogy: building a factory floor — conveyor belts, quality inspection stations, routing logic — where work flows through automatically. You designed it, you monitor it, but you're not on the line.
Comparison
| Dimension | Prompt Eng. | Context Eng. | Harness Eng. | Loop Eng. |
|---|---|---|---|---|
| Human role | Operator | Curator | Environment designer | Systems architect |
| Per-turn input? | Yes, every turn | Yes, session start | Yes, session start | No — system initiates |
| Scope | Single call | Session / conversation | Single agent run | Continuous autonomous cycles |
| What you optimize | The prompt text | The information flow | The agent's constraints | The autonomous system |
| Key artifact | Prompt template | RAG pipeline, memory | Rules files, tool config | Scheduled loops, state files |
| Failure mode | Bad output | Wrong context | Wrong behavior | Runaway automation |
These Are Layers, Not Replacements
Each paradigm builds on the previous ones. You still need good prompts inside good context inside a well-designed harness inside your loops. Loop engineering doesn't eliminate the others — it's the outermost layer. A loop with a bad harness is just fast bad output at scale.
What Is a Loop, Exactly?
Now that we see where loop engineering sits in the evolution, let's define what a loop actually is.
A loop is a recursive goal where you define a purpose and the system iterates — using agents, verification, and external state — until the goal is complete or it hands off to a human. The key distinction: the loop prompts the agent, not you.
Loop vs. Agent Run
An agent run is a single session: you give it a task, it works, it finishes. A loop is the system above it: it decides when to start agent runs, what to feed them, how to verify their output, and what to do next.
The agent is the worker. The loop is the manager.
Harness vs. Loop
These two get conflated, but they sit at different levels:
| | Harness | Loop |
|---|---|---|
| What it is | The environment a single agent runs inside | The system one floor above that orchestrates many runs |
| Scope | One agent session | Continuous cycles across time |
| Contains | Tools, rules, permissions, verification gates | Scheduling, discovery, state, maker-checker agents |
| Without the other? | Works fine — you just kick off runs manually | Inherits whatever quality (or lack thereof) the harness provides |
You can have a great harness without a loop. You can't have a good loop without a great harness. The harness shapes one run. The loop orchestrates many.
The Anatomy of a Loop
Every production loop shares six building blocks:
1. Trigger — What starts the loop? A cron schedule, a webhook from CI, a file change event, a PR being opened.
2. Discovery — What needs doing? The loop scans for work: failed CI runs, open issues, stale PRs, new commits to review.
3. State Check — What's already been handled? A state file tracks what's done, what's in progress, what failed. Without this, the loop rediscovers the same work every cycle.
4. Execution — The agent does the work. Critically, in an isolated environment — a git worktree or sandbox — so it can't corrupt your working tree or conflict with other agents running in parallel.
5. Verification — Did it work? Run tests, type-check, lint. Or better: a separate "checker" agent reviews the output independently. This is the gate that makes unattended loops viable.
6. Action — Deliver the result. Open a PR, update a ticket, post to Slack, deploy to staging.
Then: update the state file and wait for the next trigger.
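The six stages above can be sketched as one function that runs a single cycle. This is a minimal illustration, not a reference implementation: `LoopState`, `run_cycle`, and the four callables are hypothetical names introduced here, and the trigger (stage 1) is assumed to be whatever scheduler calls the function.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class LoopState:
    """Durable record of what the loop has already dealt with (hypothetical shape)."""
    handled: set[str] = field(default_factory=set)
    failed: dict[str, int] = field(default_factory=dict)


def run_cycle(
    discover: Callable[[], Iterable[str]],  # stage 2: find candidate work items
    execute: Callable[[str], str],          # stage 4: the agent produces a result
    verify: Callable[[str, str], bool],     # stage 5: independent pass/fail gate
    act: Callable[[str, str], None],        # stage 6: deliver (open a PR, post a summary)
    state: LoopState,
) -> LoopState:
    """Run one cycle. Stage 1, the trigger, is whatever calls this function."""
    for item in discover():
        if item in state.handled:           # stage 3: skip work that is already done
            continue
        result = execute(item)
        if verify(item, result):
            act(item, result)
            state.handled.add(item)
        else:
            # Count the failure so a later cycle can cap retries or escalate.
            state.failed[item] = state.failed.get(item, 0) + 1
    return state
```

Passing the stages in as callables keeps the cycle logic testable without a real agent: each stage can be swapped for a stub while the control flow stays the same.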
The Maker-Checker Split
This is the quality mechanism that separates loops that work from loops that produce garbage.
The problem: "The model that wrote the code is way too nice grading its own homework." A single agent asked to generate and verify will almost always approve its own work.
The solution: separate the agent that does the work (maker) from the agent that judges it (checker). The maker drafts. The checker reviews against tests, specs, or acceptance criteria. If the checker rejects, the loop retries or escalates.
This is not optional for production loops. Without it, you're shipping unchecked work at machine speed.
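A rough sketch of the maker-checker split, assuming the maker and checker are two separately configured agents exposed as plain callables. The function name, the callable signatures, and the default of three attempts are illustrative assumptions, not part of any platform's API.

```python
from typing import Callable, Optional


def maker_checker(
    task: str,
    maker: Callable[[str, Optional[str]], str],    # (task, previous objection) -> draft
    checker: Callable[[str, str], Optional[str]],  # (task, draft) -> None if OK, else objection
    max_attempts: int = 3,
) -> tuple[str, Optional[str]]:
    """Return ("approved", draft) or ("escalate", last objection)."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        draft = maker(task, feedback)    # the maker sees the checker's last objection
        feedback = checker(task, draft)  # the checker never sees the maker's reasoning
        if feedback is None:
            return "approved", draft
    # Retries exhausted: hand off to a human instead of looping forever.
    return "escalate", feedback
```

The checker's only output is an objection or approval, so the maker cannot approve its own work. The escalation path is the explicit second return value.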
State: The Durable Spine
"The agent forgets, the repo doesn't."
Every loop needs persistent state that lives outside any single conversation. A state file (JSON, YAML, or just a markdown checklist in your repo) tracks:
- What's been handled
- What's currently in progress
- What failed and how many times
- What needs human review
Without persistent state, loops are amnesiac — they rediscover the same problems every cycle and retry failures indefinitely.
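One possible shape for such a state file, sketched as JSON on disk. The four keys mirror the list above; the key names, the helper names, and the three-attempt threshold are assumptions made for illustration.

```python
import copy
import json
import os
import tempfile
from pathlib import Path

# Hypothetical layout mirroring the four things the state file tracks.
EMPTY_STATE = {"handled": [], "in_progress": [], "failed": {}, "needs_review": []}


def load_state(path: Path) -> dict:
    """Read the state file, or start from an empty state on the first cycle."""
    if not path.exists():
        return copy.deepcopy(EMPTY_STATE)
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp, path)


def record_failure(state: dict, item: str, max_attempts: int = 3) -> None:
    """Count a failed attempt; once max_attempts is reached, queue the item for a human."""
    state["failed"][item] = state["failed"].get(item, 0) + 1
    if state["failed"][item] >= max_attempts and item not in state["needs_review"]:
        state["needs_review"].append(item)
```

Because the failure count survives between cycles, the loop can stop retrying an item instead of rediscovering it forever.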
Trust Levels: The Phased Rollout Model
You understand the anatomy. But not every loop should be autonomous from day one. The phased rollout model gives you a framework for progressively granting autonomy.
| Level | What the Loop Does | What You Do | Examples | When to Promote |
|---|---|---|---|---|
| L1: Report Only | Observes and summarizes | Decides and acts | Daily standup summaries, PR status reports, issue triage suggestions | Always start here |
| L2: Assisted Fixes | Proposes changes (opens PRs, drafts patches) | Reviews and approves before merge | Auto-fix CI failures as PR, dependency update PRs, lint corrections | After L1 runs reliably for days with consistent quality |
| L3: Unattended | Acts autonomously within guardrails | Audits after the fact | Auto-merge passing dep updates, auto-fix known flaky tests, auto-deploy to staging | Only after L2 is proven reliable AND you have verification gates + rollback |
The golden rule: never skip levels. Every loop starts at L1. Trust is earned through demonstrated reliability, not assumed through clever design. A loop that skips straight to L3 is how you get surprise production deployments at 3 AM.
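The "never skip levels" rule can be enforced in code rather than left to convention. The sketch below is one way to do that: the action names in `ALLOWED_ACTIONS` and the `has_gates_and_rollback` flag are hypothetical, chosen to match the table above.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    REPORT_ONLY = 1     # L1
    ASSISTED_FIXES = 2  # L2
    UNATTENDED = 3      # L3


# Hypothetical action names; each level is a superset of the one below it.
ALLOWED_ACTIONS = {
    TrustLevel.REPORT_ONLY: {"report"},
    TrustLevel.ASSISTED_FIXES: {"report", "open_pr"},
    TrustLevel.UNATTENDED: {"report", "open_pr", "merge", "deploy_staging"},
}


def is_allowed(level: TrustLevel, action: str) -> bool:
    """Check an action against the loop's current trust level before performing it."""
    return action in ALLOWED_ACTIONS[level]


def promote(current: TrustLevel, target: TrustLevel, has_gates_and_rollback: bool = False) -> TrustLevel:
    """Move up exactly one level; L3 also requires verification gates and rollback."""
    if target != current + 1:
        raise ValueError("never skip levels: promote one step at a time")
    if target is TrustLevel.UNATTENDED and not has_gates_and_rollback:
        raise ValueError("L3 requires verification gates and rollback")
    return target
```

With this in place, a loop configured at L1 that tries to open a PR fails the `is_allowed` check instead of silently acting above its trust level.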
Seven Production Patterns
Theory is useful, but what do loops look like in practice? These are concrete loops people are running in production today (patterns adapted from the loop-engineering reference repo by Cobus Greyling). Each starts at a safe trust level and can be promoted over time.
| # | Pattern | Cadence | Trust Level | What It Does |
|---|---|---|---|---|
| 1 | Daily Triage | 2h–1d | L1 | Scans issues/PRs, summarizes status, flags blockers |
| 2 | PR Babysitter | 5–15min | L1 | Monitors PRs for CI failures, stale reviews, merge conflicts |
| 3 | CI Sweeper | 5–15min | L2 | Detects test/lint failures, attempts auto-fix, opens PR |
| 4 | Dependency Updater | 6h–1d | L2 | Checks outdated deps, creates update PRs with tests |
| 5 | Changelog Drafter | On tag / 1d | L1 | Generates release notes from merged PRs |
| 6 | Post-Merge Cleanup | 6h–1d | L1 | Finds dead code, unused imports, stale TODOs |
| 7 | Issue Triage | 2h–1d | L1 | Labels, categorizes, proposes assignees for new issues |
Pattern Deep Dive: PR Babysitter (L1)
A pure monitoring loop: it never modifies code. This makes it an ideal first loop. Because it cannot change anything in the repository, the worst case is a noisy notification, and the value is immediate visibility into stalled work.
| Stage | What Happens |
|---|---|
| Trigger | Every 5 minutes (cron) |
| Discovery | Scan all open PRs for: CI failures older than 30 min, reviews requested > 24h ago, merge conflicts |
| State Check | Skip PRs already reported, unless their status has changed since the last report |
| Execution | None — this loop only observes |
| Verification | N/A (no changes to verify) |
| Action | Post summary to Slack or tracking issue; tag relevant people; optionally comment on the PR |
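The Discovery row above reduces to three threshold checks per PR. Here is a small sketch of that filter, assuming PR data has already been fetched into plain records. The `PullRequest` fields and `find_blockers` are hypothetical names; the 30-minute and 24-hour thresholds come from the table.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record; a real loop would fill this from its code host's API."""
    number: int
    ci_failed_at: Optional[datetime] = None
    review_requested_at: Optional[datetime] = None
    has_conflicts: bool = False


def find_blockers(prs: list[PullRequest], now: datetime) -> dict[int, list[str]]:
    """Map PR number to the reasons it needs attention; healthy PRs are omitted."""
    report: dict[int, list[str]] = {}
    for pr in prs:
        reasons = []
        if pr.ci_failed_at and now - pr.ci_failed_at > timedelta(minutes=30):
            reasons.append("CI failing for more than 30 minutes")
        if pr.review_requested_at and now - pr.review_requested_at > timedelta(hours=24):
            reasons.append("review requested more than 24 hours ago")
        if pr.has_conflicts:
            reasons.append("merge conflicts")
        if reasons:
            report[pr.number] = reasons
    return report
```

The function only reads and reports, which matches the L1 contract: nothing in it can modify a PR.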
Pattern Deep Dive: CI Sweeper (L2)
The most popular first active loop — it has a clear verification signal (the test suite) and produces tangible value immediately.
| Stage | What Happens |
|---|---|
| Trigger | Every 15 minutes, or on CI failure webhook |
| Discovery | Find CI runs that failed in the last hour; filter out already-handled failures, known flaky tests, WIP branches |
| State Check | Read state file — skip runs already attempted or in progress |
| Execution | Spawn agent in a git worktree; feed it the error output; ask it to fix the issue and run tests |
| Verification | Run the full test suite in the worktree and confirm there are no regressions; a checker agent then reviews the diff for sanity |
| Action (pass) | Open PR with explanation of what failed and how it was fixed |
| Action (fail) | Log the attempt, update state file, notify human via Slack/issue |
| Guardrails | Max 3 retries per failure; 50K token budget per attempt; 10-minute time cap; kill switch file |
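The Guardrails row can be expressed as a single check that runs before every attempt. This is a sketch under assumptions: the limits are the ones in the table, while `Guardrails`, `stop_reason`, and the `.loop-stop` kill switch file name are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Guardrails:
    max_retries: int = 3
    token_budget: int = 50_000
    time_cap_seconds: int = 10 * 60
    kill_switch: Path = Path(".loop-stop")  # hypothetical file name


def stop_reason(g: Guardrails, attempts: int, tokens_used: int, elapsed_seconds: float) -> Optional[str]:
    """Return why the loop must stop working on this failure, or None to continue."""
    if g.kill_switch.exists():  # a human created the file: halt immediately
        return "kill switch present"
    if attempts >= g.max_retries:
        return "retry limit reached"
    if tokens_used >= g.token_budget:
        return "token budget exhausted"
    if elapsed_seconds >= g.time_cap_seconds:
        return "time cap exceeded"
    return None
```

Checking the kill switch first means a human can stop the loop even when every other limit still has headroom.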
Getting Started: Real Examples
Let's see what all of this looks like when you actually run it. We'll walk through two loops — one L1 (report only) and one L2 (assisted fixes) — using Claude Code.
PR Babysitter in Practice (L1)
The safest first loop: it observes and reports, never touches code. The worst case is a noisy report, and you get immediate visibility into stalled work.
/loop 5m For each open PR I care about: triage CI and reviews.
Summarize status, flag blockers, identify stale reviews.
Update pr-babysitter-state.md. Never modify code or push.
Here's how it maps to the six-stage anatomy:
| Loop Stage | What This Example Does |
|---|---|
| Trigger | /loop 5m — re-runs every 5 minutes |
| Discovery | Scans open PRs, triages CI status and review state for each |
| State | Reads and updates pr-babysitter-state.md — tracks what's been reported |
| Execution | None — this loop only observes, never modifies code |
| Verification | N/A (no changes to verify) |
| Action | Posts summary to tracking issue; flags blockers and stale reviews |
| Guardrails | Read-only — no code changes, no pushes, no merges |
Why start here? Because it teaches you the loop lifecycle with zero blast radius. You learn how state files work, how often the loop fires, what the agent discovers — all before giving it permission to change anything. Once you trust its observations, promote it: let it suggest fixes (L2), then auto-merge when CI is green and approvals are in (L3).
CI Sweeper in Practice (L2)
Once you're comfortable with L1, step up to an active loop. Here's a CI Sweeper — taken from the loop-engineering examples:
/loop 15m $ci-triage — update ci-sweeper-state.md. Classify failures first.
Fix only clear regressions in a worktree with verifier. Max 3 attempts.
Escalate infra and security test failures.
That single command launches an autonomous loop. Here's how it maps to the six-stage anatomy:
| Loop Stage | What This Example Does |
|---|---|
| Trigger | /loop 15m — re-runs every 15 minutes automatically |
| Discovery | $ci-triage skill scans CI for failures, classifies them (regression vs. infra vs. security) |
| State | Reads and updates ci-sweeper-state.md — tracks what's handled, what's pending |
| Execution | Fixes clear regressions in an isolated worktree (won't touch your working tree) |
| Verification | Verifier sub-agent runs tests independently before approving the fix |
| Action | Opens PR for approved fixes; escalates infra/security failures to human |
| Guardrails | Max 3 attempts per failure, then stops retrying |
What is $ci-triage? It's a Claude Code skill — a markdown file stored in .claude/skills/ that gives the agent persistent instructions for a specific job. When the loop invokes $ci-triage, Claude reads that skill file and knows how to find CI failures, classify them (regression vs. infra vs. flaky), and decide which ones are safe to auto-fix. You write the discovery logic once; the loop reuses it every cycle without re-explaining the context.
For an event-driven version using GitHub Actions (triggers on workflow failure instead of polling), see the ci-sweeper.yml example — it listens to workflow_run completion events on main, records failure context, updates state, and invokes the fix agent.
The loop-engineering repo has ready-to-use examples for all seven patterns across Claude Code, Codex, Grok, and GitHub Actions — clone and adapt to your project.
Tools and Platforms
Once you understand the pattern, the question is where to run it. Here's what's available today.
Claude Code
Claude Code has first-class loop engineering support:
- /loop: Re-runs a prompt on a configurable cadence (the simplest possible loop)
- /goal: Runs until a condition evaluates true (goal-oriented loop)
- Hooks: Shell commands that fire before/after tool calls, acting as verification gates
- Worktrees: Isolated parallel execution built into the agent framework
- .claude/agents/: Persistent agent definitions for reusable maker/checker agents
- Skills: Project-specific knowledge that prevents re-explaining context each cycle
- Memory: State that persists across sessions without cluttering the conversation
OpenAI Codex
- Automations tab: Scheduled agent runs with event-driven triggers
- .codex/agents/: TOML-based agent definitions for specialized loop roles
- Built-in worktree support: Isolated execution per task
Build Your Own
You don't need a platform. The primitives are simple:
cron + CLI agent + git worktrees + state file = loop from scratch
GitHub Actions works well as loop infrastructure: use a schedule trigger, install your agent CLI, read/write state to a file in the repo. You get run logs and failure notifications for free; token spend still has to be tracked through your model provider's billing.
Use platform features when they save meaningful complexity; roll your own when you need full control over the execution model.
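To make the "cron + CLI agent + git worktrees + state file" recipe concrete, here is a sketch of the commands one cycle might run for a single work item. `agent-cli run` is a hypothetical placeholder for whatever agent CLI you use, and the worktree path and branch naming are assumptions. Only the `git worktree add` and `git worktree remove` subcommands are real.

```python
import subprocess

# Hypothetical placeholder: substitute your agent's real command line.
AGENT_CMD = ["agent-cli", "run"]

# A cron entry along these lines would act as the trigger (paths are illustrative):
#   */15 * * * * cd /path/to/repo && python loop.py


def plan_cycle(item_id: str, prompt: str, base: str = "main") -> list[tuple[list[str], str]]:
    """Return (argv, working directory) pairs for one isolated agent run."""
    worktree = f"../loop-{item_id}"
    return [
        # Create an isolated checkout on a fresh branch cut from `base`.
        (["git", "worktree", "add", "-b", f"loop/{item_id}", worktree, base], "."),
        # Run the agent inside the worktree so it cannot touch the main checkout.
        (AGENT_CMD + [prompt], worktree),
        # Clean up; git refuses this if the worktree has uncommitted changes.
        (["git", "worktree", "remove", worktree], "."),
    ]


def run_plan(plan: list[tuple[list[str], str]]) -> None:
    """Execute the plan in order, stopping at the first command that fails."""
    for argv, cwd in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Separating planning from execution keeps the sketch inspectable: you can print or log the plan before anything runs, which is useful while a loop is still at L1.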
The Risks: What Loop Engineering Gets Wrong
The patterns above make loops sound like pure upside. They're not. The failure modes are real and different from anything you've dealt with in manual prompting.
Comprehension Debt
"The faster the loop ships code you didn't write, the bigger the gap between what exists and what you understand."
A smooth loop accelerates comprehension debt unless you actively read what it produces. The loop saves your time authoring. It doesn't save your time understanding.
Mitigation: Read the PRs. Review the diffs. Schedule weekly sessions to understand what the loop shipped. If you can't explain a change, you don't understand your codebase anymore.
Intent Debt
Agents fill any gap in your instructions with confident guesses. In a loop, those guesses compound across cycles without you noticing until the accumulated drift is substantial.
Mitigation: Narrow scope per loop. Explicit acceptance criteria. Skills and rules files that encode your intent precisely. The tighter the spec, the less room for confident wrong guesses.
Cognitive Surrender
"Designing the loop is the cure when done with judgment and the accelerant when done to avoid thinking — same action, opposite result."
The temptation to stop having opinions when loops run themselves is real. But you're still the engineer. The loop is your tool, not your replacement. The moment you stop caring what it produces is the moment you lose control of your codebase.
Runaway Costs
A 15-minute loop with sub-agents can burn hundreds of dollars per day if uncapped. Token costs are invisible until the bill arrives.
Mitigation: Budget caps per cycle from day one. Start at L1 (cheap — just reporting). Monitor costs before promoting to L2/L3. Alert thresholds for daily spend.
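The arithmetic behind that warning is simple enough to put in code. The figures in the usage note (500,000 tokens per cycle and a blended price of $10 per million tokens) are illustrative assumptions, not measured values or any provider's actual pricing.

```python
def daily_cost_usd(interval_minutes: float, tokens_per_cycle: float, usd_per_million_tokens: float) -> float:
    """Estimated spend per day for a loop that fires every `interval_minutes`."""
    cycles_per_day = 24 * 60 / interval_minutes
    return cycles_per_day * tokens_per_cycle / 1_000_000 * usd_per_million_tokens


def tokens_per_cycle_cap(daily_budget_usd: float, interval_minutes: float, usd_per_million_tokens: float) -> float:
    """Largest per-cycle token budget that keeps the loop under a daily spend limit."""
    cycles_per_day = 24 * 60 / interval_minutes
    return daily_budget_usd / cycles_per_day / usd_per_million_tokens * 1_000_000
```

Under those assumed figures, a 15-minute loop fires 96 times a day, so `daily_cost_usd(15, 500_000, 10)` comes to 480 dollars. Working backwards, a 50-dollar daily budget allows roughly 52,000 tokens per cycle, close to the 50K cap in the CI Sweeper guardrails.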
Non-Determinism
Same loop + same code does not equal same results. Two people can build identical loops and get opposite outcomes. The agent's generation is inherently stochastic.
Mitigation: Deterministic verification gates. The non-determinism in generation is acceptable if verification is reliable. This is why the verification signal is the most important design decision.
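A deterministic gate can be as plain as running fixed commands and trusting only their exit codes. The sketch below assumes each check is a command line that exits 0 on success; the function names are hypothetical, and the check commands themselves (test runner, linter, type checker) are whatever your project already uses.

```python
from __future__ import annotations

import subprocess
import sys


def verification_gate(checks: dict[str, list[str]], cwd: str = ".") -> dict[str, bool]:
    """Run each named check and record whether it exited with status 0."""
    return {
        name: subprocess.run(argv, cwd=cwd, capture_output=True).returncode == 0
        for name, argv in checks.items()
    }


def gate_passes(results: dict[str, bool]) -> bool:
    """Pass only if at least one check ran and every check succeeded."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    # Stand-in checks that use the current interpreter; replace with real commands.
    demo = {
        "always_ok": [sys.executable, "-c", "raise SystemExit(0)"],
        "always_fails": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    print(verification_gate(demo))
```

An empty set of checks is treated as a failure on purpose: a gate that verifies nothing should not be able to approve anything.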
The "Done" Illusion
"Done" is a claim, not a proof — even with verifier agents. Tests can pass while behavior is subtly wrong. A checker agent can miss edge cases. Responsibility doesn't transfer to the loop.
Mitigation: Treat loop output like junior dev output — trust but verify. L1 and L2 exist for this reason. Don't skip to L3 until you've built genuine confidence.
Best Practices
With the risks in mind, here are distilled principles from practitioners running loops in production:
| # | Principle | Why It Matters |
|---|---|---|
| 1 | Start at L1, earn L3 | Report before you fix. Fix before you deploy. Trust is earned. |
| 2 | Verification is everything | No deterministic check = no reliable loop. Non-negotiable. |
| 3 | Separate maker from checker | Never let the agent grade its own work. |
| 4 | State lives in the repo | The agent forgets. The state file remembers. Git tracks both. |
| 5 | One loop, one job | Narrow scope = reliable outcomes. Add more loops, not more scope. |
| 6 | Budget every cycle | Token limits, iteration caps, time bounds. No exceptions. |
| 7 | Fail loud, not long | Three attempts, then escalate. Don't retry forever. |
| 8 | Read what ships | The loop saves authoring time, not review time. |
| 9 | Design for interruption | Every loop needs a kill switch and a human triage queue. |
| 10 | Iterate the loop itself | Your loop design is a product. Monitor, measure, improve. |
When NOT to Use a Loop
Finally, a reality check. Not everything benefits from autonomous iteration. A loop is overkill — or actively harmful — when:
- There's no verification signal. If you can't define pass/fail programmatically, you can't close the loop. Subjective tasks (design decisions, naming, architecture choices) need human judgment, not retry logic.
- The task is one-off. Setting up a loop for a task you'll do once is over-engineering. Just prompt the agent directly.
- You're exploring, not executing. Research, prototyping, and "what if" work are interactive by nature. A loop assumes you know what "done" looks like before you start.
- The cost of being wrong is high and irreversible. Database migrations, production deployments to critical systems, anything where a wrong move has blast radius you can't undo with a revert.
- You don't understand the domain well enough to write verification. If you can't specify what "correct" means, you can't build a reliable checker — and a loop without a reliable checker is a machine for shipping mistakes faster.
The right question isn't "can I build a loop for this?" — it's "do I have a verification signal that makes autonomy safe?"
Closing: The Architect's Mindset
The paradigm shift is real — from "use the agent" to "design the system the agent runs in." But strip away the buzzword and loop engineering is just engineering: scheduling, state management, error handling, observability — applied to AI agents instead of microservices.
The biggest risk isn't the loop failing. It's the loop succeeding at things you don't understand. Comprehension debt is the silent killer of loop-driven teams.
Build the loop. Design it with judgment. Monitor what it ships. Stay curious about the code it writes.
"Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go." — Addy Osmani
References
- Addy Osmani, "Loop Engineering" (June 8, 2026)
- Cobus Greyling, loop-engineering — CLI tools and patterns reference (MIT)
- Anthropic, "Building Effective Agents" (Dec 2024)
- Phil Schmid, "Agents: Inner Loop vs Outer Loop"
- Martin Fowler, "Humans and Agents in Software Engineering Loops"
- Claude Code Documentation — Best practices for agentic loops