Loop Engineering: Designing Systems That Prompt Your Agents For You

June 19, 2026

"I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." — Boris Cherny, Head of Claude Code at Anthropic

That quote launched a new term into the developer vocabulary: loop engineering. It captures a shift that's been building for two years — from crafting individual prompts to designing autonomous systems that run your AI agents while you sleep.

If you've been using AI coding agents, you've felt the ceiling: no matter how good your prompts are, you are the bottleneck. You type, you wait, you read, you type the next thing. The model is fast. You're not. The constraint has moved from "can the model do it?" to "can I keep up with feeding it work?"

Loop engineering removes you from that per-turn loop. You design autonomous systems that prompt your AI agents on a schedule or trigger — no human typing each instruction by hand. Your role shifts from operator to architect.

The term was coined in June 2026 by Addy Osmani (Google) and Boris Cherny (Anthropic). Peter Steinberger captured it well: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

In this post, we'll trace the evolution from prompt engineering to loop engineering, break down what a loop actually is, walk through production patterns, and show two real examples. Then we'll cover the risks that make this harder than it sounds.

The Four Paradigms: How We Got Here

Working with AI has evolved through four distinct paradigms, each moving the developer further from the per-turn loop and closer to designing autonomous systems.

[Diagram: The Four Paradigms of Working with AI]

Prompt Engineering (2022–2024)

The era of the perfect instruction. You craft few-shot examples, chain-of-thought triggers, careful wording. Every interaction requires your direct input. You optimize one call at a time — no persistence, no tools, no memory.

Your leverage point is the quality of what you say to the model.

The analogy: writing the perfect email to a smart but amnesiac assistant — every single time.

Context Engineering (2024–2025)

The realization that what the model sees matters more than what you say. You build RAG pipelines, structure system prompts, manage retrieval strategies, design memory systems. A mediocre prompt with excellent context beats a perfect prompt with no context.

Your leverage point is the quality of information flow to the model — background knowledge, relevant files, conversation history.

The analogy: setting up a well-organized workspace for your assistant — filing cabinets, reference materials, tools within reach. But you still hand them each task personally.

Harness Engineering (2025–2026)

The focus shifts to the environment the agent operates within. Tool definitions, permission boundaries, verification gates, CLAUDE.md files, agent personas, sandboxed execution. You shape how the agent behaves, what it can access, and how it validates its own work — all within a single run.

A well-harnessed agent with mediocre prompts outperforms a free-roaming agent with brilliant prompts. The harness IS the product.

Your leverage point is the structure and constraints around the agent — its operational boundaries.

The analogy: building a workbench with specialized jigs, guides, and safety stops. Your assistant does better work because the environment prevents mistakes. But you still bring them each job.

Loop Engineering (2026+)

You design autonomous systems that find work, prompt agents, verify results, and persist state — without human input at each step. Scheduling, work discovery, state management, maker-checker separation, escalation paths, trust levels, cost control.

You design the loop once. Then the loop runs the agents. You review outcomes, not individual steps.

Your leverage point is the design of the autonomous cycle itself — how work is discovered, dispatched, verified, and completed without you in the per-turn loop.

The analogy: building a factory floor — conveyor belts, quality inspection stations, routing logic — where work flows through automatically. You designed it, you monitor it, but you're not on the line.

Comparison

Dimension	Prompt Eng.	Context Eng.	Harness Eng.	Loop Eng.
Human role	Operator	Curator	Environment designer	Systems architect
Per-turn input?	Yes, every turn	Yes, session start	Yes, session start	No — system initiates
Scope	Single call	Session / conversation	Single agent run	Continuous autonomous cycles
What you optimize	The prompt text	The information flow	The agent's constraints	The autonomous system
Key artifact	Prompt template	RAG pipeline, memory	Rules files, tool config	Scheduled loops, state files
Failure mode	Bad output	Wrong context	Wrong behavior	Runaway automation

These Are Layers, Not Replacements

Each paradigm builds on the previous ones. You still need good prompts inside good context inside a well-designed harness inside your loops. Loop engineering doesn't eliminate the others — it's the outermost layer. A loop with a bad harness is just fast bad output at scale.

What Is a Loop, Exactly?

Now that we see where loop engineering sits in the evolution, let's define what a loop actually is.

A loop is a goal pursued iteratively: you define a purpose, and the system keeps cycling through agents, verification, and external state until the goal is complete or it hands off to a human. The key distinction: the loop prompts the agent, not you.

Loop vs. Agent Run

An agent run is a single session: you give it a task, it works, it finishes. A loop is the system one level above: it decides when to start agent runs, what to feed them, how to verify their output, and what to do next.

The agent is the worker. The loop is the manager.

Harness vs. Loop

These two get conflated, but they sit at different levels:

Dimension	Harness	Loop
What it is	The environment a single agent runs inside	The system one floor above that orchestrates many runs
Scope	One agent session	Continuous cycles across time
Contains	Tools, rules, permissions, verification gates	Scheduling, discovery, state, maker-checker agents
Without the other?	Works fine — you just kick off runs manually	Inherits whatever quality (or lack thereof) the harness provides

You can have a great harness without a loop. You can't have a good loop without a great harness. The harness shapes one run. The loop orchestrates many.

The Anatomy of a Loop

Every production loop shares six building blocks:

[Diagram: Anatomy of a Loop]

1. Trigger — What starts the loop? A cron schedule, a webhook from CI, a file change event, a PR being opened.

2. Discovery — What needs doing? The loop scans for work: failed CI runs, open issues, stale PRs, new commits to review.

3. State Check — What's already been handled? A state file tracks what's done, what's in progress, what failed. Without this, the loop rediscovers the same work every cycle.

4. Execution — The agent does the work. Critically, in an isolated environment — a git worktree or sandbox — so it can't corrupt your working tree or conflict with other agents running in parallel.

5. Verification — Did it work? Run tests, type-check, lint. Or better: a separate "checker" agent reviews the output independently. This is the gate that makes unattended loops viable.

6. Action — Deliver the result. Open a PR, update a ticket, post to Slack, deploy to staging.

Then: update the state file and wait for the next trigger.
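The six stages above can be sketched as one function that a scheduler calls once per trigger. This is a minimal illustration, not a reference implementation: `run_cycle`, `LoopState`, and the four callables are hypothetical names, and the state is held in memory here rather than in a file.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class LoopState:
    """In-memory stand-in for the state file: item id -> status."""
    status: dict[str, str] = field(default_factory=dict)


def run_cycle(
    discover: Callable[[], Iterable[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    act: Callable[[str, str], None],
    state: LoopState,
) -> list[str]:
    """Run one cycle; the trigger (stage 1) is whatever calls this function."""
    delivered = []
    for item in discover():                  # 2. Discovery
        if item in state.status:             # 3. State check: skip anything already seen
            continue
        state.status[item] = "in_progress"
        result = execute(item)               # 4. Execution (isolation is the caller's job)
        if verify(item, result):             # 5. Verification gate
            act(item, result)                # 6. Action
            state.status[item] = "done"
            delivered.append(item)
        else:
            state.status[item] = "failed"    # recorded so the next cycle does not rediscover it
    return delivered
```

Keeping each stage behind its own callable makes it easy to start with a report-only `execute` and swap in a real agent later. A fuller version would retry failed items up to a cap instead of skipping them permanently.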

The Maker-Checker Split

This is the quality mechanism that separates loops that work from loops that produce garbage.

The problem: "The model that wrote the code is way too nice grading its own homework." A single agent asked to generate and verify will almost always approve its own work.

The solution: separate the agent that does the work (maker) from the agent that judges it (checker). The maker drafts. The checker reviews against tests, specs, or acceptance criteria. If the checker rejects, the loop retries or escalates.

[Diagram: Maker-Checker Pattern]

This is not optional for production loops. Without it, you're shipping unchecked work at machine speed.
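A rough sketch of the maker-checker control flow, assuming both agents are exposed as plain callables. The names `maker_checker` and `Verdict` and the three-attempt default are illustrative choices, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


def maker_checker(
    task: str,
    maker: Callable[[str, str], str],
    checker: Callable[[str, str], Verdict],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return an approved draft, or None when the loop should escalate to a human."""
    feedback = ""
    for _ in range(max_attempts):
        draft = maker(task, feedback)       # maker sees the task plus the last rejection
        verdict = checker(task, draft)      # checker judges the draft separately
        if verdict.approved:
            return draft
        feedback = verdict.feedback         # feed the rejection into the next attempt
    return None                             # attempts exhausted: escalate
```

The important property is that `checker` never shares the maker's conversation: it receives only the task and the draft, so it has no stake in approving the work.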

State: The Durable Spine

"The agent forgets, the repo doesn't."

Every loop needs persistent state that lives outside any single conversation. A state file (JSON, YAML, or just a markdown checklist in your repo) tracks:

What's been handled
What's currently in progress
What failed and how many times
What needs human review

Without persistent state, loops are amnesiac — they rediscover the same problems every cycle and retry failures indefinitely.
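One possible shape for such a state file, sketched with JSON. The schema (`items`, `status`, `attempts`) and the function names are assumptions made for illustration; a markdown checklist would serve the same purpose.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or start empty on the first cycle."""
    if path.exists():
        return json.loads(path.read_text())
    return {"items": {}}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp, path)


def record_failure(state: dict, item_id: str, max_attempts: int = 3) -> str:
    """Count a failed attempt; flag the item for human review once the cap is hit."""
    entry = state["items"].setdefault(item_id, {"status": "in_progress", "attempts": 0})
    entry["attempts"] += 1
    entry["status"] = "needs_review" if entry["attempts"] >= max_attempts else "failed"
    return entry["status"]
```

Storing the attempt count next to the status is what lets the loop stop retrying: the third failure moves the item into the human review queue instead of back into discovery.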

Trust Levels: The Phased Rollout Model

You understand the anatomy. But not every loop should be autonomous from day one. The phased rollout model gives you a framework for progressively granting autonomy.

[Diagram: Trust Level Progression]

Level	What the Loop Does	What You Do	Examples	When to Promote
L1: Report Only	Observes and summarizes	Decides and acts	Daily standup summaries, PR status reports, issue triage suggestions	Always start here
L2: Assisted Fixes	Proposes changes (opens PRs, drafts patches)	Reviews and approves before merge	Auto-fix CI failures as PR, dependency update PRs, lint corrections	After L1 runs reliably for days with consistent quality
L3: Unattended	Acts autonomously within guardrails	Audits after the fact	Auto-merge passing dep updates, auto-fix known flaky tests, auto-deploy to staging	Only after L2 is proven reliable AND you have verification gates + rollback

The golden rule: never skip levels. Every loop starts at L1. Trust is earned through demonstrated reliability, not assumed through clever design. A loop that skips straight to L3 is how you get surprise production deployments at 3 AM.
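The trust levels can be enforced in code rather than by convention. The sketch below is one way to do that; the action names and the `REQUIRED` mapping are hypothetical and would be replaced by whatever your loop can actually do.

```python
from enum import IntEnum


class Trust(IntEnum):
    REPORT_ONLY = 1     # L1: observe and summarize
    ASSISTED = 2        # L2: propose changes for human approval
    UNATTENDED = 3      # L3: act within guardrails


# Minimum trust level each action requires (action names are illustrative).
REQUIRED = {
    "post_summary": Trust.REPORT_ONLY,
    "open_pr": Trust.ASSISTED,
    "merge_pr": Trust.UNATTENDED,
    "deploy_staging": Trust.UNATTENDED,
}


def is_allowed(action: str, level: Trust) -> bool:
    """Unknown actions are denied, so new capabilities must be granted explicitly."""
    required = REQUIRED.get(action)
    return required is not None and level >= required


def promote(level: Trust) -> Trust:
    """Move up exactly one level; there is no way to jump from L1 to L3."""
    return Trust(min(level + 1, Trust.UNATTENDED))
```

Because `promote` only ever adds one, skipping a level requires editing the code, which is a deliberate speed bump.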

Seven Production Patterns

Theory is useful, but what do loops look like in practice? These are concrete loops people are running in production today (patterns adapted from the loop-engineering reference repo by Cobus Greyling). Each starts at a safe trust level and can be promoted over time.

#	Pattern	Cadence	Trust Level	What It Does
1	Daily Triage	2h–1d	L1	Scans issues/PRs, summarizes status, flags blockers
2	PR Babysitter	5–15min	L1	Monitors PRs for CI failures, stale reviews, merge conflicts
3	CI Sweeper	5–15min	L2	Detects test/lint failures, attempts auto-fix, opens PR
4	Dependency Updater	6h–1d	L2	Checks outdated deps, creates update PRs with tests
5	Changelog Drafter	On tag / 1d	L1	Generates release notes from merged PRs
6	Post-Merge Cleanup	6h–1d	L1	Finds dead code, unused imports, stale TODOs
7	Issue Triage	2h–1d	L1	Labels, categorizes, proposes assignees for new issues

Pattern Deep Dive: PR Babysitter (L1)

A pure monitoring loop: it never modifies code. That makes it an ideal first loop, because it cannot break your code and the value is immediate visibility into stalled work.

Stage	What Happens
Trigger	Every 5 minutes (cron)
Discovery	Scan all open PRs for: CI failures older than 30 min, reviews requested > 24h ago, merge conflicts
State Check	Skip PRs whose current status was already reported in an earlier cycle
Execution	None — this loop only observes
Verification	N/A (no changes to verify)
Action	Post summary to Slack or tracking issue; tag relevant people; optionally comment on the PR
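The discovery and state-check rows of this table reduce to a few threshold comparisons. Here is a sketch under the assumption that PR data has already been fetched into a simple record; `PullRequest`, `find_blockers`, and `report` are made-up names, and no real API client is shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PullRequest:
    number: int
    ci_failed_at: Optional[datetime] = None         # None means CI is not failing
    review_requested_at: Optional[datetime] = None  # None means no pending review request
    has_conflicts: bool = False


def find_blockers(pr: PullRequest, now: datetime) -> list[str]:
    """Return the reasons this PR should appear in the report (empty if healthy)."""
    reasons = []
    if pr.ci_failed_at and now - pr.ci_failed_at > timedelta(minutes=30):
        reasons.append("ci_failing_over_30m")
    if pr.review_requested_at and now - pr.review_requested_at > timedelta(hours=24):
        reasons.append("review_stale_over_24h")
    if pr.has_conflicts:
        reasons.append("merge_conflict")
    return reasons


def report(prs: list[PullRequest], already_reported: set[int], now: datetime) -> dict[int, list[str]]:
    """Build the report, skipping PRs the state file says were already reported."""
    found = {pr.number: find_blockers(pr, now) for pr in prs if pr.number not in already_reported}
    return {number: reasons for number, reasons in found.items() if reasons}
```

Passing `now` in explicitly keeps the thresholds testable; the loop itself would supply the current time and the set of PR numbers recorded in its state file.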

Pattern Deep Dive: CI Sweeper (L2)

The most popular first active loop — it has a clear verification signal (the test suite) and produces tangible value immediately.

Stage	What Happens
Trigger	Every 15 minutes, or on CI failure webhook
Discovery	Find CI runs that failed in the last hour; filter out already-handled failures, known flaky tests, WIP branches
State Check	Read state file — skip runs already attempted or in progress
Execution	Spawn agent in a git worktree; feed it the error output; ask it to fix the issue and run tests
Verification	Run full test suite in the worktree — no regressions? Checker agent reviews the diff for sanity
Action (pass)	Open PR with explanation of what failed and how it was fixed
Action (fail)	Log the attempt, update state file, notify human via Slack/issue
Guardrails	Max 3 retries per failure; 50K token budget per attempt; 10-minute time cap; kill switch file
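The guardrails row translates into a check that runs before every attempt. The sketch below uses the numbers from the table; the class name, the `STOP_LOOP` file name, and the way token usage is reported are assumptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Guardrails:
    max_attempts: int = 3
    token_budget: int = 50_000
    time_cap_seconds: float = 600.0            # the 10-minute cap
    kill_switch: Path = Path("STOP_LOOP")      # hypothetical file name

    def stop_reason(self, attempts: int, tokens_used: int, started_at: float, now: float) -> Optional[str]:
        """Return why the loop must stop, or None if another attempt is allowed."""
        if self.kill_switch.exists():          # a human created the file: stop immediately
            return "kill_switch"
        if attempts >= self.max_attempts:
            return "max_attempts"
        if tokens_used >= self.token_budget:
            return "token_budget"
        if now - started_at >= self.time_cap_seconds:
            return "time_cap"
        return None
```

Checking the kill switch first means that creating a single file in the repo halts the loop regardless of how much budget remains.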

Getting Started: Real Examples

Let's see what all of this looks like when you actually run it. We'll walk through two loops — one L1 (report only) and one L2 (assisted fixes) — using Claude Code.

PR Babysitter in Practice (L1)

The safest first loop: it observes and reports but never touches code. There is no risk to the codebase, and you get immediate visibility into stalled work.

/loop 5m For each open PR I care about: triage CI and reviews.
Summarize status, flag blockers, identify stale reviews.
Update pr-babysitter-state.md. Never modify code or push.

Here's how it maps to the six-stage anatomy:

Loop Stage	What This Example Does
Trigger	/loop 5m — re-runs every 5 minutes
Discovery	Scans open PRs, triages CI status and review state for each
State	Reads and updates pr-babysitter-state.md — tracks what's been reported
Execution	None — this loop only observes, never modifies code
Verification	N/A (no changes to verify)
Action	Summarizes status and flags blockers and stale reviews; name a destination such as a tracking issue in the prompt if you want the summary posted
Guardrails	Read-only — no code changes, no pushes, no merges

Why start here? Because it teaches you the loop lifecycle with zero blast radius. You learn how state files work, how often the loop fires, what the agent discovers — all before giving it permission to change anything. Once you trust its observations, promote it: let it suggest fixes (L2), then auto-merge when CI is green and approvals are in (L3).

CI Sweeper in Practice (L2)

Once you're comfortable with L1, step up to an active loop. Here's a CI Sweeper — taken from the loop-engineering examples:

/loop 15m $ci-triage — update ci-sweeper-state.md. Classify failures first.
Fix only clear regressions in a worktree with verifier. Max 3 attempts.
Escalate infra and security test failures.

That single prompt launches an autonomous loop. Here's how it maps to the six-stage anatomy:

Loop Stage	What This Example Does
Trigger	/loop 15m — re-runs every 15 minutes automatically
Discovery	$ci-triage skill scans CI for failures, classifies them (regression vs. infra vs. security)
State	Reads and updates ci-sweeper-state.md — tracks what's handled, what's pending
Execution	Fixes clear regressions in an isolated worktree (won't touch your working tree)
Verification	Verifier sub-agent runs tests independently before approving the fix
Action	Opens PR for approved fixes; escalates infra/security failures to human
Guardrails	Max 3 attempts per failure, then stops retrying
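The routing rule in that prompt (fix clear regressions, escalate everything else, stop after three attempts) can be written down as a small decision function. This is an illustrative sketch only: `FailureClass` and `route` are hypothetical names, and the classification itself is assumed to come from the skill.

```python
from enum import Enum


class FailureClass(Enum):
    REGRESSION = "regression"
    INFRA = "infra"
    SECURITY = "security"


def route(failure: FailureClass, attempts_so_far: int, max_attempts: int = 3) -> str:
    """Decide what the loop does with one classified CI failure."""
    if failure is not FailureClass.REGRESSION:
        return "escalate"            # infra and security failures always go to a human
    if attempts_so_far >= max_attempts:
        return "escalate"            # retry cap reached
    return "fix_in_worktree"
```

Classifying before fixing matters because the cheapest mistake to prevent is an agent "fixing" a flaky runner or a security test by editing application code.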

What is $ci-triage? It's a Claude Code skill: a markdown file stored in .claude/skills/ that gives the agent persistent instructions for a specific job. When the loop invokes $ci-triage, Claude reads that skill file and knows how to find CI failures, classify them (regression vs. infra vs. security), and decide which ones are safe to auto-fix. You write the discovery logic once; the loop reuses it every cycle without re-explaining the context.

For an event-driven version using GitHub Actions (triggers on workflow failure instead of polling), see the ci-sweeper.yml example — it listens to workflow_run completion events on main, records failure context, updates state, and invokes the fix agent.

The loop-engineering repo has ready-to-use examples for all seven patterns across Claude Code, Codex, Grok, and GitHub Actions — clone and adapt to your project.

Tools and Platforms

Once you understand the pattern, the question is where to run it. Here's what's available today.

Claude Code

Claude Code has first-class loop engineering support:

/loop — Re-runs a prompt on a configurable cadence (the simplest possible loop)
/goal — Runs until a condition evaluates true (goal-oriented loop)
Hooks — Shell commands that fire before/after tool calls, acting as verification gates
Worktrees — Isolated parallel execution built into the agent framework
.claude/agents/ — Persistent agent definitions for reusable maker/checker agents
Skills — Project-specific knowledge that prevents re-explaining context each cycle
Memory — State that persists across sessions without cluttering the conversation

OpenAI Codex

Automations tab — Scheduled agent runs with event-driven triggers
.codex/agents/ — TOML-based agent definitions for specialized loop roles
Built-in worktree support — Isolated execution per task

Build Your Own

You don't need a platform. The primitives are simple:

cron + CLI agent + git worktrees + state file = loop from scratch

GitHub Actions works well as loop infrastructure: use a schedule trigger, install your agent CLI, and read/write state to a file in the repo. You get run logs and failure notifications for free. Token spend is billed by the model provider, so track it separately.
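For the from-scratch route, the worktree step might look like the sketch below. The `git worktree add` and `git worktree remove` subcommands are real; `run_in_worktree`, the branch naming scheme, and `agent_cmd` are placeholders, and the sketch assumes the agent commits its work to the branch before the worktree is removed.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_in_worktree(
    repo: Path,
    task_id: str,
    agent_cmd: Sequence[str],
    run: Callable[..., object] = subprocess.run,
) -> str:
    """Create an isolated worktree, run the agent inside it, then clean up."""
    worktree = repo.parent / f"{repo.name}-loop-{task_id}"
    branch = f"loop/{task_id}"
    # Creates a new branch checked out in a separate directory next to the repo.
    run(["git", "worktree", "add", "-b", branch, str(worktree)], cwd=repo, check=True)
    try:
        # agent_cmd is whatever CLI you use; it is a placeholder, not a real tool's syntax.
        run(list(agent_cmd), cwd=worktree, check=True)
    finally:
        # Removes the directory only; commits on the branch survive for a later PR.
        run(["git", "worktree", "remove", "--force", str(worktree)], cwd=repo, check=True)
    return branch
```

A crontab entry such as `*/15 * * * * cd /path/to/repo && python loop.py` (path and script name hypothetical) would supply the trigger. The `run` parameter exists so the function can be exercised with a fake runner before it touches a real repository.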

Use platform features when they save meaningful complexity; roll your own when you need full control over the execution model.

The Risks: What Loop Engineering Gets Wrong

The patterns above make loops sound like pure upside. They're not. The failure modes are real and different from anything you've dealt with in manual prompting.

Comprehension Debt

"The faster the loop ships code you didn't write, the bigger the gap between what exists and what you understand."

A smooth loop accelerates comprehension debt unless you actively read what it produces. The loop saves your time authoring. It doesn't save your time understanding.

Mitigation: Read the PRs. Review the diffs. Schedule weekly sessions to understand what the loop shipped. If you can't explain a change, you don't understand your codebase anymore.

Intent Debt

Agents fill any gap in your instructions with confident guesses. In a loop, those guesses compound across cycles without you noticing until the accumulated drift is substantial.

Mitigation: Narrow scope per loop. Explicit acceptance criteria. Skills and rules files that encode your intent precisely. The tighter the spec, the less room for confident wrong guesses.

Cognitive Surrender

"Designing the loop is the cure when done with judgment and the accelerant when done to avoid thinking — same action, opposite result."

The temptation to stop having opinions when loops run themselves is real. But you're still the engineer. The loop is your tool, not your replacement. The moment you stop caring what it produces is the moment you lose control of your codebase.

Runaway Costs

A 15-minute loop with sub-agents can burn hundreds of dollars per day if uncapped. Token costs are invisible until the bill arrives.

Mitigation: Budget caps per cycle from day one. Start at L1 (cheap — just reporting). Monitor costs before promoting to L2/L3. Alert thresholds for daily spend.
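The arithmetic behind a per-cycle cap is short enough to write out. In the sketch below the token counts and the price per million tokens are placeholder inputs, not real pricing; substitute your provider's rates.

```python
def cycles_per_day(interval_minutes: int) -> int:
    """How many times a fixed-interval loop fires in 24 hours."""
    return (24 * 60) // interval_minutes


def daily_cost(interval_minutes: int, tokens_per_cycle: int, usd_per_million_tokens: float) -> float:
    """Projected spend if every cycle uses its full token allowance."""
    return cycles_per_day(interval_minutes) * tokens_per_cycle * usd_per_million_tokens / 1_000_000


def max_tokens_per_cycle(interval_minutes: int, daily_budget_usd: float, usd_per_million_tokens: float) -> int:
    """Largest per-cycle token cap that keeps the worst case under the daily budget."""
    per_cycle_usd = daily_budget_usd / cycles_per_day(interval_minutes)
    return int(per_cycle_usd / usd_per_million_tokens * 1_000_000)
```

A 15-minute loop fires 96 times a day, so with a hypothetical 200,000 tokens per cycle at a hypothetical $10 per million tokens, `daily_cost(15, 200_000, 10.0)` comes to $192. Working backwards, a $20 daily budget at the same price allows roughly 20,800 tokens per cycle.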

Non-Determinism

Same loop + same code does not equal same results. Two people can build identical loops and get opposite outcomes. The agent's generation is inherently stochastic.

Mitigation: Deterministic verification gates. The non-determinism in generation is acceptable if verification is reliable. This is why the verification signal is the most important design decision.
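A deterministic gate can be as plain as "every check command must exit with status 0". The sketch below assumes the checks are ordinary command lines; `verification_gate` is a hypothetical name, and the two example commands merely stand in for a real test suite and linter.

```python
import subprocess
import sys
from typing import Sequence


def verification_gate(checks: Sequence[Sequence[str]], cwd: str = ".", timeout: float = 600.0) -> bool:
    """Pass only if every check command exits with status 0; stop at the first failure."""
    for command in checks:
        try:
            completed = subprocess.run(list(command), cwd=cwd, capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return False          # a hung or missing tool counts as a failed check
        if completed.returncode != 0:
            return False
    return True


# Example: two trivial checks that stand in for a test suite and a linter.
checks = [
    [sys.executable, "-c", "assert 1 + 1 == 2"],
    [sys.executable, "-c", "import json"],
]
print(verification_gate(checks))
```

Nothing in this gate asks a model for an opinion, so the same diff produces the same verdict on every run. A checker agent can be layered on top, but it should not replace the exit-code check.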

The "Done" Illusion

"Done" is a claim, not a proof — even with verifier agents. Tests can pass while behavior is subtly wrong. A checker agent can miss edge cases. Responsibility doesn't transfer to the loop.

Mitigation: Treat loop output like junior dev output — trust but verify. L1 and L2 exist for this reason. Don't skip to L3 until you've built genuine confidence.

Best Practices

With the risks in mind, here are distilled principles from practitioners running loops in production:

#	Principle	Why It Matters
1	Start at L1, earn L3	Report before you fix. Fix before you deploy. Trust is earned.
2	Verification is everything	No deterministic check = no reliable loop. Non-negotiable.
3	Separate maker from checker	Never let the agent grade its own work.
4	State lives in the repo	The agent forgets. The state file remembers. Git tracks both.
5	One loop, one job	Narrow scope = reliable outcomes. Add more loops, not more scope.
6	Budget every cycle	Token limits, iteration caps, time bounds. No exceptions.
7	Fail loud, not long	Three attempts, then escalate. Don't retry forever.
8	Read what ships	The loop saves authoring time, not review time.
9	Design for interruption	Every loop needs a kill switch and a human triage queue.
10	Iterate the loop itself	Your loop design is a product. Monitor, measure, improve.

When NOT to Use a Loop

Finally, a reality check. Not everything benefits from autonomous iteration. A loop is overkill — or actively harmful — when:

There's no verification signal. If you can't define pass/fail programmatically, you can't close the loop. Subjective tasks (design decisions, naming, architecture choices) need human judgment, not retry logic.
The task is one-off. Setting up a loop for a task you'll do once is over-engineering. Just prompt the agent directly.
You're exploring, not executing. Research, prototyping, and "what if" work are interactive by nature. A loop assumes you know what "done" looks like before you start.
The cost of being wrong is high and irreversible. Database migrations, production deployments to critical systems, anything where a wrong move has blast radius you can't undo with a revert.
You don't understand the domain well enough to write verification. If you can't specify what "correct" means, you can't build a reliable checker — and a loop without a reliable checker is a machine for shipping mistakes faster.

The right question isn't "can I build a loop for this?" — it's "do I have a verification signal that makes autonomy safe?"

Closing: The Architect's Mindset

The paradigm shift is real — from "use the agent" to "design the system the agent runs in." But strip away the buzzword and loop engineering is just engineering: scheduling, state management, error handling, observability — applied to AI agents instead of microservices.

The biggest risk isn't the loop failing. It's the loop succeeding at things you don't understand. Comprehension debt is the silent killer of loop-driven teams.

Build the loop. Design it with judgment. Monitor what it ships. Stay curious about the code it writes.

"Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go." — Addy Osmani

References

Addy Osmani, "Loop Engineering" (June 8, 2026)
Cobus Greyling, loop-engineering — CLI tools and patterns reference (MIT)
Anthropic, "Building Effective Agents" (Dec 2024)
Phil Schmid, "Agents: Inner Loop vs Outer Loop"
Martin Fowler, "Humans and Agents in Software Engineering Loops"
Claude Code Documentation — Best practices for agentic loops