Agent Harness Engineering
The previous section explained what AI agents are: systems where a language model runs in a loop, calling tools and reasoning over results until a goal is achieved. But that description glosses over a critical question — who runs the loop?
The language model does not run itself. It cannot call its own tools, manage its own memory, enforce its own safety rules, or decide when to stop. Something else must do all of that. That something is the agent harness: the infrastructure code that wraps the language model and turns it from a text-completion engine into an autonomous agent.
If the agent is the pilot, the harness is the cockpit — the instruments, the controls, the autopilot logic, and the emergency overrides. A skilled pilot in a bad cockpit crashes. The same model behind a well-engineered harness versus a poorly-engineered one produces dramatically different results.
This chapter covers how to design and build that harness: the execution loop, tool dispatch, context management, guardrails, observability, and the production concerns that separate a working demo from a reliable system.
Why This Chapter Matters: When the Harness Breaks, the Model Looks Broken#
In April 2026, users across the internet reported that Claude Code had become noticeably "dumber." Responses were shallower, the agent forgot context mid-task, and tool choices felt random. Many assumed Anthropic had degraded the model. They hadn't. The postmortem revealed three harness-level bugs — none involving the model itself:
-
A thinking cache bug (deployed March 26, fixed April 10): A caching optimization malfunctioned, causing the system to clear the model's chain-of-thought reasoning on every turn instead of only once on session resumption. The agent was effectively lobotomized — forced to reason from scratch each iteration while appearing to have full context. Users experienced forgetfulness, repetition, and bizarre tool choices.
-
A reasoning effort downgrade (March 4 – April 7): The default reasoning effort parameter was silently shifted from
hightomediumas a latency optimization. Internal benchmarks showed marginal quality difference; real users noticed immediately. The lesson: latency gains that trade away intelligence are invisible to automated evals but obvious to humans doing real work. -
A system prompt constraint (April 16 – April 20): A single instruction — "keep text between tool calls to ≤25 words" — caused a measurable 3% drop in coding evaluations. A few words in the prompt, added to reduce verbosity, inadvertently constrained the model's ability to reason through complex code changes.
These incidents illustrate the central thesis of this chapter: the harness matters as much as the model. A frontier model behind a buggy harness — one that silently drops reasoning, misconfigures parameters, or routes requests to wrong infrastructure — will underperform a smaller model with a well-engineered harness. The model provides intelligence; the harness provides the conditions under which that intelligence can actually express itself. When the harness breaks, it does not throw an error. It makes the model look broken. And that is far harder to debug.
The Execution Loop#
Every agent harness has a central loop at its core. The model generates output; the harness inspects it; if the output contains tool calls, the harness executes them and feeds the results back; if the output is a final answer, the loop terminates. This is the heartbeat of every agent system, regardless of framework.
The Core Execution Loop
The harness runs a turn-based loop. On each iteration, it sends the full conversation (system prompt + history + tool results) to the LLM, then inspects the response. Three things can happen: the model returns a final answer (loop ends), the model requests tool calls (harness executes them and loops), or the model requests a handoff to another agent (harness switches and loops). A step counter enforces a hard ceiling on iterations.
Here is the loop in pseudocode — the pattern is identical across OpenAI Agents SDK, Vercel AI SDK, CrewAI, and Claude Code:
messages = [system_prompt, user_goal]
step = 0
while step < max_steps:
response = call_llm(messages, tools)
if response.has_final_text():
return response.text # Done
if response.has_handoff():
active_agent = response.target # Switch agent
messages.append(handoff_context)
step += 1
continue
for tool_call in response.tool_calls:
result = execute(tool_call.name, tool_call.args)
messages.append(tool_result(tool_call.id, result))
step += 1
raise MaxStepsExceeded()
The step limit is the most important safety mechanism in the entire harness. Without it, a confused or hallucinating model will loop until you hit an API rate limit or run out of budget. Every production harness must have one.
Tool Dispatch#
The execution loop calls tools, but how tools are dispatched, validated, and sandboxed is a design problem in its own right. The harness is responsible for three things: translating the model's tool call into an actual function execution, validating the inputs before running anything, and returning the result in a format the model can understand.
Tool Dispatch Architecture
When the LLM emits a tool call, the harness looks up the tool by name in a registry, validates the arguments against a schema (typically Zod or JSON Schema), executes the function in a controlled environment, and appends the result to the conversation. The registry pattern decouples tool definitions from the execution loop — tools can be added, removed, or modified without changing the loop.
Tool Design Principles for Harness Engineers#
The AI Agents chapter covered tool design from the model's perspective — how to name and scope tools so the model uses them correctly. From the harness engineer's perspective, there are additional concerns:
| Concern | What It Means | How to Handle It |
|---|---|---|
| Idempotency | Read tools (file_view, grep, ls) can be called repeatedly without side effects. Write tools (file_edit, shell_exec) may not be safe to retry | Mark tools as read-only or mutating. Implement confirmation steps or dry-run modes for mutating tools. Log every mutation for auditability |
| Result sizing | A tool that returns a 200KB file or a 10,000-row query result can consume the entire context window in one call | Set maximum result sizes per tool. Truncate with a clear marker ('... truncated, showing first 200 lines of 5,000'). Let the model request more if needed |
| Timeout handling | A web request or code execution that hangs will block the entire agent loop indefinitely | Set per-tool timeouts. Return timeout errors as tool results so the model can decide how to proceed (retry, skip, or use an alternative approach) |
| Error as information | When a tool fails, the error message is valuable — it tells the model what went wrong so it can self-correct | Never swallow errors silently. Return the error message as the tool result. The model can often fix its approach based on the error (e.g., fixing a typo in a file path) |
| Sandboxing | Shell execution tools can run arbitrary commands. Without sandboxing, a confused model could delete files or exfiltrate data | Run mutating tools in sandboxed environments. Implement permission systems (allow-lists, confirmation prompts). Claude Code uses a permission mode system with escalating trust levels |
CLI Is All You Need#
Before diving into the remaining harness components — context management, guardrails, observability, error handling — it is worth stepping back to ask a more fundamental question: what interface should developers use to interact with an agent? The answer, for coding agents specifically, is overwhelmingly the command line.
LLMs are text-native. They consume text and produce text. The terminal is the purest text interface available — there is no impedance mismatch between what the model generates and what the developer sees. A GUI adds a translation layer that must render, re-render, and interpret model output; a CLI simply prints it. This directness is not a limitation — it is a feature.
Why CLI Wins for Agent Interfaces
The terminal provides composability, scriptability, sandboxing, and editor agnosticism for free — capabilities that GUI-based agents must build from scratch. The most successful coding agents (Claude Code, Codex CLI, Gemini CLI) all converged on CLI-first design independently, suggesting this is a natural fit rather than a stylistic preference.
The convergence is striking: Claude Code, Codex CLI, and Gemini CLI were all built by different organizations with different models, yet they all arrived at the same CLI-first architecture. This is not coincidence — it is the natural consequence of building agents on top of text-native models for a developer audience that already lives in the terminal.
| CLI Agent | Key Design Choice |
|---|---|
| Claude Code | Explicitly designed as a Unix utility. Supports piping (cat file | claude), non-interactive mode (claude -p), parallel sessions, and CLAUDE.md project context files. Runs in containers and worktrees for sandboxing |
| Codex CLI | OpenAI's terminal agent. Runs entirely in the terminal with full shell access. Sandbox modes for safe execution. Designed for autonomy with human oversight via approval prompts |
| Gemini CLI | Google's entry. Self-described as 'the most direct path from your prompt to our model.' Lightweight, no browser or Electron overhead, just a terminal process |
The practical implication for harness engineering: design your agent as a CLI tool first, then add GUI layers on top. The CLI forces you to get the core loop, tool dispatch, and context management right without hiding problems behind a visual interface. If the agent works well in a terminal, adding a web UI or IDE extension later is a presentation concern — not an architecture change.
Context Management#
Context is the scarcest resource in an agent system. Every tool result, every reasoning step, and every piece of conversation history consumes tokens from a finite window. When the window fills up, the model either loses access to earlier information, or the harness must intervene to compress or evict content. How the harness manages this resource determines how long and how complex a task the agent can handle.
Context Management Strategies
As an agent works through a multi-step task, the conversation history grows with each iteration. The harness must manage this growth to prevent context overflow. Four primary strategies exist, each trading off information preservation against token efficiency.
Practical Context Management Patterns#
Persistent instructions belong outside the conversation. System prompts, project conventions, and tool descriptions should be loaded from configuration files (like CLAUDE.md or system prompt templates) rather than injected as user messages. This ensures they survive compaction and are always visible to the model regardless of how many conversation turns have elapsed.
Tool results are the biggest context consumers. A single file read or grep result can return thousands of tokens. Effective harnesses truncate tool results at the source — return the first 200 lines of a file rather than the full 5,000 — and let the model request more if needed. This is a form of lazy loading for context.
Pre-step hooks (called prepareStep in Vercel AI SDK, call_model_input_filter in OpenAI Agents SDK) allow the harness to modify the message array immediately before each LLM call. This is the right place to implement dynamic context pruning: drop tool results from steps that are no longer relevant, summarize long outputs, or inject fresh retrieval results.
Guardrails#
An agent that can call tools and take actions in the world needs safety boundaries. Guardrails are the harness components that validate inputs and outputs, enforce policies, and halt execution when something goes wrong. Without guardrails, a model that hallucinates a dangerous command or generates inappropriate content will execute it — the model has no built-in concept of "I should not do this."
Guardrail Architecture
Guardrails operate at three points in the execution loop: before the model sees the user input (input guardrails), before a tool executes (tool guardrails), and after the model produces its final answer (output guardrails). Each guardrail can either pass, modify, or trip — where tripping halts execution immediately.
Permission Models#
Production agent harnesses use tiered permission systems rather than a binary allow/deny. Claude Code's model is representative:
| Trust Level | Behavior | Use Case |
|---|---|---|
| Plan mode | Agent can only read and reason — all mutations blocked. Useful for reviewing what the agent would do before letting it act | Unfamiliar codebases, high-risk changes, auditing agent behavior |
| Default (confirm) | Read operations run freely. Mutating tools (file edit, shell exec) require explicit user approval per invocation | Most interactive development work — the user stays in the loop |
| Auto-accept | All tool calls execute without confirmation. The agent works fully autonomously | Trusted tasks with good test coverage, CI/CD pipelines, batch operations |
| Allowlist | Specific tools or commands are pre-approved. Everything else requires confirmation | Fine-grained control — approve safe commands like 'npm test' while requiring review for 'rm' or 'git push' |
The permission model is a spectrum between safety and speed. Moving toward more autonomy requires compensating controls: good test suites, sandboxed environments, version control, and the ability to roll back.
Observability#
Agent systems are inherently non-deterministic. The same input can produce different tool call sequences, different reasoning paths, and different final answers across runs. Traditional logging — recording inputs and outputs — is necessary but insufficient. Agent harnesses need structured tracing that captures the full execution tree: every LLM call, every tool invocation, every decision point, with timing and token usage at each step.
Agent Observability Stack
A production agent harness instruments three layers: traces (end-to-end request journeys), spans (individual operations within a trace), and metrics (aggregated measurements). This mirrors distributed systems observability but adapted for the non-deterministic nature of agent execution.
Key Metrics for Agent Systems#
| Metric | What It Tells You | Warning Signs |
|---|---|---|
| Steps per task | How many iterations the agent needs to complete a task. Lower is better — fewer steps means less cost and latency | Consistently hitting max_steps, or step count increasing over time for similar tasks |
| Tokens per task | Total input + output tokens across all LLM calls. Dominated by input tokens since context grows each step | Token usage growing superlinearly with task complexity, or single tool results consuming >30% of the context window |
| Tool error rate | Percentage of tool calls that fail (invalid args, timeouts, execution errors). The model self-corrects many errors, but high error rates waste iterations | Error rate above 10%, or the same tool failing repeatedly in the same way (indicates a tool design problem, not a model problem) |
| Task success rate | Percentage of tasks completed satisfactorily. Requires either automated evaluation or human review | Success rate dropping after a model update, a new tool addition, or a system prompt change — these are the most common causes of regressions |
| Latency distribution | End-to-end time from user request to final answer. Highly variable for agents due to variable step counts | P95 latency exceeding user tolerance, or bimodal distribution (fast successes + very slow failures — the slow tail is where the agent is stuck in retry loops) |
| Cost per task | Dollar cost of all LLM API calls for one task. Track per-task and per-user for budgeting | Average cost increasing over time, or occasional outlier tasks that cost 10–50× the median (runaway loops or context explosion) |
Error Handling and Recovery#
Agents fail in ways that traditional software does not. A web server either processes a request or returns an error. An agent can partially succeed — completing 8 of 10 steps correctly — and then fail in a way that corrupts the work done so far. The harness must handle not just tool-level errors but also agent-level failure modes: the model going off track, context overflow, and cascading mistakes.
Error Handling Layers
A robust agent harness handles errors at three levels: tool errors (a single tool call fails), loop errors (the agent gets stuck or goes off track), and system errors (the LLM API itself fails). Each level requires a different recovery strategy.
Checkpointing and Durable Execution#
For long-running agent tasks (minutes to hours), the harness should save progress at meaningful checkpoints — typically after each successful tool execution or at task boundaries. If the process crashes, the agent can resume from the last checkpoint rather than starting over.
| Strategy | How It Works | When to Use |
|---|---|---|
| Conversation checkpointing | Save the full message array (system prompt + history + tool results) to persistent storage after each step or after each successful task completion | Any agent task that takes more than a few minutes — the cost of re-running from scratch exceeds the cost of persistence |
| Durable execution frameworks | Integrate with Temporal, Restate, or DBOS to get automatic state persistence, crash recovery, and exactly-once execution guarantees for tool calls | Production systems where agent tasks run unattended (CI/CD, batch processing, background workers) and must complete reliably |
| Idempotency tokens | Assign unique IDs to mutating tool calls so that retried calls do not produce duplicate side effects (duplicate file writes, duplicate API calls) | Any agent that performs write operations — without idempotency, crash recovery can create duplicate changes |
Memory System#
Every agent conversation is stateless by default — when a session ends, everything the model learned about the user, the project, and past decisions vanishes. The next session starts from zero. The user must re-explain their role, re-state their preferences, and re-correct the same mistakes. This is not a minor UX annoyance; it is a fundamental limit on how useful an agent can be over time.
A memory system solves this by persisting key facts across conversations. It is not a chat history replay — it is a curated knowledge base that stores what matters and discards what can be derived from code, git history, or documentation.
Memory System Architecture
The memory system operates through two channels: an always-on index loaded into every conversation's system prompt, and a per-query semantic recall that selects relevant memories based on what the user is currently asking. Memories are stored as markdown files on disk — simple, inspectable, and version-controllable.
What Gets Stored (and What Doesn't)#
The key insight in memory system design is that memory should only store what cannot be derived from the current state of the codebase. Code patterns, file structure, and git history are always available by reading the project — storing them in memory creates stale duplicates that conflict with reality.
| Memory Type | What It Captures | Example |
|---|---|---|
| User | Role, expertise, goals, preferences — who the user is and how to tailor responses | 'User is a data scientist, deep Python expertise, new to TypeScript — frame TS explanations in terms of Python analogues' |
| Feedback | Corrections AND confirmations — both what to stop doing and what to keep doing | 'Don't mock the database in tests — prior incident where mocks masked a broken migration. Use the test fixture instead' |
| Project | Decisions, deadlines, incidents — context not derivable from code or git | 'Merge freeze begins March 5 for mobile release cut. Auth rewrite is driven by legal/compliance, not tech debt' |
| Reference | Pointers to external systems — where to find information outside the codebase | 'Pipeline bugs tracked in Linear project INGEST. Oncall dashboard at grafana.internal/d/api-latency' |
What is explicitly not stored: code conventions (read the code), architecture decisions (read ADRs or CLAUDE.md), git history (run git log), debugging solutions (the fix is in the code), or ephemeral task state (use a task list instead). These exclusions prevent memory from becoming a stale mirror of information that has a better source of truth.
Putting It All Together#
The components covered in this chapter — execution loop, tool dispatch, context management, guardrails, observability, error handling, and memory — are not optional features. They are the minimum viable harness for a production agent. Missing any one of them creates a specific class of failure:
| Missing Component | What Fails |
|---|---|
| No step limit | Runaway loops that burn budget indefinitely |
| No schema validation | Silent tool failures from malformed arguments that cascade into wrong answers |
| No context management | Tasks fail or degrade after 10–15 steps as the context window fills up |
| No guardrails | The model executes dangerous or inappropriate actions with no safety net |
| No observability | You cannot debug failures, measure costs, or identify performance regressions |
| No error recovery | A single tool failure or API hiccup kills the entire task, losing all progress |
| No memory | Every session starts from zero — users re-explain preferences, re-correct mistakes, and the agent never learns from past interactions |
The Complete Agent Harness
All six components working together form the agent harness. The execution loop is the spine; everything else wraps around it. Input guardrails filter before the model sees the request. The loop dispatches tools through the registry with schema validation. Context management keeps the conversation within token limits. Observability traces every step. Error handling ensures resilience at every layer.
Summary#
| Concept | Key Point |
|---|---|
| Agent harness | The infrastructure code that wraps an LLM and turns it into an autonomous agent — the execution loop, tool dispatch, context management, guardrails, observability, and error handling |
| Execution loop | Turn-based loop: send context to LLM, inspect response, execute tools or return final answer. Every framework implements this same pattern. Always enforce a step limit |
| Tool dispatch | Registry-based lookup with schema validation before execution. Return errors to the model as tool results for self-correction. Truncate oversized results to protect context |
| CLI-first design | The terminal is the natural interface for coding agents — text-native, composable via Unix pipes, scriptable for CI/CD, sandboxable in containers, and decoupled from any specific editor |
| Context management | The scarcest resource. Strategies: sliding window truncation, auto-compaction via summarization, subagent delegation, and external memory. Persistent instructions belong in config, not conversation |
| Guardrails | Input, tool, and output guardrails that validate, filter, or halt execution. Permission models range from plan-only (read-only) to full auto-accept, with allowlists for fine-grained control |
| Observability | Structured tracing (traces → spans → metrics) captures the full execution tree. Key metrics: steps per task, tokens per task, tool error rate, task success rate, cost per task |
| Error handling | Three layers: tool errors (return to model for self-correction), loop errors (detect stuck agents, intervene), system errors (retry with backoff, fall back to alternate model) |
| Memory | File-based persistence of user preferences, feedback, project context, and external references across sessions. Semantic recall selects relevant memories per query. Survives compaction because memories live on disk, not in the conversation transcript |
| Checkpointing | Save conversation state at meaningful boundaries so long-running tasks can resume after failures instead of restarting from scratch |
| Framework choice | Start with a framework (Vercel AI SDK, OpenAI Agents SDK, LangGraph, CrewAI). Understand the raw loop first, then use the framework — you cannot debug what you do not understand |
The most important insight in agent harness engineering is that the harness matters as much as the model. A frontier model with no step limit, no context management, and no error handling will fail on tasks that a smaller model with a well-engineered harness handles reliably. The model provides the intelligence; the harness provides the reliability. You need both — invest in them equally.
Sources:
- Building Effective Agents — Anthropic Research
- How Claude Code Works — Anthropic
- OpenAI Agents SDK Documentation
- Vercel AI SDK — Building Agents
- CrewAI Documentation
- AutoGen Architecture — Microsoft
- LLM Powered Autonomous Agents — Lilian Weng
- LangChain: What Is an Agentic System
- Amazon Bedrock: How Agents Work