AI Agents
A traditional web server does exactly what you tell it to: receive a request, run code, return a response. The path through the code is fixed. When a language model is integrated into that system, however, it can do something fundamentally different: it can decide what to do next, call tools to gather information, observe the results, and then decide again — all without a human in the loop.
This is the core idea behind AI agents: systems where a language model drives the control flow rather than just answering a single question. Instead of one prompt → one response, an agent runs a loop — reasoning, acting, observing — until a goal is achieved.
Before exploring agents in depth, most production systems encounter a more foundational infrastructure question: how do you manage access to multiple language models in the first place?
LLM Gateways#
A real application rarely uses a single model from a single provider forever. You might start with GPT-4o for its reasoning quality, but switch to Claude for tasks where it performs better, or fall back to a smaller open-source model when cost is a constraint. You might want to rate-limit certain users, track spend per feature, or retry failed calls automatically.
An LLM gateway is a service that sits between your application and the underlying model providers. It standardizes the API surface so your code doesn't need to know which provider it is talking to, and adds cross-cutting concerns — routing, fallback, cost tracking, rate limiting — in one place.
LLM Gateway Architecture
Without a gateway, each service calls model providers directly, duplicating auth, retry, and logging logic everywhere. With a gateway, all LLM traffic flows through a single control plane that enforces policy, routes requests, and observes costs uniformly.
Popular gateway options:
| Tool | Type | Key Strengths |
|---|---|---|
| LiteLLM | Open-source SDK + proxy server | Supports 100+ providers with one unified OpenAI-compatible API. Built-in retry, fallback, load balancing, cost tracking, and per-user rate limiting. Self-hostable proxy server for teams |
| OpenRouter | Managed cloud gateway | Aggregates 200+ models from dozens of providers into one API endpoint. Pay-per-use, no infrastructure to run. Good for accessing obscure or experimental models |
| Portkey | Managed cloud gateway | Enterprise focus: detailed observability, prompt versioning, A/B testing, guardrails, and a visual dashboard. Strong on auditability and compliance requirements |
| Custom proxy | Self-built | Full control. Start here only when you have requirements that off-the-shelf gateways cannot meet — the operational cost of maintaining a custom gateway is high |
The most common starting pattern with LiteLLM: wrap your model calls behind LiteLLM's unified interface and configure a fallback list. If the primary model fails (rate limit, outage, timeout), LiteLLM automatically retries the next model in the list — no custom retry logic required. This is the minimum viable gateway for most teams.
What Is an AI Agent?#
A chatbot answers one question per call. You send a message, it sends a reply, and the interaction is complete. The model has no ability to take actions in the world or to run additional reasoning steps based on what it learned.
An AI agent is different: it is a system where the language model runs in a loop, calling tools, observing results, and reasoning again until a goal is complete. The model controls the flow of execution. Instead of your code deciding what to do at each step, the model decides — your code only provides the tools and runs the loop.
The Agent Loop: Observe → Think → Act
An agent runs in a cycle. On each turn, it observes the current state (conversation history, tool results, environment data), thinks about what to do next, and acts by either calling a tool or producing a final answer. This loop continues until the goal is achieved or a stopping condition is met.
The ReAct pattern (Reason + Act) is the standard way to implement the reasoning step. Before taking any action, the model first writes out its reasoning in plain text — what it knows, what it still needs, and why it is choosing a particular tool. This scratchpad of reasoning becomes part of the context for the next iteration, helping the model stay consistent over many steps and making its decisions easier to inspect and debug.
[Iteration 1]
Thought: I need to find the top AI papers from this week. I'll search for recent publications.
Action: web_search("top AI research papers March 2026")
Observation: [search results returned with 10 papers listed]
[Iteration 2]
Thought: I found 10 papers. The top 3 by citation count are X, Y, Z.
Now I need to get the abstract for each.
Action: web_fetch("https://arxiv.org/abs/...")
...
Without the explicit reasoning step, agents tend to make uncoordinated tool calls and lose track of their overall progress. With it, they produce an auditable trail of why each decision was made.
The Three Core Properties of an Agent#
Every AI agent, regardless of the framework or model used, is built from three building blocks: tool use, memory, and planning.
Tool Use#
A language model on its own can only produce text. Tools give it the ability to act in the world: run code, read files, call APIs, query databases, browse the web, or send messages. Tools are the only way an agent can affect state outside the model itself.
From a system design perspective, tools are defined by three things:
- A name and description — the model reads this description to decide when to use the tool. A vague description leads to the model calling the wrong tool or missing the right one.
- A typed input schema — the model generates a JSON object matching the schema. The schema is validated before the tool runs.
- An execution function — your code that runs when the model calls the tool and returns a result.
The result is appended to the conversation context as a "tool result" message, and the model reads it on the next iteration. This is the mechanism that connects the model's reasoning to real-world state.
Tool design is where agents succeed or fail. If a tool name is ambiguous, the model will misuse it. If a tool's inputs are too broad, the model will produce invalid calls. If a tool returns too much data (a 50,000-token API response), it eats the entire context window. Treat tool design with the same care as an API you would expose to a user.
| Tool Design Principle | Good Example | Bad Example |
|---|---|---|
| Specific names | search_product_catalog(query, max_results) | search(input) — too vague, model can't tell what this searches |
| Scoped return values | Return the top 5 results with only the fields the model needs | Return the entire API response including every field — bloats context |
| Atomic operations | create_calendar_event(...) — does one thing | manage_calendar(action, ...) — does multiple things via a parameter |
| Idempotent where possible | Read operations: always safe to retry | Write operations without deduplication: retries cause double-sends, duplicate records |
Memory#
An agent's memory is split into two tiers that work very differently.
In-context memory is the conversation history in the current context window (the maximum amount of text the model can read and process in a single call). It is fast and immediately available to the model — everything in the window is visible to the model on every iteration. But it is strictly finite (bounded by the context window size), it disappears when the session ends, and it gets expensive as it grows. This is the agent's "working memory" for the current task.
External (long-term) memory is stored outside the model — in a vector database, a relational database, or a key-value store. It persists across sessions and scales without bound. But accessing it requires an explicit retrieval step — either a direct tool call or a RAG (Retrieval-Augmented Generation) lookup, where the agent queries a vector database for semantically relevant information — which adds latency and an extra point of failure. This is the agent's "long-term memory" for facts, user preferences, past decisions, and knowledge that should survive session boundaries.
A practical note on memory: Most agent frameworks default to keeping the entire conversation history in context. This works for short tasks but degrades for long ones — at some point the history fills the window, the cost per call escalates, and the model starts losing track of early context. For production agents handling extended sessions, apply the same strategies covered in the Context as a Resource tutorial: sliding window truncation, conversation summarization, or external memory retrieval.
Planning#
Planning is the ability to break a complex goal into a sequence of subtasks and execute them in the right order. Without planning, an agent can only react to the immediate situation. With planning, it can pursue multi-step goals that require preparation, backtracking, and error recovery.
For simpler tasks, planning happens implicitly — the model's reasoning step produces a natural sequence of tool calls without any explicit plan. For more complex tasks, agents often benefit from an explicit planning step: before taking any action, the agent generates a written plan, then executes it step by step, checking after each step whether the plan still makes sense.
A key design decision: how much trust do you put in the plan? If an agent creates a 10-step plan and commits to it, it may complete steps 3–10 based on wrong assumptions from step 2. Production agents often re-evaluate after each step — treating the plan as a living document rather than a fixed program. This adds iterations (and cost) but prevents error propagation.
Multi-Agent Patterns#
Single agents have practical limits. A context window can only hold so much history, a single agent instance cannot process multiple subtasks simultaneously, and a monolithic agent that handles every kind of subtask becomes hard to test and improve. Multi-agent systems address these limits by coordinating multiple specialized agents.
The same way a microservices architecture splits a monolith into focused services, a multi-agent architecture splits a complex task into focused agents — each doing one thing well, with defined interfaces between them.
Multi-Agent Topologies
There are four primary ways to connect agents together. Each topology is appropriate for a different class of problem. Choosing the wrong topology leads to unnecessary complexity, latency, or failure modes.
The Four Topologies in Detail#
Sequential (Pipeline): Agents are chained in a fixed order. Agent A produces output, Agent B transforms it, Agent C validates it. Each stage can be optimized independently — you can use a fast, cheap model for research and a powerful model only for the final writing step. The failure mode is error propagation: a hallucination in step 1 flows silently through every subsequent stage.
Parallel (Fan-out / Fan-in): An orchestrator splits a task into independent subtasks and dispatches them to workers simultaneously. Workers run in parallel. The orchestrator merges the results. This is the right pattern when subtasks are truly independent — if worker B's output depends on worker A's output, they are not independent and parallel dispatch is wrong. The failure mode is partial failure: if one of five workers fails, you must decide whether to retry that worker, skip it, or abort the whole task.
Evaluator-Optimizer: One agent generates output; a separate agent evaluates it and provides feedback; the generator revises. The loop continues until the evaluator approves or a maximum iteration count is reached. This pattern is powerful for tasks with clear quality criteria — code that must pass tests, text that must meet a rubric, translations that must preserve specific terminology. The failure mode is oscillation: the generator and evaluator can cycle indefinitely if the criteria are contradictory or the evaluator gives vague feedback that the generator cannot act on.
Orchestrator-Workers: An orchestrator agent dynamically plans which workers to use and in what order, based on what it discovers at runtime. Unlike the fixed sequential pipeline, the plan is not predetermined — the orchestrator adapts. Workers are typically specialists: one for web search, one for code execution, one for database queries. This is the most flexible topology and the most common in production AI coding agents (like Claude Code). The failure mode is the orchestrator making a bad plan that spawns unnecessary or circular work.
Failure Modes Unique to Multi-Agent Systems#
Single-agent systems fail when the model hallucinates, calls the wrong tool, or runs out of context. Multi-agent systems have all those failure modes plus three new ones that emerge from the interaction between agents.
| Failure Mode | What It Looks Like | How to Prevent It |
|---|---|---|
| Cascading errors | Agent A returns a subtly wrong result. Agent B consumes it without validation and builds on it. Agent C does the same. The final output is confidently wrong, and the trace of the original mistake is buried in agent A's output three steps back | Add validation checkpoints between stages. Have the orchestrator verify critical outputs before passing them downstream. Log each agent's input and output separately so failures can be traced to their origin |
| Diverging state | Two parallel workers both read and then modify the same shared resource (a file, a database record, a configuration). They each make changes based on stale data. Their writes conflict. The final state reflects whichever write happened last — not the intended merged state | Treat shared state as a coordination problem, not a data problem. Use optimistic locking, event sourcing, or route all writes through a single agent with exclusive access to the resource |
| Infinite loops | The evaluator agent keeps rejecting the generator agent's output without providing actionable feedback. The generator keeps trying different approaches. The loop runs until you hit a token budget limit or a wall-clock timeout — having consumed significant cost without producing output | Always set a maximum iteration count for any feedback loop. Monitor loop depth as a metric. Design the evaluator to produce specific, actionable feedback — 'this is wrong' is not enough; 'this is wrong because X, fix it by doing Y' breaks the loop |
When to Use Multi-Agent vs. Single Agent#
The default should always be: start with a single agent and add more only when you have a concrete reason to.
| Signal | Recommendation |
|---|---|
| The task fits in one context window and requires no parallelism | Single agent — simpler, cheaper, easier to debug |
| The task has a clear sequential structure with 2–4 stages, each suited to a different model or prompt | Sequential pipeline — low coordination overhead |
| The task decomposes into ≥ 3 truly independent subtasks that together exceed latency tolerance | Parallel fan-out — only when parallelism gives a meaningful speedup |
| Output quality requires iterative refinement with explicit evaluation criteria | Evaluator-optimizer — budget for 2–5 iterations maximum |
| The task is open-ended, the required subtasks cannot be known upfront, and workers are well-defined specialists | Orchestrator-workers — most complex, highest operational overhead |
| The agents need to share mutable state or call each other in a cycle | Redesign — this is a code smell; circular dependencies between agents produce unpredictable behavior |
The practical rule from Anthropic's own research: in their SWE-bench coding agent, the team spent more time designing and optimizing individual tools than designing the overall multi-agent architecture. Good tools with a single agent often outperform a sophisticated multi-agent system with poorly designed tools. Start with the tools. Add agents when the tools are solid.
Putting It Together: The Request Path for Intelligence#
When a user sends a request to an AI-powered system in production, it flows through all the components covered in this tutorial. The orchestrator receives the user request directly, coordinates workers and memory retrieval, and each agent makes its LLM calls through the shared gateway. The gateway is not on the user request path — it is on the model call path, used by every agent in the system whenever they need to reason.
The Full AI Request Path
The LLM Gateway is shared infrastructure for model calls, not a front door for user requests. Every agent — orchestrator, workers, evaluator — routes its LLM API calls through the gateway. The user request goes directly to the orchestrator, which coordinates the rest.
A note on agent frameworks: LangChain, LlamaIndex, CrewAI, AutoGen, and others abstract the agent loop, tool registration, and multi-agent coordination into higher-level APIs. Frameworks accelerate prototyping but add abstraction layers that can obscure what the model is actually doing and make debugging harder. The recommendation from practitioners building production agents: understand the raw loop first — you can always add a framework on top of something you understand, but frameworks are hard to debug when you do not know what they are hiding.
Summary#
| Concept | Key Point |
|---|---|
| LLM gateway | A single control plane for all LLM traffic: unified API surface, routing, fallback, rate limiting, cost tracking. LiteLLM is the most common open-source choice |
| What is an agent | An LLM running in a loop — Observe → Think → Act — that controls the flow of execution. Unlike a chatbot, it calls tools and reasons over results across multiple iterations |
| ReAct pattern | The model writes out its reasoning before each action. This reasoning becomes part of the context, keeping the agent consistent over many steps and producing an auditable trace |
| Tool use | The only way an agent affects the world. Good tool design — specific names, scoped returns, atomic operations — is the single biggest lever on agent reliability |
| In-context memory | Fast, immediately available, but finite and ephemeral. Everything in the context window is visible to the model. Grows with each iteration |
| Long-term memory | External store (vector DB, relational DB, key-value). Persists across sessions, scales without bound, requires an explicit retrieval step |
| Sequential pipeline | Fixed stages in order. Best for tasks with a clear, predetermined sequence. Failure mode: error propagation |
| Parallel fan-out | Independent subtasks run simultaneously, merged at the end. Best for truly independent workloads. Failure mode: partial failure handling |
| Evaluator-optimizer | Generator + evaluator in a feedback loop. Best for tasks with explicit quality criteria. Failure mode: oscillation — set a max iteration count |
| Orchestrator-workers | Dynamic planning with specialist workers. Best for open-ended tasks where the plan cannot be known upfront. Most complex topology |
| Multi-agent failure modes | Cascading errors (validate between stages), diverging state (coordinate writes), infinite loops (max iteration limits) |
| When to go multi-agent | Default to a single agent. Add agents when you have a concrete reason: parallelism, specialization, or scale that a single context window cannot handle |
The engineers who build reliable AI agent systems do not start with the most sophisticated architecture available — they start with the simplest one that works and add complexity only when they can measure the benefit. Agents are powerful because they are flexible; that same flexibility is what makes them expensive to debug when they fail. Build simple, measure constantly, and add agents only when the tools are solid.
Sources: