Observability

When something goes wrong in production, the first question is always: "What is actually happening inside my system right now?"

Observability is the ability to understand a system's internal state by examining the signals it emits — its logs, metrics, and traces. It is distinct from traditional monitoring, which watches for known failure modes (e.g., "alert me when CPU > 90%"). Monitoring tells you that something broke. Observability lets you ask why — even for failures you have never seen before, including emergent behaviors introduced by AI-generated code or an unpredictable LLM.

This distinction matters more than ever as AI agents become a core part of the software stack. When an AI agent calls an external LLM, executes a tool, and writes to a database — all in one turn — a single slow database query can cause a response to exceed your latency budget without any obvious alert firing. You need observability to trace exactly which step was slow, across which services, for which user.

AI agents also have a consistent gap: they generate functionally correct application code but almost never add observability. They do not instrument functions with traces, do not track token usage or cost, and do not record what prompts and completions flow through an LLM call. Without explicit instructions, the AI produces a system you cannot reason about in production.

This section covers the three core techniques you need to build observable systems:

  • The Three Pillars — Logs, Metrics, and Traces: what they each tell you and how they work together
  • Causal Tracing — how to follow a single request through multiple services and an AI agent
  • SLIs/SLOs — how to define and measure what "good" means, specifically for AI-powered systems

The Three Pillars#

Every observable system emits three types of telemetry data. They answer different questions and serve different purposes during an incident.

The Three Pillars of Observability

Logs, metrics, and traces are not alternatives — they are complementary. Metrics detect that something is wrong. Traces show which service or step is responsible. Logs provide the detailed context to understand why. In a well-instrumented system, an alert fires on a metric, you click through to the relevant trace, then drill into the specific logs for that request.

Rendering diagram...

Logs: The Event Record#

A log is a timestamped record of a discrete event — a request arrived, a query ran, an error occurred. Logs are the oldest observability tool and the most familiar: they answer "what happened at this moment?"

Modern production systems use structured logging — each log line is a JSON object rather than a free-form string. Structured logs are machine-queryable: you can filter all error logs for a specific user_id, or find all logs where llm_provider is anthropic and tokens_used exceeds 4,000, in seconds.

An unstructured log is hard to query at scale:

ERROR 2026-03-13 14:32:07 LLM call failed for user 1234: rate limited

A structured log is queryable, filterable, and alertable:

{
  "level": "error",
  "timestamp": "2026-03-13T14:32:07.412Z",
  "trace_id": "abc123def456",
  "service": "ai-gateway",
  "event": "llm_call_failed",
  "user_id": "1234",
  "llm_provider": "anthropic",
  "error_code": 429,
  "error_message": "rate_limit_exceeded",
  "tokens_attempted": 3800
}

Log levels are a signal-to-noise dial:

LevelMeaningProduction Guidance
ERRORSomething failed that requires investigation — a request failed, an API call errored, data was not writtenAlert on sustained error rates; every ERROR should be actionable
WARNSomething unexpected happened but the system recovered — a retry succeeded, a cache miss fell back to DBMonitor trends; a rising WARN rate often precedes ERROR spikes
INFONormal operational events — a request completed, a job finished, a user authenticatedKeep INFO sparse in production — log key lifecycle events, not every function call
DEBUGVerbose details useful during development — intermediate values, state transitions, SQL queriesDisable in production by default; enable temporarily per-service when investigating a specific issue

Metrics: The Health Dashboard#

A metric is a numerical measurement collected over time — error rate, request count, p99 latency, CPU usage, token cost per hour. Metrics answer "is the system healthy right now?" and "is it trending worse?"

Unlike logs, metrics are not per-event records. A metrics system stores one numeric measurement per collection interval (typically every 15–60 seconds), regardless of how many individual events occurred during that window. A histogram, for example, records how many requests fell into each latency bucket per interval — not one entry per request. This makes metrics extremely efficient to query and display on dashboards: you can render a 7-day latency trend for a service with millions of requests in milliseconds.

The three most common metric types:

TypeWhat It MeasuresExample
CounterA value that only increases — total number of events since startupTotal requests served, total errors, total LLM tokens consumed
GaugeA value that can go up or down — a snapshot of current stateCurrent active connections, current queue depth, current memory usage
HistogramDistribution of values — how requests are spread across latency buckets, enabling percentile calculationp50/p95/p99 request latency, distribution of token counts per LLM call

Percentiles (p50, p95, p99) matter more than averages for latency metrics. An average can hide that 1% of users wait 10 seconds while the rest wait 100ms. The p99 latency (the 99th-percentile response time — meaning 99% of requests complete faster than this value) surfaces the worst-case experience that real users encounter.

Traces: The Request Journey#

A trace is the complete record of a single request as it travels through your system — across services, databases, queues, and external APIs. Traces answer "why was this specific request slow?" — not in aggregate, but for one real user's experience.

Every trace is built from spans. A span represents one unit of work: a single function call, a database query, an HTTP call to an external API. Spans have:

  • A trace ID — a shared identifier that links all spans belonging to the same request
  • A span ID — a unique identifier for this specific operation
  • A parent span ID — which span initiated this one, forming a tree
  • Start time and duration — when did it start and how long did it take
  • Attributes — key-value metadata (service name, HTTP status code, DB query, etc.)

The result is a waterfall diagram: a visual timeline of every step in a request, showing exactly where time was spent.

OpenTelemetry: The Instrumentation Standard#

Before OpenTelemetry, the observability landscape was fragmented. Each vendor (Datadog, New Relic, Splunk) offered its own proprietary SDK. While earlier open efforts like OpenTracing (2016) and OpenCensus (2018) existed, they were separate and incompatible standards. Instrumenting your code meant choosing a vendor or standard upfront — switching later required rewriting all your instrumentation.

OpenTelemetry (OTel), formed in 2019 from the merger of OpenTracing and OpenCensus, solves this with a single vendor-neutral, open standard for generating logs, metrics, and traces. Instrument your code once with the OpenTelemetry SDK; then send the data to any backend without changing your application code.

Rendering diagram...

OpenTelemetry is the most important observability investment a team can make: by standardizing on OTel today, you preserve the ability to switch backends, combine tools, and avoid vendor lock-in — without ever touching your application code again.

Causal Tracing: Following a Request Through Your System#

When your system is a single server calling a single database, debugging is straightforward: search the log file, find the error. When your system spans five microservices, an AI agent, a vector database, and an external LLM provider, a single user request touches all of them — and a slow response could originate anywhere.

Causal tracing (distributed tracing) is the practice of tracking a single request end-to-end across all services, using a shared trace ID that propagates through every hop. When you open the trace for a slow request, you see a waterfall of exactly which service and which operation consumed the time.

How Trace Propagation Works#

The first service that receives the request generates a trace ID — a globally unique identifier (typically a 32-character hexadecimal string). This trace ID travels with every outgoing HTTP call as a request header — traceparent in the W3C standard, or a custom header like x-trace-id in older setups. Each downstream service reads the incoming header, creates a child span under the same trace ID, and forwards the header to any services it calls in turn.

The result: all spans share the same trace ID and can be assembled into a complete picture of the request.

Distributed Trace: User Request Through Microservices and an AI Agent

A user's search request flows through five services: API Gateway, Auth Service, AI Gateway, LLM Provider, and the Database. Each service creates a child span under the same trace ID. The waterfall view reveals that the LLM call accounts for 87% of the total response time — the root cause of the slow response.

Rendering diagram...

The trace ID in logs is the critical link: When engineers investigate an incident, they start with a metric alert ("error rate is 8%"), find a representative trace, and then search logs filtered by that trace's trace_id to see the raw event details. Without the trace ID in your log lines, you cannot make this connection. Always log the trace_id on every line in a request's handler — it is the thread that ties all three pillars together.

Tracing an AI Agent Turn#

An AI agent turn is not a single function call — it is a loop: observe context → call LLM → parse response → call a tool → observe result → call LLM again. The loop exists because the agent cannot complete its answer in one step when it needs external information: it must call a tool, wait for the result, and only then can the LLM decide what to do next. Each iteration produces a new set of spans, all sharing the same trace ID for the user's original request.

Rendering diagram...

In this trace, the waterfall will show two llm.call spans and one tool.execute span, all under the same trace_id: abc123. If the second LLM call is unexpectedly slow, you can see it immediately. Without this trace, you would only know that the overall agent response took too long — you would have no way to tell whether the delay was in the first LLM call, the tool execution, or the second LLM call.

Token accounting per span: For each llm.call span, record the input tokens, output tokens, and estimated cost as span attributes. In a multi-turn agent, token usage compounds across turns: each subsequent LLM call includes the full conversation history from previous turns as context. A trace that shows span 1 used 1,200 tokens and span 2 used 3,800 tokens tells you the tool result was very large — possibly because the database query returned far more data than the agent actually needed, inflating both cost and latency.

SLIs, SLOs, and SLAs: Defining "Good"#

Before you can know whether your system is working well, you need to agree on what "working well" means — precisely enough that you can measure it automatically. The SLI/SLO/SLA framework gives you the vocabulary to do this.

TermWhat It IsExample
SLI (Service Level Indicator)The actual measurement — a specific metric you collect to describe reliability or qualityThe fraction of API requests that complete in under 500ms over a rolling 30-day window
SLO (Service Level Objective)Your internal target — the threshold the SLI must meet for the service to be considered healthy99.5% of requests must complete in under 500ms (the SLI must reach this target)
SLA (Service Level Agreement)A formal contract with customers — the minimum acceptable level, with financial or legal consequences if broken99.0% uptime per month; customers receive service credits if this is breached
Error BudgetThe allowable room between your SLO and 100% — how much unreliability you can spend on deployments, experiments, and incidents before the SLO breaksWith a 99.5% SLO, the error budget is 0.5% of requests = ~3.6 hours of downtime per month

The relationship between the three: SLIs are what you measure. SLOs are what you target internally. SLAs are what you promise externally. A well-run team sets SLOs tighter than their SLA to maintain a buffer — if the SLA is 99.0% uptime, an internal SLO of 99.5% means the team can absorb some degradation before the customer contract is actually at risk.

Error budgets are the most operationally important concept: they convert the abstract question "how reliable is the service?" into the concrete question "how much can we safely ship this month?" A team with a full error budget can deploy aggressively. A team with an exhausted budget should freeze risky changes and focus on reliability work instead. This alignment is the real value: reliability is a finite resource, and spending it carelessly on bad deploys has a measurable cost that both engineering and product can see.

SLO Error Budget: Reliability as a Shared Resource

An error budget makes reliability concrete and actionable. When the budget is healthy, teams ship fast. When it is depleted, the team slows down and stabilizes. This turns a subjective debate ('should we deploy this risky change?') into an objective question ('do we have enough budget to absorb a 0.2% error rate increase if this deploy goes wrong?').

Rendering diagram...

SLIs for AI Systems#

Defining "good" for an AI system requires additional SLIs that do not exist in traditional services. A system where the LLM always responds within 200ms is not "good" if the responses are factually wrong or cost $50 per user session.

SLIWhat It MeasuresTarget ExampleWhy It Matters for AI
AvailabilityFraction of time the service responds to requests at all99.9% of requests receive a response (not a 5xx error)LLM providers have real outages; circuit breakers and fallbacks protect this SLI
Request latency (p99)99th-percentile end-to-end response time per request99% of requests complete in under 3 secondsAI responses are inherently slower than database queries — budget accordingly
Time-to-first-token (TTFT)How long the user waits before streaming output beginsTTFT under 800ms for 95% of requestsUsers tolerate long AI responses if streaming starts quickly — TTFT drives perceived speed
Error rateFraction of requests that result in an error (LLM failures, timeouts, parsing errors)Under 0.5% of requests return an error to the userLLM API rate limits and provider outages make error rates more volatile than in traditional services
Token cost per requestAverage LLM token spend per user request, in USDUnder $0.005 per search requestAgent loops and prompt regressions can silently multiply cost — track this as a cost SLI
Quality scoreFraction of responses rated acceptable by automated evaluation or human feedbackOver 90% of responses pass automated quality checksA fast, available system that gives wrong answers has failed — quality must be measured

Quality measurement is the hardest SLI to implement but the most important one unique to AI systems. There are three practical approaches:

  1. Automated evaluation (LLM-as-judge): A second LLM scores the primary model's responses against a rubric — for example, checking whether an answer is factually grounded in the provided context, or whether it follows the expected format. Fast and scalable, but requires careful prompt engineering and periodic calibration against human judgement to ensure the evaluator itself is accurate.
  2. Human feedback loops: Surface thumbs-up/thumbs-down ratings in the UI and aggregate them into a quality metric over time. Slower and sparser than automated evals, but provides ground-truth signal that can be used to calibrate and validate your automated evaluators.
  3. Retrieval precision (for RAG systems): Measure whether the retrieved context chunks were actually relevant to the query — a proxy for output quality that does not require evaluating the final answer text at all. If the retrieval step consistently returns irrelevant chunks, the LLM cannot produce a good response regardless of how capable it is.

Time-to-first-token deserves special attention. When an AI response streams progressively — each word appearing as it is generated, rather than the full text appearing at once — the user's perception of speed is dominated by how quickly that first word arrives, not by the total completion time. A response that starts streaming at 600ms and finishes at 4 seconds feels faster than a response that is silent for 3 seconds and then appears in full. The TTFT SLI captures this perceptual reality directly, making it a more user-aligned measure of AI response quality than end-to-end latency alone.

What AI Agents Miss#

When you ask an AI agent to build an API endpoint, it will produce the handler, the input validation, and the database call. It will almost never produce the observability layer. This is not a random omission — it is a consistent and predictable gap. Observability is cross-cutting infrastructure, not part of the feature's functional requirements, and agents optimize for functional correctness over operational readiness.

What AI Agents SkipWhat You Need to AddImpact of Skipping It
Trace instrumentation on service callsAdd a child span for every external call: database, HTTP, LLM, queueSlow requests cannot be diagnosed — you know something is slow but not where
Structured log lines with trace IDsLog key events (request received, LLM called, error occurred) with trace_id in JSON formatLogs cannot be correlated to traces; errors are disconnected from their context
LLM call metrics (tokens, cost, latency)Record input tokens, output tokens, model name, and cost as span attributes and metrics on every LLM callToken costs are invisible until the bill arrives; agent loops go undetected
SLO-aligned metric countersInstrument request latency histograms and error rate counters at every service entry pointYou cannot define or enforce SLOs without the underlying measurements
Trace propagation headersForward the trace ID header (traceparent) in all outgoing HTTP callsDownstream services appear as disconnected traces — the full request journey is invisible
AI quality loggingLog the prompt, completion, and a quality signal (user feedback or eval score) for sampled LLM callsQuality regressions after a model update or prompt change go undetected until user complaints

The practical workflow: after an AI agent generates a service, run through this checklist before accepting it. The agent can add all of this if you ask explicitly — for example: "Add OpenTelemetry tracing with a child span for each database call and LLM call. Include token count and estimated cost as span attributes. Log all errors as structured JSON with the trace ID. Add a request latency histogram metric at the service entry point." Without this level of specificity, the agent will produce code that works but cannot be operated in production.

Summary#

ConceptKey Point
Observability vs. monitoringMonitoring detects known failures. Observability lets you investigate unknown ones — including emergent behaviors from AI-generated code or unpredictable LLMs.
Three pillarsMetrics alert you that something is wrong. Traces show which service or step is responsible. Logs explain why. Always correlate them with a shared trace ID.
Structured loggingLog JSON, not free-form strings. Always include trace_id. Use ERROR/WARN/INFO levels deliberately — avoid DEBUG in production.
Percentiles over averagesUse p95 or p99 for latency SLIs. Averages hide the worst-case user experience that real users encounter.
OpenTelemetryInstrument once, send to any backend. The industry standard — adopting it early prevents vendor lock-in and enables backend switching without code changes.
Trace ID propagationThe trace ID must travel as an HTTP header to every downstream service. A service that does not forward it breaks the causal chain.
AI agent tracingEach LLM turn and tool call is a child span under the same trace ID. Record input/output tokens and cost as span attributes on every LLM call.
SLI/SLO/SLASLI = what you measure. SLO = your internal target. SLA = your customer contract. Set SLOs tighter than SLAs to maintain a buffer.
Error budgetThe allowable unreliability between your SLO and 100%. Use it to make deploy decisions objective: full budget = ship fast; depleted = stabilize first.
AI-specific SLIsAdd time-to-first-token, token cost per request, and quality score to your standard latency/availability/error-rate SLIs. Cost and quality regressions are invisible without them.
AI agent gapAI agents almost never add observability. Explicitly instruct the agent: 'Add OTel spans, structured logs with trace IDs, and token cost attributes to every LLM call.'

An observable system does not mean a perfect system — it means a system where failures are discoverable. When the AI agent introduces a bug, a prompt regression silently degrades quality, or a new model version triples your token cost, your traces, metrics, and logs will tell you exactly what changed and where to look. A system you cannot observe is a system you cannot operate in production — no matter how well the AI wrote the code.

Sources: