Fault Tolerance
In distributed systems, failures are not edge cases — they are the norm. Network connections drop. Third-party APIs go down. Database replicas lag. Processes crash under memory pressure. The question is never whether something will fail, but how your system responds when it does.
Fault tolerance is the design property that allows a system to continue operating — possibly at a reduced level — in the presence of failures. A fault-tolerant system does not eliminate failures; it contains them, so a single component breaking does not cascade into a full outage.
This section covers three complementary techniques:
- Circuit Breakers — detect a failing dependency and stop sending requests to it, returning a fast error or fallback immediately
- Graceful Degradation — reduce functionality deliberately rather than fail completely when a dependency is unavailable
- Chaos Engineering — prove the above actually works by intentionally breaking things in a controlled way
These three form a natural triad. Graceful degradation defines what the system should do when a dependency fails. Circuit breakers detect that failure and trigger the fallback automatically. Chaos engineering proves that both mechanisms work correctly before a real incident does.
Circuit Breakers#
The Problem: Cascading Failure#
Imagine your application calls an AI API (OpenAI, Anthropic, or your own model server) for every user search query. The API becomes overloaded: it starts rejecting some requests with HTTP 429, and the requests it does accept hang until your client's 30-second timeout expires.
If 100 users send queries per second, and each call holds a thread for 30 seconds before failing, you'll accumulate 3,000 blocked threads in 30 seconds. Your thread pool is exhausted — and your entire service becomes unresponsive, not because of your own code, but because a dependency is slow. This is a cascading failure: one struggling downstream service has taken down your healthy upstream service.
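The arithmetic above is an instance of Little's law: the number of requests in flight equals the arrival rate multiplied by how long each request is held. A minimal sketch follows; the helper name `blocked_threads` and the 1 ms fail-fast figure are illustrative assumptions, not values from a real system.

```python
def blocked_threads(requests_per_second: float, hold_seconds: float) -> float:
    """Steady-state number of threads held by in-flight calls (Little's law)."""
    return requests_per_second * hold_seconds


# The scenario from the text: 100 requests/s, each stuck for a 30 s timeout.
print(blocked_threads(100, 30))  # 3000

# Assumed fail-fast case: if a rejected call holds its thread for about 1 ms,
# the same traffic keeps well under one thread busy on average.
print(blocked_threads(100, 0.001))
```

The second call is the motivation for failing fast: the traffic is unchanged, but shrinking the hold time shrinks the number of occupied threads proportionally.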
A circuit breaker prevents this by detecting the failure pattern and stopping calls to the troubled service, returning a fast error or fallback response immediately rather than waiting for each request to time out.
The name comes from electrical engineering: a physical circuit breaker cuts power when it detects dangerous overcurrent, preventing a fire. A software circuit breaker cuts the call path when it detects too many failures, preventing resource exhaustion.
The Three States#
The circuit breaker is a finite state machine with three states:
Figure: Circuit Breaker State Machine. The circuit breaker tracks outcomes in a rolling window. When the failure rate exceeds the threshold, it opens and rejects all requests immediately. After a wait period, it enters the half-open state and sends a small number of test requests to probe for recovery.
State walkthrough in detail:
- CLOSED is the default, healthy state. Every request passes through to the downstream service. The breaker tracks outcomes in a rolling window of recent calls. The window can be count-based (e.g., track the last 100 calls) or time-based (e.g., track all calls within the last 60 seconds). If the failure rate in that window climbs above a threshold (typically 50%), the breaker transitions to OPEN. A minimum call count (e.g., 10 calls) must be reached before the threshold can fire, preventing a single failure at startup from tripping the breaker immediately.
- OPEN means the circuit is tripped. Every incoming call is immediately rejected, and no request reaches the downstream service. This is "fail fast": instead of waiting 30 seconds for a timeout, your fallback logic runs in microseconds, freeing threads and giving the failing dependency time to recover. After a configured wait period (typically 60 seconds), the breaker moves to HALF_OPEN.
- HALF_OPEN is the recovery probe. A small number of test requests (e.g., 10) are allowed through. If they succeed, the breaker transitions back to CLOSED: the dependency has recovered and normal traffic resumes. If they fail, the breaker returns to OPEN and the wait timer resets.
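The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation. The class and parameter names are assumptions, and the defaults simply mirror the example figures in the text. Two further assumptions: the breaker trips when the failure rate is at or above the threshold, and in HALF_OPEN it waits for all probe calls before deciding. A real library also caps concurrent probe calls, which this sketch does not.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window_size=100, min_calls=10,
                 open_seconds=60.0, probe_calls=10, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.min_calls = min_calls
        self.open_seconds = open_seconds
        self.probe_calls = probe_calls
        self.clock = clock                       # injectable so tests can fake time
        self.state = State.CLOSED
        self.window = deque(maxlen=window_size)  # count-based window; True = failure
        self.opened_at = 0.0
        self.probes = []

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open")  # fail fast
            self.state = State.HALF_OPEN                   # wait period is over
            self.probes = []
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state is State.HALF_OPEN:
            self.probes.append(failed)
            if len(self.probes) >= self.probe_calls:
                if sum(self.probes) / len(self.probes) >= self.failure_rate:
                    self._open()                 # still unhealthy: reset the timer
                else:
                    self.state = State.CLOSED    # recovered: resume normal traffic
                    self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) >= self.min_calls:   # ignore tiny samples
            if sum(self.window) / len(self.window) >= self.failure_rate:
                self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.window.clear()
```

A caller would wrap each dependency call as `breaker.call(fetch_results, query)` and catch `CircuitOpenError` to run its fallback; `fetch_results` here stands for whatever function talks to the dependency.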
Circuit Breakers and AI APIs#
The circuit breaker pattern is especially important when calling LLM APIs (OpenAI, Anthropic, Google Gemini). These APIs enforce real rate limits and have experienced production outages. A naive implementation without a circuit breaker will hammer the API through every timeout, exhausting your thread pool and making your entire application unresponsive.
A practical guideline for AI API integrations:
- HTTP 429 (Too Many Requests) and HTTP 503 (Service Unavailable) should both count as failures in the breaker window.
- Slow calls (requests that succeed but take longer than your latency budget, e.g., >10 seconds) should also count as failures — most circuit breaker libraries let you configure a "slow call duration threshold." A slow call is often the earliest warning sign of an overloaded service.
- HTTP 401 / 403 (authentication errors) should not count as failures — these indicate a bug in your configuration, not the API being unhealthy.
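The guideline above amounts to a small classification function that decides what the breaker records for each finished call. The sketch below is one possible reading. The function name, the three labels, and the use of `None` for a timed-out call are assumptions, and the 10-second budget is the example value from the text.

```python
from typing import Optional

FAILURE_STATUSES = {429, 503}   # the two statuses named in the guideline
IGNORED_STATUSES = {401, 403}   # configuration bugs, not an unhealthy API
SLOW_CALL_SECONDS = 10.0        # latency budget from the example in the text


def classify(status: Optional[int], elapsed_seconds: float) -> str:
    """Return 'failure', 'ignored', or 'success' for one finished call.

    status is None when the call timed out or the connection dropped.
    """
    if status is None or status in FAILURE_STATUSES:
        return "failure"
    if status in IGNORED_STATUSES:
        return "ignored"          # do not record in the breaker window at all
    if elapsed_seconds > SLOW_CALL_SECONDS:
        return "failure"          # succeeded, but too slowly to be useful
    return "success"


print(classify(429, 0.2))   # failure
print(classify(401, 0.1))   # ignored
print(classify(200, 12.0))  # failure
print(classify(200, 0.3))   # success
```

Other server errors would probably belong in the failure set as well; the sketch sticks to the statuses the guideline names.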
Retry and circuit breaker together: Retries handle transient errors (a single request that failed due to a brief network blip). Circuit breakers handle sustained failures (the service is genuinely down). Combining them without care creates retry storms: if 1,000 clients each retry 3 times at 1-second intervals against a struggling service, every failing request becomes four attempts, so the incoming load grows by up to 4× at exactly the worst moment. The correct layering puts the circuit breaker on the outside and the retry logic inside it, so a single "call" from the circuit breaker's perspective may internally retry several times before recording one outcome. Once the circuit opens, the breaker rejects calls immediately, before any retry logic runs, so no requests reach the downstream service until the circuit enters HALF_OPEN.
| | Retry | Circuit Breaker |
|---|---|---|
| What it handles | Transient, short-lived failures (network blip, one slow response) | Sustained failures — the service is down or rate-limiting for a prolonged period |
| Behavior | Reattempt the same call after a delay | Stop all calls immediately; return a fast failure or fallback response |
| Risk without care | Retry storms — 1,000 clients each retrying amplifies load on an already struggling service | Missing retries for transient errors that would have succeeded on the second attempt |
| Use together? | Yes — retry inside the circuit breaker for transient errors; the breaker stops all retries when it opens | Yes — they are complementary, not alternatives |
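The layering described above (breaker outside, retry inside) can be shown with a deliberately simplified breaker that opens after a fixed number of consecutive failed calls. `TinyBreaker`, `with_retry`, and `flaky_api` are hypothetical names used only for this illustration, and the delay between retries is omitted to keep the example short.

```python
class CircuitOpenError(Exception):
    pass


class TinyBreaker:
    """Simplified breaker: opens after `max_failures` consecutive failed calls."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.max_failures:
            raise CircuitOpenError("rejected before any retry logic runs")
        try:
            result = fn()
        except Exception:
            self.failures += 1   # one outcome recorded per logical call
            raise
        self.failures = 0
        return result


def with_retry(fn, attempts=3):
    """Wrap fn so that one logical call makes up to `attempts` tries."""
    def wrapped():
        last_error = None
        for _ in range(attempts):
            try:
                return fn()
            except Exception as exc:
                last_error = exc
        raise last_error
    return wrapped


downstream_hits = 0


def flaky_api():
    """Stand-in for a dependency that is completely down."""
    global downstream_hits
    downstream_hits += 1
    raise ConnectionError("service down")


breaker = TinyBreaker(max_failures=2)
guarded = with_retry(flaky_api, attempts=3)

for _ in range(5):
    try:
        breaker.call(guarded)   # breaker outside, retry inside
    except (ConnectionError, CircuitOpenError):
        pass

# Two logical calls x three attempts reach the dependency; the remaining
# three calls are rejected by the open breaker without touching it.
print(downstream_hits)  # 6
```

With the layers reversed (retry outside, breaker inside), every rejected call would itself be retried, which is work the open breaker is supposed to prevent.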
What AI Agents Get Wrong About Circuit Breakers#
AI-Generated Code and Circuit Breakers
AI agents usually generate functionally correct API call code but often omit circuit breakers and fallback logic. In production with an unreliable dependency, such code can cause cascading failures that bring down unrelated parts of the system.
A note on exponential backoff with jitter: When retrying inside a closed circuit breaker, use exponential backoff with random jitter. The delay is roughly min(base × 2^attempt, max_backoff) + random_jitter. The jitter component is critical: without it, all clients that failed at the same moment back off for exactly the same duration and retry simultaneously. The result is a "thundering herd", a synchronized burst of traffic that spikes the struggling service again. Jitter spreads retries across a time window, smoothing the traffic shape. AWS SDKs use this pattern by default; most other retry libraries support it through configuration.
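The backoff formula translates almost directly into code. In this minimal sketch the function name, parameter names, and default values are illustrative assumptions; only the formula comes from the text.

```python
import random


def backoff_delay(attempt, base=0.5, max_backoff=30.0, max_jitter=1.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Implements min(base * 2^attempt, max_backoff) + random_jitter.
    """
    capped = min(base * 2 ** attempt, max_backoff)
    return capped + rng.uniform(0, max_jitter)


# Without jitter the delays would be 0.5, 1, 2, 4, ... capped at 30 seconds.
# Jitter adds up to one extra second, so clients that failed together
# do not all come back at the same instant.
rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(8):
    print(attempt, round(backoff_delay(attempt, rng=rng), 2))
```

A common variant, often called "full jitter", draws the whole delay uniformly between zero and the capped value instead of adding a small random term on top. Which variant a given SDK uses is worth checking in its documentation.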
Graceful Degradation#
A circuit breaker stops your service from hammering a failing dependency. But what do you tell the user? The answer is graceful degradation: instead of returning an error page, your system returns a reduced-but-functional response. The user can still accomplish their core task — just with fewer features.
The key principle: a degraded experience is always better than a broken one. An e-commerce site that shows popular products instead of personalized AI recommendations during an outage is still usable. A search page that returns keyword results instead of AI-ranked results is still functional. An error page that says "Service unavailable" is neither.
The Fallback Hierarchy#
Graceful degradation means designing explicit fallback tiers, each one simpler and more reliable than the last. When a tier fails, the system drops to the next one automatically.
Figure: AI Search Fallback Hierarchy. A real-world example of graceful degradation: an AI-powered search feature with four fallback tiers. Each tier is simpler, faster, and more reliable than the one above it. The system drops to the next tier automatically when the current one fails or its circuit breaker opens.
The keyword search tier (Tier 2) uses traditional relevance scoring algorithms — BM25 and TF-IDF rank results by how often query terms appear in a document, adjusted for document length and term rarity. They are less sophisticated than AI-powered semantic search, but they run entirely on your own infrastructure with no external API calls, making them a highly reliable fallback.
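To make the keyword tier concrete, here is a compact sketch of BM25 scoring over pre-tokenized documents. It uses the commonly cited form of the formula with `k1 = 1.5` and `b = 0.75`. Those defaults, the function name, and the toy documents are assumptions for illustration, and the sketch assumes at least one non-empty document. A production system would normally rely on a search engine rather than hand-rolled scoring.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query tokens."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight (term rarity).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized (length adjustment).
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [["red", "shoes", "red"], ["blue", "shoes"], ["green", "hat"]]
scores = bm25_scores(["red", "shoes"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

Everything here runs in-process on local data, which is the property that makes this tier a dependable fallback.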
Feature Flags: The Control Plane for Degradation#
Hardcoding fallback logic directly in application code — with if/else branches — has a serious flaw: switching between tiers requires a redeployment. During an incident, when the primary tier has just failed, the last thing you want is to trigger a deployment to activate the fallback.
Feature flags (also called kill switches when used for degradation) are runtime toggles that control which tier is active without a code change or deployment. When the AI search service starts struggling, an on-call engineer flips a flag in a dashboard and all traffic routes to keyword search in seconds.
Feature flag platforms such as Unleash and LaunchDarkly can be connected to monitoring so that degradation happens automatically: when the AI search error rate exceeds a threshold, an alert flips the flag with no human intervention. This matters most for incidents that happen off-hours. The system degrades safely without waking anyone up, and the on-call engineer reviews the incident in the morning rather than at 3am.
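A rough sketch of both ideas, the manual kill switch and the metric-driven one, is shown below. `FlagStore` is an in-memory stand-in invented for this example, not the Unleash or LaunchDarkly SDK. The flag name, the 20% threshold, and the window size are likewise assumptions.

```python
from collections import deque


class FlagStore:
    """In-memory stand-in for a feature flag service (hypothetical)."""

    def __init__(self):
        self._flags = {"ai_search_enabled": True}

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def set(self, name, value):
        self._flags[name] = value


def search(query, flags, ai_search, keyword_search):
    """Pick the tier at request time, so flipping the flag needs no deploy."""
    if flags.is_enabled("ai_search_enabled"):
        return ai_search(query)
    return keyword_search(query)


class ErrorRateKillSwitch:
    """Turns a flag off when the recent error rate crosses a threshold."""

    def __init__(self, flags, flag_name, threshold=0.2, window=50):
        self.flags = flags
        self.flag_name = flag_name
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = error

    def record(self, error):
        self.outcomes.append(error)
        full = len(self.outcomes) == self.outcomes.maxlen
        if full and sum(self.outcomes) / len(self.outcomes) > self.threshold:
            self.flags.set(self.flag_name, False)


flags = FlagStore()
switch = ErrorRateKillSwitch(flags, "ai_search_enabled", threshold=0.2, window=10)
print(search("shoes", flags, lambda q: "ai results", lambda q: "keyword results"))

for i in range(10):
    switch.record(error=(i < 3))  # 3 errors out of 10 exceeds the 20% threshold

print(search("shoes", flags, lambda q: "ai results", lambda q: "keyword results"))
```

The first `search` call returns the AI tier's result and the second returns the keyword tier's, with no change to the calling code. Turning the flag back on is left as a deliberate human decision in this sketch.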
Graceful Degradation for AI Features#
AI features have a natural degradation hierarchy based on capability and reliability:
| Tier | Behavior | When to Use It |
|---|---|---|
| Full AI | Large frontier model (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro) with the highest accuracy and quality | Normal operation; primary model API is healthy and within rate limits |
| Smaller model | Faster, cheaper model (Claude Haiku, GPT-4o-mini) — lower quality but still AI-powered | Primary model is overloaded or rate-limited; an acceptable quality reduction is preferable to an error |
| Rule-based logic | Deterministic rules — keyword matching, hard thresholds, decision trees — always available | All AI providers are unavailable; reliability matters more than intelligence |
| Queue and defer | Accept the request, buffer it in a queue, process when the AI service recovers | Non-time-critical tasks: batch recommendations, content moderation, background analytics |
| Fail safe | Stop completely — do not attempt a guess or approximation | Safety-critical systems (medical diagnosis, autonomous systems) where a wrong answer is worse than no answer |
A concrete example: a customer support chatbot backed by a large LLM. When the LLM API degrades:
- Try routing to a smaller, faster model (Claude Haiku instead of Claude Opus)
- If that also fails, fall back to a rule-based FAQ system that matches keywords to pre-written answers
- If the FAQ system finds no match, show: "Our AI assistant is temporarily unavailable. Here are our most common help topics." with links to documentation
- Always allow the user to submit a support ticket — the core function of getting help must survive any AI outage
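The chatbot steps above can be sketched as an ordered chain of tiers, where each tier is tried only if the previous one raised an error. The function names and the FAQ entries are placeholders invented for this illustration. The two model tiers are passed in as plain callables, so no particular provider SDK is assumed.

```python
FAQ = {  # placeholder content for the rule-based tier
    "refund": "Refunds are processed within 5 business days.",
    "password": "Use the 'Forgot password' link on the sign-in page.",
}
STATIC_MESSAGE = ("Our AI assistant is temporarily unavailable. "
                  "Here are our most common help topics.")


def faq_answer(question):
    """Rule-based tier: match keywords to pre-written answers."""
    for keyword, answer_text in FAQ.items():
        if keyword in question.lower():
            return answer_text
    raise LookupError("no FAQ match")


def answer(question, large_model, small_model):
    """Try each tier in order; return (tier_name, reply)."""
    tiers = [("large_model", large_model),
             ("small_model", small_model),
             ("faq", faq_answer)]
    for name, tier in tiers:
        try:
            return name, tier(question)
        except Exception:
            continue  # in production, log the error and emit a metric here
    return "static", STATIC_MESSAGE  # last tier cannot fail: a constant string


def unavailable(question):
    """Simulates a model API that is down."""
    raise TimeoutError("model API unavailable")


print(answer("How do I get a refund?", unavailable, unavailable)[0])  # faq
print(answer("Tell me a joke", unavailable, unavailable)[0])          # static
```

Returning the tier name alongside the reply makes it easy to count how often each tier served traffic. The support-ticket path from the last bullet sits outside this chain on purpose, so it stays available whichever tier answered.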
The isolation requirement: The rule-based fallback must be completely independent of your AI infrastructure — different code path, different dependencies, different process. If they share the same server or the same database, the failure that took down the LLM can take down the fallback. Fallback tiers that fail together are not fallbacks.
Chaos Engineering#
Circuit breakers and graceful degradation tell you how your system should behave under failure. Chaos engineering answers the harder question: does it actually behave that way?
Chaos engineering is the discipline of deliberately introducing failures into your own system — in a controlled, scientific way — to discover weaknesses before a real incident exposes them. The guiding principle, popularized by Netflix, is: "The best way to avoid failure is to fail constantly." By choosing when and how failures occur, you control the blast radius, observe the outcome, and fix problems during business hours rather than at 3am.
This is not random destruction. It is a structured experiment: you define what "normal" looks like, predict what should happen when a specific failure is introduced, inject that failure, and measure whether your prediction held.
Why Chaos Engineering Exists: The Netflix Origin#
Netflix began migrating from its own data centers to AWS in 2008, after a major database corruption incident, and the move took several years. In the cloud, individual instances can disappear at any time, and partially failing distributed systems are far harder to reason about than a single server going down. To force its services to cope with that, Netflix built Chaos Monkey, a tool that randomly terminates EC2 instances in production during business hours. The company described it publicly in 2010 and 2011 and open-sourced it in 2012.
The insight was simple and profound: because any instance could die at any moment during the workday, when engineers were present to respond, teams had to build services that survive instance loss automatically. Engineers who had to restart services manually on Tuesday made sure those services restarted automatically by Wednesday.
This evolved into the Simian Army: a suite of failure-injecting tools. Chaos Gorilla simulated an entire AWS Availability Zone going down. Latency Monkey injected artificial network delays. The common thread: every tool tested a real failure mode that would eventually happen in production, on Netflix's own schedule, with Netflix engineers watching.
The Four-Step Experiment#
Every chaos experiment follows the same scientific method:
Figure: The Chaos Engineering Experiment Loop. Chaos engineering is a structured scientific process. Each experiment starts with a hypothesis and ends with either a confirmed resilience property or a discovered weakness to fix. Running experiments without a hypothesis is not chaos engineering; it is causing outages.
A worked example: Your service has a circuit breaker around the AI API, and you believe it protects the application when the API is slow.
- Steady state: p99 latency of your search endpoint (the 99th-percentile response time — meaning 99% of requests complete faster than this threshold) is under 300ms; error rate is under 0.5%.
- Hypothesis: If the AI API starts returning 5-second responses (instead of its normal 200ms), the circuit breaker will detect them as slow calls and open within 60 seconds. After opening, search falls back to keyword results with p99 under 50ms.
- Inject: Use Toxiproxy — a TCP proxy that adds configurable network conditions — to insert a 5-second latency on all calls from your app to the AI API.
- Observe: Did the circuit breaker open? Did the fallback activate? Did user-facing p99 stay within budget? Did the breaker emit a metric and trigger an alert?
If the circuit breaker opened and the fallback served results within the latency budget, your hypothesis held. If users saw 5-second latencies and no fallback activated, you found a gap in your resilience before a real incident did.
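Writing the hypothesis down as code forces it to be precise enough to pass or fail. The sketch below evaluates the worked example's hypothesis against measurements collected during the experiment. The `Observation` fields and function names are assumptions, and collecting the measurements from your metrics system is out of scope here.

```python
from dataclasses import dataclass
from typing import List, Optional


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


@dataclass
class Observation:
    breaker_opened_after_s: Optional[float]  # None means it never opened
    fallback_latencies_ms: List[float]       # latencies measured after opening


def hypothesis_held(obs, max_open_s=60.0, max_fallback_p99_ms=50.0):
    """True if the breaker opened in time and the fallback stayed fast."""
    if obs.breaker_opened_after_s is None:
        return False
    if obs.breaker_opened_after_s > max_open_s:
        return False
    if not obs.fallback_latencies_ms:
        return False  # no fallback traffic observed: nothing was proven
    return p99(obs.fallback_latencies_ms) < max_fallback_p99_ms


good = Observation(breaker_opened_after_s=42.0,
                   fallback_latencies_ms=[12.0] * 99 + [48.0])
bad = Observation(breaker_opened_after_s=None, fallback_latencies_ms=[10.0])
print(hypothesis_held(good))  # True
print(hypothesis_held(bad))   # False
```

A failing result here is a discovered weakness to fix, which is the useful outcome of the experiment.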
Blast Radius Management#
The "blast radius" of a chaos experiment is how many users, services, or data it can affect if something goes wrong. Managing it is what separates chaos engineering from reckless destruction.
| Technique | How It Limits Blast Radius | Example |
|---|---|---|
| Target a subset | Kill 1 of 10 instances instead of all 10 | Terminate one Kubernetes pod in a 10-replica deployment; the other 9 continue serving traffic |
| Percentage traffic split | Route only 1% of traffic through the experimental path | Use a load balancer rule to send 1% of requests through the artificially delayed network path |
| Staging first | Run in a production-mirrored environment before touching production | Validate that the experiment produces the expected outcome before running it on real users |
| Abort conditions | Automatically halt the experiment if metrics cross a threshold | If user-facing error rate exceeds 1%, stop the experiment and revert immediately |
| Time-bounded | Run for a fixed window, then automatically revert | Inject latency for 5 minutes only — prevents an experiment from accidentally running indefinitely |
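Two of the techniques in the table, abort conditions and time bounding, fit naturally into a small experiment runner. In this sketch `inject`, `revert`, and `read_error_rate` are hypothetical callables you would supply (for example, adding and removing a latency fault, and querying your metrics system). The defaults mirror the example values in the table.

```python
import time


def run_experiment(inject, revert, read_error_rate,
                   max_seconds=300.0, abort_error_rate=0.01,
                   poll_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Run a fault injection with a time bound and an abort condition.

    Returns 'aborted' or 'completed'. revert() runs in every case,
    including when inject() or a metrics read raises.
    """
    started = clock()
    try:
        inject()
        while clock() - started < max_seconds:       # time-bounded
            if read_error_rate() > abort_error_rate:  # abort condition
                return "aborted"
            sleep(poll_seconds)
        return "completed"
    finally:
        revert()


# Demo with a fake clock so it finishes instantly.
fake_now = [0.0]
log = []
rates = iter([0.001, 0.002, 0.05])  # third reading breaches the 1% limit

outcome = run_experiment(
    inject=lambda: log.append("inject"),
    revert=lambda: log.append("revert"),
    read_error_rate=lambda: next(rates),
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(outcome, log)  # aborted ['inject', 'revert']
```

Putting `revert()` in a `finally` block is the important design choice: the fault is removed whether the experiment completes, aborts, or crashes partway through.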
Chaos Engineering Validates the Triad#
The most important use of chaos engineering in a system with circuit breakers and graceful degradation is to validate the whole system together. Circuit breakers and fallbacks can be individually correct but still fail to interact properly — the circuit breaker might open but the fallback path might have a bug, or the feature flag controlling the fallback might not be accessible when the AI API is down.
Without chaos engineering, you have a circuit breaker and a fallback — but no proof they work together. Chaos engineering turns your resilience design into a verifiable property of the system.
Tools Overview#
| Tool | Type | Best For |
|---|---|---|
| Toxiproxy (Shopify) | Open source — TCP proxy that injects latency, packet loss, and disconnects between services | The easiest starting point — runs locally, no Kubernetes required, tests specific network failure modes between two services |
| Chaos Mesh (CNCF) | Open source — Kubernetes-native; supports 17+ fault types including CPU stress, pod kill, network partition, and time manipulation | Kubernetes-heavy production environments; integrates with CI/CD pipelines |
| LitmusChaos (CNCF) | Open source — workflow-based Kubernetes fault injection with a pre-built experiment library (ChaosHub) | Cloud-native teams wanting a library of pre-built experiments without writing code from scratch |
| Gremlin | Managed SaaS — hosted platform with safety controls, blast radius management, and audit logging built in | Enterprise teams wanting guardrails, compliance audit trails, and a managed service |
| Chaos Monkey (Netflix) | Open source — randomly terminates EC2 instances on a configurable schedule | AWS-based deployments; the original tool; simplest to understand conceptually |
For most teams starting out, Toxiproxy is the right entry point: run it locally to inject latency or drop connections between your service and a dependency, then observe how your circuit breaker and fallback respond. No Kubernetes required, no production access needed — just a local integration test that proves your resilience logic works before you ever touch production.
Summary#
| Concept | Key Point |
|---|---|
| Why fault tolerance matters | In distributed systems, partial failures are inevitable. The goal is to contain failures so a single component breaking does not cascade into a full outage. |
| Circuit breaker states | CLOSED (requests flow through) → OPEN (fast-fail all requests after failure threshold) → HALF_OPEN (probe for recovery with limited test calls) → back to CLOSED. |
| Threshold tuning | Failure rate threshold, minimum call count, and slow-call threshold must be calibrated against your baseline. Too low: breaker trips on normal variance. Too high: breaker never fires when you need it. |
| Retry + circuit breaker | Retries handle transient errors. Circuit breakers handle sustained failures. Layer them correctly: retry inside the breaker, not outside — once the circuit opens, retries stop entirely. |
| Graceful degradation | Design explicit fallback tiers. When the AI API is down, fall back to keyword search; when that fails, return cached results. Each tier must be architecturally isolated from the ones above it. |
| Feature flags as kill switches | Use feature flags as the runtime control plane for degradation — switch between tiers in seconds without a redeployment. Advanced platforms support automatic degradation triggered by metrics thresholds. |
| Chaos engineering | A structured scientific process: define steady state, form a hypothesis, inject a failure, observe the outcome. No hypothesis = no chaos engineering, just causing random outages. |
| Blast radius | Limit experiment scope: target one instance at a time, use percentage traffic splits, define abort conditions, and run in staging before production. |
| The triad | Graceful degradation defines what to do on failure. Circuit breakers detect failure and trigger the fallback. Chaos engineering proves it all works. Use all three together. |
| AI agent gap | AI generates direct API calls with no circuit breakers or fallbacks. Always specify: 'Wrap this call in a circuit breaker with these thresholds and return this fallback when the circuit is open.' |
Fault-tolerant systems do not prevent failures — they make failures boring. A circuit breaker opening, a fallback activating, and a chaos experiment confirming everything works is just a Tuesday. A cascading failure that takes down your entire application because one dependency was slow is a 3am incident. Design for the Tuesday.