Deployment Strategies
Every time you ship a new version of your application, you face one unavoidable problem: the old version is serving real users right now. Somehow, the new version needs to take its place — without users experiencing errors, and without giving you a one-way door you cannot step back through.
The answer to that problem is your deployment strategy. It determines how your new code replaces the old, whether users experience downtime, how quickly you can detect a bad release, and how fast you can undo it.
Most developers start with the simplest possible approach: stop the old version, start the new one. That works fine on a laptop. In production, with users depending on 24/7 availability, it is a direct path to outages and user-facing errors.
Why the Simplest Approach Fails
The most basic deployment — stop the running process, deploy new code, start it again — is called a Recreate deployment. Between the stop and the start, your service is completely unavailable. Every request during that window fails.
The Recreate strategy does have legitimate uses: development environments, batch processing jobs with no live traffic, and stateful systems where running two versions simultaneously would cause data corruption. But for any production service with real users, the goal is to make the transition invisible.
The four strategies that achieve this are Rolling Update, Blue/Green, Canary, and Shadow. Each makes a different trade-off between infrastructure cost, operational complexity, and how much risk you expose to real users before you know whether a release is good.
| Strategy | Downtime? | Rollback Speed | Extra Infra Cost | Best For |
|---|---|---|---|---|
| Recreate | Yes | Full redeploy | None | Dev environments, batch jobs, stateful systems that cannot run two versions at once |
| Rolling Update | No | 2–5 minutes | None | Most stateless web services — the right Kubernetes default |
| Blue/Green | No | Seconds | 2x production infra | High-stakes releases requiring instant rollback or precise cutover timing |
| Canary | No | Automatic on metric breach | Small overhead | High-traffic services where catching bugs on 5% of real traffic is worth the tooling cost |
| Shadow | No | N/A — no real traffic affected | 2x backend compute | Validating critical infrastructure replacements under real load without any user impact |
Rolling Update
A rolling update gradually replaces instances of the old version with the new version, one or a few at a time, while the remaining instances continue serving traffic. At no point is the service fully down. It is the default strategy for Kubernetes Deployments because it handles the common case, stateless HTTP services, correctly without requiring a second environment.
Rolling Update
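During a rollout, Kubernetes respects two parameters: maxSurge (how many extra pods above the desired count can exist during the rollout) and maxUnavailable (how many pods below the desired count are acceptable). Both default to 25%. Setting maxSurge: 1 and maxUnavailable: 0 makes Kubernetes replace pods one at a time and keeps serving capacity at the desired count throughout: a new pod must be created and pass its readiness probe before an old pod is removed. This gives zero downtime only if the readiness probe accurately reports whether the pod can serve traffic.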
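To see how the two parameters bound a rollout, here is a minimal simulation sketch. It is not how the Kubernetes controller is implemented: `simulate_rolling_update` and `RolloutState` are hypothetical names, and the model assumes every new pod becomes ready within one step.

```python
from dataclasses import dataclass


@dataclass
class RolloutState:
    old_ready: int
    new_ready: int


def simulate_rolling_update(
    desired: int, max_surge: int, max_unavailable: int
) -> list[RolloutState]:
    """Return the ready-pod counts after each step of a simplified rollout.

    Assumption: a new pod passes its readiness probe within the same step.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = desired, 0
    history = [RolloutState(old, new)]
    while old > 0 or new < desired:
        # Scale up: total pods may not exceed desired + max_surge.
        can_add = min(desired - new, desired + max_surge - (old + new))
        new += max(can_add, 0)
        # Scale down: ready pods may not fall below desired - max_unavailable.
        can_remove = min(old, (old + new) - (desired - max_unavailable))
        old -= max(can_remove, 0)
        history.append(RolloutState(old, new))
    return history


if __name__ == "__main__":
    for state in simulate_rolling_update(desired=3, max_surge=1, max_unavailable=0):
        print(state)
```

With `max_surge=1` and `max_unavailable=0`, the total of ready pods never drops below the desired count in this model. With `max_surge=0` and `max_unavailable=1`, it dips by one pod during the rollout.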
Blue/Green Deployment
A blue/green deployment maintains two complete, identical environments: the blue environment runs the current live version, and the green environment runs the new version. All traffic goes to blue. You deploy and test the new version in green. When you are confident it is good, you flip the load balancer to send all traffic to green in one instant switch.
The old environment (blue) does not disappear — it stays running as a hot standby. If green shows problems after the flip, you route traffic back to blue in seconds, with no redeployment required.
Blue/Green Deployment
Two complete environments run simultaneously. The load balancer or Kubernetes Service selector determines which one receives live traffic. The inactive environment stays warm as an instant rollback target. A single selector change redirects all new requests at once: there is no gradual rollout and no extended period in which both versions serve users.
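The switching logic can be sketched in a few lines. This is a toy stand-in for a load balancer or Service selector, not a real Kubernetes API: `BlueGreenRouter` and its methods are hypothetical names, and the two backends are plain Python callables.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    """Toy model of a selector that points at exactly one environment."""

    def __init__(self, blue: Callable[[str], str], green: Callable[[str], str]):
        self.backends: Dict[str, Callable[[str], str]] = {"blue": blue, "green": green}
        self.live = "blue"  # all traffic starts on the current version

    def handle(self, request: str) -> str:
        # Every request goes to whichever environment is currently live.
        return self.backends[self.live](request)

    def switch_to(self, color: str) -> None:
        # Cutover and rollback are the same operation with different arguments.
        if color not in self.backends:
            raise ValueError(f"unknown environment: {color}")
        self.live = color


if __name__ == "__main__":
    router = BlueGreenRouter(lambda r: f"v1:{r}", lambda r: f"v2:{r}")
    print(router.handle("GET /"))   # served by blue
    router.switch_to("green")       # cutover
    print(router.handle("GET /"))   # served by green
    router.switch_to("blue")        # rollback, no redeploy
    print(router.handle("GET /"))
```

The point of the sketch is that rollback needs no new deployment: the old environment is still running, so undoing the release is a second pointer change.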
Canary Deployment
A canary deployment routes a small percentage of real production traffic to the new version while the rest continues using the stable version. You watch the new version's error rates, latency, and business metrics closely. If the metrics are healthy, you gradually increase the traffic slice. If anything looks wrong, you route traffic back to the stable version — often automatically, triggered by metric thresholds you define in advance.
The name comes from coal miners who used canaries to detect toxic gases underground before they could harm humans. A canary deployment is your early warning system: a small slice of real users hit the new version first, and a problem that would affect everyone is detected while it still only affects a few.
Canary Deployment
A small percentage of real traffic is sent to the new version (v2). Automated metric analysis decides whether to promote (increase the traffic slice) or roll back (return to 100% v1). Tools like Argo Rollouts and Flagger automate the traffic-shifting loop and the promotion/abort decision based on Prometheus metrics or other observability providers.
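The promote-or-abort loop described above can be sketched as follows. This illustrates the decision logic only and is not the API of Argo Rollouts or Flagger. `run_canary`, `error_rate_at`, and the 1% threshold are assumptions chosen for the example; in a real setup the error rate would come from a metrics query.

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    weights: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[str, int]:
    """Step through canary traffic weights, aborting on the first metric breach.

    Returns the outcome and the final share of traffic on the new version.
    """
    for weight in weights:
        observed = error_rate_at(weight)  # stand-in for a metrics query
        if observed > max_error_rate:
            return ("rolled_back", 0)  # all traffic returns to the stable version
    return ("promoted", 100)  # every step stayed healthy


if __name__ == "__main__":
    healthy = lambda weight: 0.001
    degrades_under_load = lambda weight: 0.05 if weight >= 25 else 0.001
    print(run_canary([5, 25, 50], healthy))
    print(run_canary([5, 25, 50], degrades_under_load))
```

Defining the threshold before the rollout starts is the design choice that matters: the abort decision is then mechanical rather than a judgment call made during an incident.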
Shadow Deployment
A shadow deployment (also called a dark launch or traffic mirroring) runs the new version in parallel with the current version, but never returns its responses to users. Real production requests are duplicated — the original request is handled by v1 and returned to the user as normal; a copy of the same request is sent to v2, whose response is logged for comparison only. Users see nothing different. You see exactly how v2 behaves under real production traffic.
Shadow deployments are most valuable for validating critical infrastructure replacements: rewriting a payment processing service, replacing a recommendation engine, upgrading an AI model pipeline, or refactoring a core calculation service. Staging environments can simulate load, but they cannot fully replicate real production traffic patterns or the specific edge cases in real user data. Shadow deployments solve this: the cost of a wrong answer in these critical services is high enough to justify running two complete backends simultaneously to verify they produce equivalent results under genuine production conditions.
The key constraint: Shadow traffic must be safe to send twice. If v2 has write side effects — writing to a database, sending emails, charging a card, triggering a webhook — those side effects happen for real when shadow traffic hits v2. Shadow deployments work best for read-heavy or compute-only services. For services with write operations, shadow the reads, not the writes, or route shadow traffic to a separate test database.
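A minimal sketch of the mirroring idea follows, assuming both versions are side-effect-free callables. `handle_with_shadow` is a hypothetical helper. Real mirroring is usually done by a proxy or service mesh and runs asynchronously, whereas this version calls v2 inline for clarity.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("shadow")


def handle_with_shadow(
    request: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    """Serve the request from v1 and replay a copy against v2 for comparison."""
    response = primary(request)  # the only response the user ever sees
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:
        # A crash in v2 is recorded but must never reach the user.
        logger.warning("shadow failed for %r: %s", request, exc)
        mismatches.append((request, response, exc))
    return response
```

The `mismatches` list is the output you care about: an empty list after a representative traffic sample is evidence that v2 is equivalent to v1 for those requests.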
The Hidden Challenge: Database Migrations
Every deployment strategy above handles the application layer — your code. But production systems have a second layer that most beginners overlook: the database schema.
The fundamental problem: during a rolling update, v1 and v2 pods share the same live database simultaneously. During a blue/green deployment, both environments typically share the same database. If v2 requires a schema change that v1 is incompatible with, deploying the migration and the application at the same time will break one of them.
The solution is the expand-contract pattern: decouple your database migration from your application deployment by making schema changes in stages, where each stage is backward-compatible with the currently running application version.
The Expand-Contract Pattern
Safe zero-downtime schema changes follow three phases. Expand: add the new structure while keeping the old, so both the current and new app versions work. Migrate: backfill data and update the app to write to both old and new structures. Contract: remove the old structure only after the old app version is completely gone. Each phase deploys independently, and at no point is the schema incompatible with an application version that is still running.
A practical rule for migration timing: deploy additive (expand) migrations before the application code that needs them, and run destructive (contract) migrations only after no running version still uses the old structure. Your database should always be in a state that every currently running application version can handle. A new nullable column is invisible to v1, which simply ignores it. A dropped column is not.
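As an illustration, here is a sketch of the expand and migrate phases using an in-memory SQLite database. The scenario (renaming `name` to `full_name` on a `users` table) and the function names are assumptions made for the example. The contract step is shown only as a comment because it must wait until no v1 instance remains.

```python
import sqlite3


def v1_insert(conn: sqlite3.Connection, name: str) -> None:
    # Old application code: knows nothing about full_name.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def v2_insert(conn: sqlite3.Connection, full_name: str) -> None:
    # New application code: writes both columns while v1 may still be running.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (full_name, full_name)
    )


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    v1_insert(conn, "Ada Lovelace")

    # Expand: add the new nullable column. v1 keeps working unchanged.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    v1_insert(conn, "Grace Hopper")  # still succeeds after the expand step

    # Migrate: v2 dual-writes, and a backfill fills rows written by v1.
    v2_insert(conn, "Alan Turing")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

    # Contract (not run here): once no v1 instance remains, drop the old column,
    # e.g. ALTER TABLE users DROP COLUMN name
    return conn.execute("SELECT id, name, full_name FROM users ORDER BY id").fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

After the backfill, every row has both columns populated, so either application version can read and write the table.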
Feature Flags: Decoupling Code from Features
Feature flags (also called feature toggles) separate two things that are often conflated: deployment (shipping code to the production infrastructure) and release (making a feature visible to users). With feature flags, you deploy v2 to 100% of your infrastructure on Monday, but only activate the new feature for 1% of users on Thursday — without any additional deployment.
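One common way to implement a percentage rollout is deterministic hashing, sketched below. This is an illustrative approach, not the API of any particular flag service. `bucket`, `is_enabled`, and the flag name `new-checkout` are hypothetical.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for percent in (1, 10, 50):
        count = sum(is_enabled("new-checkout", u, percent) for u in users)
        print(f"{percent}% rollout enables the flag for {count} of {len(users)} users")
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 1% keeps seeing it at 10%, and setting the percentage to 0 turns it off for everyone without a deployment.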
Feature flags and deployment strategies complement each other — they operate at different levels of the stack:
| Tool | Controls | Rollback Mechanism | Granularity |
|---|---|---|---|
| Deployment strategy (rolling, blue/green, canary) | Which version of the code runs in the infrastructure | Re-route traffic or re-deploy older image (minutes) | Pod or instance level |
| Feature flags | Which features are visible inside the running code | Flip the flag to OFF, with no deployment needed (takes effect as fast as the flag system propagates changes) | User, cohort, or percentage |
The dark launch pattern combines feature flags with shadow-style behavior: you deploy the new feature and run it in the background (writing to a shadow table, making API calls whose results you discard) to validate its correctness — before the flag turns it on for any real users. When the flag eventually flips, you already have confidence the feature works correctly because it has been running silently against real data.
Rollback: Planning for Failure
Rollback is not an afterthought — it is half the deployment design. Before every production release, answer this question first: "If this deploy breaks something, how do we undo it, and how fast?"
| Strategy | Rollback Mechanism | Time to Rollback | Database Caveat |
|---|---|---|---|
| Rolling Update | kubectl rollout undo deployment/app — triggers a reverse rollout to the previous ReplicaSet | 2–5 minutes (re-runs rollout in reverse) | Safe if the migration was additive-only. Destructive migrations require expand-contract to be safe. |
| Blue/Green | Update the Service selector back to blue — one command | Seconds | Blue and green share a database. If green wrote data incompatible with blue, blue may see invalid state after the rollback. |
| Canary | Route 100% traffic back to stable — automatic if metric thresholds are configured, or one command manually | Seconds (automatic) or under a minute (manual) | Same as blue/green — schema changes must be backward-compatible during the canary window. |
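| Feature Flag | Set the flag to OFF; no deployment required | Near-instant, bounded by how quickly the flag change propagates | No schema impact, since the code that changes data only runs when the flag is ON. Data written while the flag was ON remains and may need cleanup. |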
The database is the hard part of rollback. Application code is stateless and trivially reversible — swap the image tag, rerun the rollout, done. Database state is not: once a row is deleted or a column is dropped, there is no built-in undo. Three categories of database changes, ordered from safest to hardest to roll back:
- Additive only (new nullable column, new table): safe to roll back at any point. The old application version ignores the new structure.
- Data transformations (backfilling values, reformatting rows): rollback requires a compensating migration that undoes the transformation. Write the rollback migration before you run the forward migration.
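- Destructive (dropping columns, dropping tables, changing column types non-reversibly): extremely difficult or impossible to reverse once data is gone. The expand-contract pattern does not eliminate the drop; it postpones it to the contract phase, when no running application version still depends on the old structure, so rolling back the most recent release does not require the dropped data.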
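For the data-transformation category, one way to act on "write the rollback migration before you run the forward migration" is to rehearse both inside a transaction that is always rolled back. The sketch below assumes a hypothetical `orders` table whose `amount` column is converted from dollars to cents. The SQL statements and the `rehearse` helper are illustrative only.

```python
import sqlite3

# Hypothetical transformation: store order amounts in cents instead of dollars.
FORWARD_SQL = "UPDATE orders SET amount = amount * 100"
ROLLBACK_SQL = "UPDATE orders SET amount = amount / 100"


def amounts(conn: sqlite3.Connection) -> list:
    return [row[0] for row in conn.execute("SELECT amount FROM orders ORDER BY id")]


def rehearse(conn: sqlite3.Connection, forward_sql: str, rollback_sql: str) -> bool:
    """Apply forward then rollback and report whether the data round-trips.

    The enclosing transaction is always rolled back, so the table is left
    exactly as it was found.
    """
    before = amounts(conn)
    try:
        conn.execute(forward_sql)
        conn.execute(rollback_sql)
        return amounts(conn) == before
    finally:
        conn.rollback()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(5,), (12,)])
    conn.commit()
    print(rehearse(conn, FORWARD_SQL, ROLLBACK_SQL))
```

A rehearsal like this only proves that the pair is invertible for the rows it ran against. Lossy transformations have no true inverse and need the original values preserved somewhere, such as in a separate column, before the forward migration runs.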
Choosing the Right Strategy
No single deployment strategy is universally correct. The right choice depends on your traffic volume, rollback requirements, database complexity, and your team's operational maturity. If you are new to Kubernetes, start with rolling updates and introduce additional complexity only when a specific production problem demands it.
| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| Small team, modest traffic, straightforward stateless service | Rolling Update | Zero infrastructure overhead, built into Kubernetes by default. Master this before adding complexity. |
| High-stakes release needing instant undo and pre-cutover validation | Blue/Green | Instant rollback in seconds. Validate the new environment under production conditions before any real user sees it. |
| High-traffic service with a history of production-only bugs | Canary | Catch regressions on 5% of real traffic before full rollout. Worth the tooling cost when staging never reproduces your incidents. |
| Critical infrastructure replacement (payment service, AI model, core calculation) | Shadow | Validate new system behavior under real production traffic with zero user-facing risk before cutover. |
| Feature experimentation, A/B testing, or need an instant kill-switch | Feature Flags (combined with any deployment strategy) | Decouple release from deployment. Instant rollback without re-deploying anything. |
| Non-trivial database schema changes with zero-downtime requirement | Expand-Contract + Rolling Update | Deploy schema changes in backward-compatible phases; use rolling updates for application deployments between phases. |
What AI Agents Get Wrong with Deployments
AI coding tools can generate Kubernetes Deployment manifests quickly — but they make consistent mistakes in the areas that matter most for production safety: readiness probes, migration planning, rollback procedures, and metric-based canary promotion.
AI-Generated Deployment Configs: Common Gaps
AI agents produce syntactically correct Kubernetes manifests but consistently omit the production requirements that matter most: readiness probes, correct rolling update parameters, database migration safety, rollback procedures, and automated canary analysis. The result works on the first deploy but fails in real incidents.
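Some of these gaps can be checked mechanically before a manifest is applied. The sketch below inspects a Deployment manifest that has already been parsed into a Python dict (for example from YAML) and reports two of the omissions named above. `find_deployment_gaps` is a hypothetical helper, and requiring maxUnavailable to be 0 reflects this article's recommendation rather than a Kubernetes requirement.

```python
from typing import Any, Dict, List


def find_deployment_gaps(manifest: Dict[str, Any]) -> List[str]:
    """Report missing production-safety settings in a parsed Deployment manifest."""
    gaps: List[str] = []
    spec = manifest.get("spec", {})

    # Rolling update parameters: flag anything other than an explicit zero.
    strategy = spec.get("strategy", {})
    if strategy.get("type", "RollingUpdate") == "RollingUpdate":
        rolling = strategy.get("rollingUpdate", {})
        if rolling.get("maxUnavailable") not in (0, "0", "0%"):
            gaps.append("spec.strategy.rollingUpdate.maxUnavailable is not set to 0")

    # Readiness probes: every container should declare one.
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        if "readinessProbe" not in container:
            name = container.get("name", "<unnamed>")
            gaps.append(f"container '{name}' has no readinessProbe")

    return gaps


if __name__ == "__main__":
    generated = {
        "kind": "Deployment",
        "spec": {
            "template": {
                "spec": {"containers": [{"name": "web", "image": "example/web:2"}]}
            }
        },
    }
    for gap in find_deployment_gaps(generated):
        print(gap)
```

A check like this covers only what is visible in the manifest. Migration safety and rollback procedures still have to be reviewed by a person.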
Summary
| Concept | Key Takeaway |
|---|---|
| Recreate | Stop v1, start v2. Causes downtime. Only appropriate for dev environments or systems that cannot safely run two versions at once. |
| Rolling Update | Replace pods one at a time. Zero downtime with correct readiness probes, maxSurge: 1, maxUnavailable: 0. The Kubernetes default and the right starting point for most services. |
| Blue/Green | Two identical environments; instant atomic traffic flip. Rollback in seconds. Costs 2x infrastructure during the cutover window. |
| Canary | Route 5% of real traffic to new version, watch metrics, increase the slice. Catches production-only bugs early. Requires traffic-splitting tooling and automated metric analysis to be effective. |
| Shadow | Mirror production traffic to new version without showing responses to users. Validates behavior under real load with zero user-facing risk. Best for critical infrastructure replacements. |
| Expand-Contract | The safe pattern for database schema changes alongside zero-downtime deployments. Add the new structure (expand), migrate data and app, then remove the old structure (contract). Always run migrations before the app code that requires them. |
| Feature Flags | Decouple code deployment from feature release. Instant rollback by flipping a flag — no redeployment. Combine with any deployment strategy for fine-grained release control. |
| Rollback planning | Application rollback is fast; database rollback is hard. Additive migrations are safe to undo; destructive ones are not. The expand-contract pattern postpones destructive changes until no running version depends on the old structure. |
| AI agent gaps | AI generates correct manifest syntax but omits readiness probes, correct rolling update parameters, database migration safety, and rollback procedures. Always prompt explicitly for these. |