Container Orchestration
Docker solved the "works on my machine" problem by packaging an application and its dependencies into a portable container. But Docker alone answers only the first question: how do I run one container reliably? In production, you are not running one container — you are running dozens or hundreds, across multiple machines, and they need to start, stop, recover from failures, scale up under load, and communicate with each other without any of that being managed by hand.
This is the container orchestration problem, and Kubernetes is the industry-standard answer to it.
The Problem Kubernetes Solves
To understand why Kubernetes exists, picture what life looked like before it.
A team has five microservices, each running as a Docker container. They have three production servers. They write shell scripts to decide which service runs on which server. When a container crashes at 2am, an alert fires and someone SSHs in to restart it. When traffic spikes, they manually docker run more copies of the API server — on whichever server happens to have spare CPU, if they remember to check. When they deploy a new version of a service, they stop the old container and start the new one, which means a few seconds of downtime. When one of the three servers dies, everything that was running on it is just gone until someone notices.
This was the reality for teams running containerized apps at scale before orchestrators existed. The work of managing the containers became a second full-time job.
| Manual Problem | What Kubernetes Does Automatically |
|---|---|
| Container crashed — needs restart | Detects failure via health checks and restarts the container automatically |
| Traffic spike — need more copies | Scales replicas up and down based on CPU, memory, or custom metrics (HPA) |
| Deploy a new version — downtime | Performs rolling updates: starts new pods before stopping old ones, zero downtime |
| Server died — all containers lost | Detects node failure and reschedules all affected containers on healthy nodes |
| Which server has spare capacity? | Scheduler finds the best-fit node automatically based on resource availability |
| Services need to discover each other | Provides built-in DNS so services find each other by name, not by IP address |
Kubernetes originated as an open-source project from Google in 2014, encoding over a decade of lessons from their internal cluster management systems (Borg and Omega). Today it is the foundation of virtually every major cloud platform's container offering: Amazon EKS, Google GKE, and Azure AKS are all managed Kubernetes.
Kubernetes Architecture
A Kubernetes cluster is a set of machines — physical or virtual — divided into two roles: the control plane (the brain) and worker nodes (the muscle). You interact with the cluster using kubectl (the Kubernetes command-line tool), which sends your requests to the control plane. For example, kubectl apply -f deployment.yaml tells Kubernetes to create or update the resources described in that file, and kubectl get pods lists the pods currently running in the cluster. For a more interactive experience, k9s provides a terminal-based UI that lets you navigate and manage your cluster resources in real time — I highly recommend using it.
Kubernetes Cluster Architecture
The control plane manages cluster state and makes decisions. Worker nodes run the actual application containers. Every component communicates through the API server — the single front door to the cluster.
Control Plane Components
kube-apiserver: The single entry point to the cluster. Every operation, whether a kubectl apply, a scaling decision, or a kubelet reporting pod health, is an API call to the apiserver. It is the only component that reads and writes to etcd directly.
etcd — A distributed key-value store that holds the entire cluster state: every deployment, service, pod, and configuration. Think of it as the cluster's database. It is strongly consistent — every read returns the latest committed write.
kube-scheduler — Watches for new pods that have not been assigned to a node, and picks the best node for them. "Best" is determined by filtering (does the node have enough CPU/memory?) and ranking (which node has the most available capacity, or best matches affinity rules?).
kube-controller-manager — Runs a collection of controllers that continuously watch the cluster state and fix any divergence from what was requested. The Deployment Controller, ReplicaSet Controller, and Node Controller are all loops running inside this single binary, each responsible for reconciling one type of resource.
Worker Node Components
kubelet — The agent running on every worker node. It receives pod specifications from the API server and ensures the containers described in those specs are actually running on the node. If a container crashes, the kubelet restarts it.
kube-proxy — Manages network rules on each node so that network traffic destined for a Kubernetes Service is forwarded to the correct pods, regardless of which node they are on.
Container runtime — The software that actually pulls container images and starts containers on each node. The standard today is containerd, the same low-level runtime that Docker uses internally. You rarely interact with the container runtime directly — the kubelet handles it on your behalf.
Core Concepts
Pods
A pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that are tightly coupled — they always run on the same node, share the same network namespace (same IP address), and can share storage volumes.
In practice, most pods run a single container. Multi-container pods are used for sidecar patterns — for example, a logging agent that reads from the same filesystem as the main application, or a service mesh proxy like Envoy that intercepts all network traffic.
Pods are ephemeral. The kubelet can restart a crashed container inside its pod, but the pod itself is never moved or repaired: when a pod needs to be replaced (eviction, node failure, rolling update), the old pod is deleted and a new one is created. The new pod gets a new IP address, which is why you almost never talk to pods directly. Services handle that abstraction.
Deployments
A Deployment is the standard way to run a stateless application. You describe the desired state — "run 3 replicas of this container image" — and the Deployment controller continuously works to match reality to that description.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3 # How many pods to run
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myregistry/api:v1.2.0 # Always use a specific tag, not :latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "250m" # 250 millicores = 0.25 CPU
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
Deployments also manage rolling updates. When you update the image tag, Kubernetes starts new pods with the new image before terminating old ones, so there is never zero capacity during the transition. If the new pods fail their readiness checks, the rollout stalls instead of replacing the remaining healthy pods. You can roll back with a single command: kubectl rollout undo deployment/api-server.
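The pace of a rolling update is set by the Deployment's strategy field. The fragment below is a minimal sketch, assuming the api-server Deployment shown earlier; the specific values are illustrative assumptions, not recommendations from this article.

```yaml
# Fragment of a Deployment manifest: these keys sit under the top-level "spec:".
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow at most 1 pod above the replica count during the update
      maxUnavailable: 0  # never drop below the desired replica count
  minReadySeconds: 10    # a new pod must stay ready this long before it counts as available
```

With maxUnavailable set to 0, an old pod is only removed after its replacement is ready, which trades a slower rollout for full capacity throughout.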
Services
Because pods are ephemeral and their IP addresses change constantly, you need a stable endpoint to reach them. That is a Service.
A Service creates a stable DNS name and virtual IP that routes traffic to all pods matching a label selector. Traffic is load-balanced across the matching pods automatically.
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server # Routes to all pods with this label
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP # Internal-only by default
| Service Type | Who Can Reach It | Common Use |
|---|---|---|
| ClusterIP (default) | Other pods inside the cluster only | Internal microservice communication — e.g., the API calls the user-service at http://user-service:80 |
| NodePort | External traffic via a specific port on every node | Development and testing; not recommended for production because it exposes a raw port on every node |
| LoadBalancer | External traffic via a cloud load balancer | Exposing a single service to the internet on a cloud provider (AWS ELB, GCP Load Balancer) |
| ExternalName | Maps to an external DNS name | Pointing a Service at an external managed database like RDS without hardcoding the hostname in application code |
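To illustrate the ExternalName row, the sketch below maps an in-cluster name to a hostname outside the cluster. The Service name orders-db, the namespace, and the hostname are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db            # hypothetical name that application code connects to
  namespace: production      # assumed namespace
spec:
  type: ExternalName
  externalName: orders-db.example.com   # placeholder for the managed database's real hostname
```

Pods in the production namespace can then connect to orders-db, and cluster DNS answers with a CNAME to the external hostname. No proxying or load balancing is involved.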
Service DNS in Kubernetes: When you create a Service, Kubernetes automatically creates a DNS record for it. Within the same namespace, a pod reaches the Service simply by name: http://api-server:80. From a different namespace, use the fully-qualified name: http://api-server.production.svc.cluster.local:80. This built-in DNS is why microservices can discover each other by name without any manual configuration — one of the core conveniences Kubernetes provides over plain Docker.
Ingress
For production HTTP workloads, you rarely want one LoadBalancer Service per application — that creates one cloud load balancer per service, which adds cost and complexity. An Ingress solves this by acting as a single smart HTTP router in front of many services.
Ingress: One Entry Point for Multiple Services
An Ingress controller (typically nginx or Traefik) sits at the cluster edge, terminates TLS, and routes incoming HTTP requests to different internal services based on hostname or URL path — all through a single cloud load balancer.
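A host-and-path routing rule of the kind described above might look like the following sketch. It assumes an nginx ingress controller is already installed; the hostname, the TLS Secret name, and the web-frontend Service are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
spec:
  ingressClassName: nginx            # assumes an nginx ingress controller in the cluster
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-tls    # hypothetical Secret holding the TLS certificate and key
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api               # requests under /api go to the API Service
            pathType: Prefix
            backend:
              service:
                name: api-server
                port:
                  number: 80
          - path: /                  # everything else goes to the frontend
            pathType: Prefix
            backend:
              service:
                name: web-frontend   # hypothetical second Service
                port:
                  number: 80
```

Both backends can stay ClusterIP Services, since only the ingress controller itself needs to be exposed through the cloud load balancer.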
Namespaces
Namespaces are virtual clusters inside a physical cluster. They scope Kubernetes resources — a Service named api in the production namespace is completely separate from a Service named api in the staging namespace.
Common patterns:
- By environment: production, staging, development namespaces on the same cluster
- By team: team-payments, team-auth, team-frontend for large organizations
- System namespaces: kube-system (control plane components), kube-public, kube-node-lease (do not delete these)
Namespaces also let you apply ResourceQuotas (cap how much CPU and memory a namespace can consume) and RBAC policies (which teams can deploy to which namespace).
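As a sketch of what such a cap looks like, the ResourceQuota below limits the total compute a namespace may claim. The namespace name and every number are illustrative assumptions.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: staging         # assumed namespace; the quota applies only inside it
spec:
  hard:
    requests.cpu: "4"        # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"          # sum of CPU limits across all pods in the namespace
    limits.memory: 16Gi
    pods: "20"               # maximum number of pods
```

Once a compute quota like this is active, pods that do not declare the tracked requests and limits may be rejected at creation, so it pairs naturally with setting both on every container.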
Scheduling, Self-Healing, and Scaling
These three mechanisms are what make Kubernetes worth its complexity. Together they automate the operational work that previously required humans on-call.
Scheduling
When you create a pod, the kube-scheduler assigns it to a node in two phases:
- Filter — eliminate nodes that cannot run the pod (insufficient CPU/memory, node is not ready, pod has a node affinity requiring specific labels the node lacks)
- Score — rank the remaining nodes by preference (spread pods across nodes for redundancy, pack pods tightly on fewer nodes to save cost, prefer nodes with the required image already cached)
The scheduler assigns the pod to the highest-scoring node. The kubelet on that node then starts the containers.
Resource requests and limits feed directly into the filter phase. When you specify requests.cpu: 250m, you are telling the scheduler "this pod needs at least 0.25 CPU to run". The scheduler only considers nodes with at least that much unallocated CPU. Setting no requests means the scheduler treats the pod as needing zero resources — it can land on an already-overloaded node, causing unpredictable performance.
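The filter and score phases can both be steered from the pod spec. The sketch below is one possible combination, assuming nodes carry a hypothetical disktype label; the pod name is also illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-scheduling-demo            # hypothetical pod name
  labels:
    app: api-server
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: nodes without this label are removed in the filter phase.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype          # assumed node label
                operator: In
                values: ["ssd"]
  topologySpreadConstraints:
    # Soft preference: influences scoring so replicas spread across nodes.
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api-server
  containers:
    - name: api
      image: myregistry/api:v1.2.0
      resources:
        requests:                      # nodes with less unallocated capacity are filtered out
          cpu: "250m"
          memory: "256Mi"
```

Changing whenUnsatisfiable to DoNotSchedule would turn the spread preference into a hard filter, at the risk of leaving pods pending.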
Rule of thumb for requests vs. limits: Set requests to the amount of CPU and memory your container typically needs under normal load. Set limits to the maximum you are willing to let it consume before it is throttled (CPU) or killed (memory). Always set both in production, because unset limits allow a buggy container to consume all resources on a node.
Self-Healing
Kubernetes continuously reconciles actual state against desired state. When reality diverges from the spec — a container crashes, a node goes down — controllers correct it automatically.
Kubernetes Self-Healing Mechanisms
Kubernetes uses three interlocking mechanisms to detect and recover from failures: liveness probes (restart broken containers), readiness probes (stop sending traffic to unready pods), and the ReplicaSet controller (maintain the desired number of healthy replicas).
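The two probe types are declared per container. The following is a minimal sketch; the /healthz and /ready endpoints and the timing values are assumptions about the application, not something Kubernetes provides.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-probe-demo               # hypothetical pod name
  labels:
    app: api-server
spec:
  containers:
    - name: api
      image: myregistry/api:v1.2.0
      ports:
        - containerPort: 8000
      livenessProbe:                 # repeated failure makes the kubelet restart the container
        httpGet:
          path: /healthz             # assumed health endpoint
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:                # failure removes the pod from Service endpoints, no restart
        httpGet:
          path: /ready               # assumed readiness endpoint
          port: 8000
        periodSeconds: 5
        failureThreshold: 3
```

In practice these fields would live in a Deployment's pod template, so the ReplicaSet controller covers the third mechanism by replacing pods that disappear.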
Autoscaling
Kubernetes has three distinct autoscaling mechanisms that work at different levels:
| Mechanism | What It Scales | Trigger | Typical Use Case |
|---|---|---|---|
| HPA — Horizontal Pod Autoscaler | Number of pod replicas in a Deployment | CPU utilization, memory utilization, or custom metrics (requests/sec, queue depth) | Web APIs that need more copies under traffic load |
| VPA — Vertical Pod Autoscaler | CPU/memory requests of individual pods | Historical resource usage | Long-running jobs where you are unsure of the right resource requests |
| Cluster Autoscaler | Number of nodes in the cluster | Pending pods that cannot be scheduled (not enough room) | Cloud clusters where you want nodes to appear and disappear with workload |
HPA is the most commonly used. It is a control loop that runs every 15 seconds by default and adjusts replica count using this formula:
desiredReplicas = ceil( currentReplicas × (currentMetricValue ÷ targetMetricValue) )
Example: You have 2 replicas running. The target CPU utilization is 50%. Current average CPU across the 2 pods is 80%. HPA calculates ceil(2 × (80 ÷ 50)) = ceil(3.2) = 4 replicas and scales up to 4.
Horizontal Pod Autoscaler (HPA)
HPA watches a metric, most commonly CPU utilization, and adjusts the number of pod replicas to keep the metric near the target. Scale-up is fast; scale-down is intentionally slow (a 5-minute stabilization window by default) to avoid thrashing under bursty traffic.
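An HPA matching the worked example (50% CPU target) could be declared as in the sketch below. It assumes the api-server Deployment from earlier and a metrics source such as metrics-server in the cluster; the replica bounds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:                    # the workload whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2                     # assumed lower bound
  maxReplicas: 10                    # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50     # percent of each pod's CPU request, averaged across pods
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the default 5-minute window, written out explicitly
```

Utilization is measured against the container's CPU request, so this only works when requests are set on the target pods.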
Docker and Kubernetes: How They Fit Together
Docker and Kubernetes are complementary, not competing. A useful analogy: Docker is like a standardized shipping container (a self-contained box with everything inside). Kubernetes is like the port authority that manages a fleet of cargo ships — deciding which ship carries which containers, rerouting when a ship goes down, and calling in more ships when the port gets busy.
The workflow in practice:
- You write code and a Dockerfile
- Your CI pipeline runs docker build and docker push to push the image to a registry (ECR, GCR, Docker Hub)
- Kubernetes pulls the image from the registry and runs it as pods on worker nodes
- Kubernetes handles everything after that: scheduling, restarts, scaling, updates
Docker answers: "How do I package my application into a portable unit?" Kubernetes answers: "How do I run that unit reliably at scale across many machines?"
| | Docker | Kubernetes |
|---|---|---|
| Core job | Build and run individual containers | Orchestrate many containers across many machines |
| Scope | Single host (or Compose: single machine) | Multi-node cluster |
| Handles failure? | Only with a restart policy, and only on the same host; by default a crashed container stays crashed | Yes: automatically restarts containers and reschedules pods onto healthy nodes |
| Scales automatically? | No — you manage replica count manually | Yes — HPA adjusts replicas based on metrics |
| Service discovery | Manual (user-defined networks in Compose) | Built-in DNS for all Services |
| Rolling updates | Manual — stop old, start new | Automatic — new pods before old pods terminate |
| Analogy | A cargo container and the crane that loads it | The port authority managing the entire fleet |
Kubernetes and AI Workloads
Kubernetes is the de facto platform for running AI inference services and, increasingly, training jobs. AI workloads add specific challenges on top of standard container orchestration.
Why Kubernetes for AI?
AI inference services — APIs that serve a model — are stateless HTTP services. HPA scales them exactly like any other API: when request volume rises, more pods start. When it drops, pods scale down. The standard Kubernetes patterns apply directly.
The harder case is GPU workloads. GPUs are expensive, require special drivers, and behave differently from CPUs in how Kubernetes schedules them.
AI Inference Service on Kubernetes
An AI inference API running on Kubernetes benefits from automatic scaling, health probes, and rolling updates — the same mechanisms that benefit any stateless service. The primary additions are GPU resource requests, a longer readiness probe timeout for model loading, and scaling based on request latency rather than CPU.
Key GPU Scheduling Concepts
GPU resource requests: You request a GPU by setting nvidia.com/gpu: 1 under resources.limits in the container spec. The NVIDIA device plugin (deployed by the NVIDIA GPU Operator) advertises each node's GPUs to Kubernetes, and the scheduler then only places the pod on a node with a free GPU.
NVIDIA GPU Operator — A Kubernetes operator that automates the installation of GPU drivers, the NVIDIA Container Toolkit (which allows containers to access GPUs), and the device plugin on every GPU node in the cluster. Without it, you would need to pre-install drivers on each node manually.
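A GPU-backed inference pod might be declared as in the sketch below. It assumes the NVIDIA device plugin is already running on the GPU nodes; the image name and the /ready endpoint are hypothetical. To allow for slow model loading it uses a startupProbe, which is one way to get a long startup allowance and is a choice made here rather than something the text prescribes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu-demo                     # hypothetical pod name
  labels:
    app: inference
spec:
  containers:
    - name: model-server
      image: myregistry/model-server:v0.1.0    # hypothetical image
      ports:
        - containerPort: 8000
      resources:
        limits:
          nvidia.com/gpu: 1          # whole GPUs only; set under limits
      startupProbe:                  # allows up to 30 x 10s = 300s for the model to load
        httpGet:
          path: /ready               # assumed endpoint that succeeds once the model is loaded
          port: 8000
        periodSeconds: 10
        failureThreshold: 30
      readinessProbe:                # takes over after startup succeeds
        httpGet:
          path: /ready
          port: 8000
        periodSeconds: 10
```

If the startup probe never succeeds within its window, the kubelet restarts the container, so the window should comfortably exceed the worst-case model load time.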
Gang scheduling — Distributed training jobs (e.g., training a model across 8 GPUs on 4 nodes) require all pods to start simultaneously, or none should start. Standard Kubernetes schedules pods independently — it might start 7 of 8, and the 8th pod cannot be scheduled because no GPU node has capacity, leaving 7 GPUs idle. Specialized tools like Volcano, Kueue, and the NVIDIA KAI Scheduler add gang scheduling support.
What AI Agents Get Wrong with Kubernetes
AI agents can generate Kubernetes YAML that is functional — it deploys and runs — but is missing the production requirements that make it actually reliable and secure.
AI-Generated Kubernetes Manifests: Common Gaps
AI agents produce structurally valid Kubernetes YAML but consistently omit resource requests and limits, health probes, security context settings, and replica counts above 1. These omissions produce deployments that pass basic tests but fail under production load or are a security liability.
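For contrast, a manifest that fills in the commonly omitted pieces could look like the sketch below. It reuses the api-server example; the probe endpoints are assumptions, and the security settings assume the image can run as a non-root user with a read-only root filesystem.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                          # more than one replica
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      securityContext:
        runAsNonRoot: true             # refuse to start containers that would run as root
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          image: myregistry/api:v1.2.0
          ports:
            - containerPort: 8000
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:                   # requests and limits both set
            requests: {cpu: "250m", memory: "256Mi"}
            limits: {cpu: "500m", memory: "512Mi"}
          readinessProbe:
            httpGet: {path: /ready, port: 8000}      # assumed endpoint
          livenessProbe:
            httpGet: {path: /healthz, port: 8000}    # assumed endpoint
```

An application that writes to local disk would also need a writable volume (for example an emptyDir) mounted at the paths it uses.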
A Practical Mental Model: The Declarative Loop
The deepest shift in thinking that Kubernetes requires is moving from imperative to declarative operations.
Imperative (how most people think initially): "Start 3 containers on these specific servers."
Declarative (the Kubernetes way): "The desired state is: 3 replicas of this container. Make it so, and keep it that way forever."
The difference is subtle but profound. When you apply a Kubernetes manifest, you are not issuing a command — you are updating the cluster's desired state in etcd. The controllers then continuously work to match the actual state of the cluster to that desired state. If a pod crashes, the desired state still says 3 replicas, so the controller creates a replacement. If you scale from 3 to 5, you update the desired state to 5, and the controller creates 2 new pods.
This declarative model is why Kubernetes manifests are stored in version control just like application code. The YAML files in your repository are your infrastructure — checking them into git means the state of your cluster is auditable, reviewable, and recoverable.
Summary
| Concept | Key Takeaway |
|---|---|
| What K8s solves | Manual container management at scale: restarts, scheduling, scaling, discovery, and rolling updates — all automated |
| Control plane vs. workers | Control plane (apiserver, etcd, scheduler, controller-manager) manages desired state. Worker nodes (kubelet, kube-proxy, container runtime) run pods. |
| Pod | Smallest deployable unit: one or more containers sharing a network namespace. Ephemeral by design: a pod is never moved or repaired, only replaced by a new one. |
| Deployment | Declarative spec for running N replicas of a container, with rolling updates and rollback built in. |
| Service | Stable DNS name and virtual IP routing traffic to pods. ClusterIP for internal; LoadBalancer for external; Ingress for multi-service HTTP routing. |
| Resource requests/limits | Requests tell the scheduler what the pod needs; limits cap consumption. Always set both — they are the foundation of safe scheduling and cost control. |
| Health probes | Liveness probes restart deadlocked containers. Readiness probes withhold traffic from pods that are not yet ready. Always define both. |
| HPA | Horizontal Pod Autoscaler adjusts replica count based on CPU, memory, or custom metrics. Pairs with Cluster Autoscaler to grow/shrink the node pool automatically. |
| Declarative model | Apply desired state; controllers reconcile continuously. Store manifests in version control — the YAML files are your infrastructure. |
| AI workloads | GPU requests, long readiness probe timeouts for model loading, and metric-based autoscaling (KEDA) are the primary additions for AI inference on K8s. |
| AI agent gaps | AI generates structurally valid YAML but omits resource limits, health probes, and security contexts. Always explicitly request these, and lint manifests with tools like kube-score. |