Container Orchestration

Docker solved the "works on my machine" problem by packaging an application and its dependencies into a portable container. But Docker alone answers only the first question: how do I run one container reliably? In production, you are not running one container — you are running dozens or hundreds, across multiple machines, and they need to start, stop, recover from failures, scale up under load, and communicate with each other without any of that being managed by hand.

This is the container orchestration problem, and Kubernetes is the industry-standard answer to it.

The Problem Kubernetes Solves

To understand why Kubernetes exists, picture what life looked like before it.

A team has five microservices, each running as a Docker container, and three production servers. They write shell scripts to decide which service runs on which server. When a container crashes at 2am, an alert fires and someone SSHs in to restart it. When traffic spikes, they manually start more copies of the API server with docker run, on whichever server happens to have spare CPU, if they remember to check. When they deploy a new version of a service, they stop the old container and start the new one, which means a few seconds of downtime. When one of the three servers dies, everything that was running on it is gone until someone notices.

This was the reality for teams running containerized apps at scale before orchestrators existed. The work of managing the containers became a second full-time job.

Manual Problem	What Kubernetes Does Automatically
Container crashed — needs restart	Detects failure via health checks and restarts the container automatically
Traffic spike — need more copies	Scales replicas up and down based on CPU, memory, or custom metrics (HPA)
Deploy a new version — downtime	Performs rolling updates: starts new pods before stopping old ones, zero downtime
Server died — all containers lost	Detects node failure and reschedules all affected containers on healthy nodes
Which server has spare capacity?	Scheduler finds the best-fit node automatically based on resource availability
Services need to discover each other	Provides built-in DNS so services find each other by name, not by IP address

Kubernetes originated as an open-source project from Google in 2014, encoding over a decade of lessons from their internal cluster management systems (Borg and Omega). Today it is the foundation of virtually every major cloud platform's container offering: Amazon EKS, Google GKE, and Azure AKS are all managed Kubernetes.

Kubernetes Architecture

A Kubernetes cluster is a set of machines — physical or virtual — divided into two roles: the control plane (the brain) and worker nodes (the muscle). You interact with the cluster using kubectl (the Kubernetes command-line tool), which sends your requests to the control plane. For example, kubectl apply -f deployment.yaml tells Kubernetes to create or update the resources described in that file, and kubectl get pods lists the pods currently running in the cluster. For a more interactive experience, k9s provides a terminal-based UI that lets you navigate and manage your cluster resources in real time — I highly recommend using it.

Kubernetes Cluster Architecture

The control plane manages cluster state and makes decisions. Worker nodes run the actual application containers. Every component communicates through the API server — the single front door to the cluster.

Control Plane Components

kube-apiserver — The single entry point to the cluster. Every operation — kubectl apply, scaling decisions, health checks — is an API call to the apiserver. It is the only component that reads and writes to etcd directly.

etcd — A distributed key-value store that holds the entire cluster state: every deployment, service, pod, and configuration. Think of it as the cluster's database. It is strongly consistent — every read returns the latest committed write.

kube-scheduler — Watches for new pods that have not been assigned to a node, and picks the best node for them. "Best" is determined by filtering (does the node have enough CPU/memory?) and ranking (which node has the most available capacity, or best matches affinity rules?).

kube-controller-manager — Runs a collection of controllers that continuously watch the cluster state and fix any divergence from what was requested. The Deployment Controller, ReplicaSet Controller, and Node Controller are all loops running inside this single binary, each responsible for reconciling one type of resource.

Worker Node Components

kubelet — The agent running on every worker node. It receives pod specifications from the API server and ensures the containers described in those specs are actually running on the node. If a container crashes, the kubelet restarts it.

kube-proxy — Manages network rules on each node so that network traffic destined for a Kubernetes Service is forwarded to the correct pods, regardless of which node they are on.

Container runtime — The software that actually pulls container images and starts containers on each node. The standard today is containerd, the same low-level runtime that Docker uses internally. You rarely interact with the container runtime directly — the kubelet handles it on your behalf.

Core Concepts

Pods

A pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that are tightly coupled — they always run on the same node, share the same network namespace (same IP address), and can share storage volumes.

In practice, most pods run a single container. Multi-container pods are used for sidecar patterns — for example, a logging agent that reads from the same filesystem as the main application, or a service mesh proxy like Envoy that intercepts all network traffic.
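
The logging-agent sidecar described above might look roughly like the following. This is a minimal sketch: the pod name, the log-agent image, and the /var/log/app path are hypothetical, chosen only to show two containers sharing one volume inside a single pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-with-log-agent              # hypothetical name
spec:
  volumes:
  - name: app-logs
    emptyDir: {}                        # scratch volume that lives as long as the pod
  containers:
  - name: api
    image: myregistry/api:v1.2.0
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app           # assumed path where the app writes log files
  - name: log-agent
    image: myregistry/log-agent:1.0.0   # hypothetical sidecar image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true                    # the agent only reads what the app writes
```

Both containers are scheduled together onto the same node and share the pod's IP address, so the agent could also reach the application over localhost.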

Pods are ephemeral. A crashed container is restarted by the kubelet inside the same pod, but a pod itself is never moved or repaired: when it has to be replaced (eviction, node failure, rolling update), the old pod is deleted and a new one is created. The new pod gets a new IP address, which is why you almost never talk to pods directly. Services handle that abstraction.

Deployments

A Deployment is the standard way to run a stateless application. You describe the desired state — "run 3 replicas of this container image" — and the Deployment controller continuously works to match reality to that description.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                         # How many pods to run
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myregistry/api:v1.2.0  # Always use a specific tag, not :latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "250m"               # 250 millicores = 0.25 CPU
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

Deployments also manage rolling updates. When you update the image tag, Kubernetes starts new pods with the new image before terminating old ones, so the Deployment never drops to zero capacity during the transition. If the new pods fail their readiness checks, the rollout stalls and the remaining old pods keep serving traffic; Kubernetes does not roll back on its own. You can roll back with a single command, kubectl rollout undo.
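
The pace of a rolling update can be tuned in the Deployment itself. The sketch below reuses the api-server Deployment from above; the v1.3.0 tag, the /ready path, and the specific maxSurge and maxUnavailable values are assumptions picked for illustration, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                       # at most 1 extra pod above the desired 3 during the update
      maxUnavailable: 0                 # never take an old pod down before its replacement is ready
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myregistry/api:v1.3.0    # hypothetical new version being rolled out
        ports:
        - containerPort: 8000
        readinessProbe:                 # the rollout only advances once new pods report ready
          httpGet:
            path: /ready                # hypothetical readiness endpoint
            port: 8000
```

With these settings the update proceeds one pod at a time. If the new version never becomes ready, kubectl rollout undo deployment/api-server returns the Deployment to the previous revision.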

Services

Because pods are ephemeral and their IP addresses change constantly, you need a stable endpoint to reach them. That is a Service.

A Service creates a stable DNS name and virtual IP that routes traffic to all pods matching a label selector. Traffic is load-balanced across the matching pods automatically.

apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server         # Routes to all pods with this label
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP           # Internal-only by default

Service Type	Who Can Reach It	Common Use
ClusterIP (default)	Other pods inside the cluster only	Internal microservice communication — e.g., the API calls the user-service at `http://user-service:80`
NodePort	External traffic via a specific port on every node	Development and testing; not recommended for production because it exposes a raw port on every node
LoadBalancer	External traffic via a cloud load balancer	Exposing a single service to the internet on a cloud provider (AWS ELB, GCP Load Balancer)
ExternalName	Maps to an external DNS name	Pointing a Service at an external managed database like RDS without hardcoding the hostname in application code
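
As an illustration of the ExternalName row, here is a minimal sketch. The Service name orders-db and the hostname are placeholders, not real endpoints.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db                       # hypothetical in-cluster name that applications connect to
spec:
  type: ExternalName
  # Placeholder hostname; substitute the endpoint of your managed database.
  externalName: orders-db.example.com
```

An ExternalName Service has no selector and no pods behind it. Cluster DNS answers lookups for orders-db with a CNAME to the external hostname, so the application configuration only ever refers to the in-cluster name.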

Service DNS in Kubernetes: When you create a Service, Kubernetes automatically creates a DNS record for it. Within the same namespace, a pod reaches the Service simply by name: http://api-server:80. From a different namespace, use the fully-qualified name: http://api-server.production.svc.cluster.local:80. This built-in DNS is why microservices can discover each other by name without any manual configuration — one of the core conveniences Kubernetes provides over plain Docker.

Ingress

For production HTTP workloads, you rarely want one LoadBalancer Service per application — that creates one cloud load balancer per service, which adds cost and complexity. An Ingress solves this by acting as a single smart HTTP router in front of many services.

Ingress: One Entry Point for Multiple Services

An Ingress controller (typically nginx or Traefik) sits at the cluster edge, terminates TLS, and routes incoming HTTP requests to different internal services based on hostname or URL path — all through a single cloud load balancer.
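
A path-based routing setup of this kind could be declared as follows. This is a sketch under assumptions: an nginx Ingress controller is already installed with the IngressClass name nginx, a TLS secret called example-tls exists, and api.example.com is a placeholder hostname.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress                    # hypothetical name
spec:
  ingressClassName: nginx               # assumes an nginx Ingress controller is installed
  tls:
  - hosts:
    - api.example.com                   # placeholder hostname
    secretName: example-tls             # assumed pre-existing TLS certificate secret
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /users                    # requests under /users go to user-service
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 80
      - path: /                         # everything else goes to api-server
        pathType: Prefix
        backend:
          service:
            name: api-server
            port:
              number: 80
```

Both backends can remain ClusterIP Services. Only the Ingress controller itself needs to be exposed through a cloud load balancer.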

Namespaces

Namespaces are virtual clusters inside a physical cluster. They scope Kubernetes resources — a Service named api in the production namespace is completely separate from a Service named api in the staging namespace.

Common patterns:

By environment: production, staging, development namespaces on the same cluster
By team: team-payments, team-auth, team-frontend for large organizations
System namespaces: kube-system (control plane components), kube-public, kube-node-lease — do not delete these

Namespaces also let you apply ResourceQuotas (cap how much CPU and memory a namespace can consume) and RBAC policies (which teams can deploy to which namespace).
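
A namespace with a quota and a simple access rule might be declared like this. The quota figures and the team-payments group are assumptions for illustration; edit is one of the default ClusterRoles that Kubernetes ships with.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota                   # hypothetical name
  namespace: staging
spec:
  hard:
    requests.cpu: "4"                   # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"                     # sum of CPU limits across all pods in the namespace
    limits.memory: 16Gi
    pods: "20"                          # maximum number of pods
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-can-edit               # hypothetical name
  namespace: staging                    # the grant applies to this namespace only
subjects:
- kind: Group
  name: team-payments                   # assumed group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                            # built-in role: read/write most namespaced resources
  apiGroup: rbac.authorization.k8s.io
```

Once a quota covers CPU or memory, pods in that namespace that omit the corresponding requests or limits are rejected at creation time, which is one practical way to enforce that they are always set.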

Scheduling, Self-Healing, and Scaling

These three mechanisms are what make Kubernetes worth its complexity. Together they automate the operational work that previously required humans on-call.

Scheduling

When you create a pod, the kube-scheduler assigns it to a node in two phases:

Filter — eliminate nodes that cannot run the pod (insufficient CPU/memory, node is not ready, pod has a node affinity requiring specific labels the node lacks)
Score — rank the remaining nodes by preference (spread pods across nodes for redundancy, pack pods tightly on fewer nodes to save cost, prefer nodes with the required image already cached)

The scheduler assigns the pod to the highest-scoring node. The kubelet on that node then starts the containers.
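
The two phases map onto fields you can set in a pod spec. In the sketch below, the disktype label and the zone value are hypothetical. The required rule acts as a filter and the preferred rule contributes to the score.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-pinned                      # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Filter phase: nodes without this label are eliminated outright.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype               # hypothetical node label
            operator: In
            values: ["ssd"]
      # Score phase: matching nodes rank higher, but others remain eligible.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]      # placeholder zone
  containers:
  - name: api
    image: myregistry/api:v1.2.0
    resources:
      requests:
        cpu: "250m"                     # also a filter: the node needs this much unallocated CPU
        memory: "256Mi"
```

If no node passes the filter, the pod stays Pending until one does.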

Resource requests and limits feed directly into the filter phase. When you specify requests.cpu: 250m, you are telling the scheduler "this pod needs at least 0.25 CPU to run". The scheduler only considers nodes with at least that much unallocated CPU. Setting no requests means the scheduler treats the pod as needing zero resources — it can land on an already-overloaded node, causing unpredictable performance.

Rule of thumb for requests vs. limits: Set requests to the amount of CPU and memory your container typically needs under normal load. Set limits to the maximum you are willing to let it consume before it is throttled (CPU) or killed (memory). Always set both in production — unset limits allow a buggy container to consume all resources on a node.

Self-Healing

Kubernetes continuously reconciles actual state against desired state. When reality diverges from the spec — a container crashes, a node goes down — controllers correct it automatically.

Kubernetes Self-Healing Mechanisms

Kubernetes uses three interlocking mechanisms to detect and recover from failures: liveness probes (restart broken containers), readiness probes (stop sending traffic to unready pods), and the ReplicaSet controller (maintain the desired number of healthy replicas).
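
The two probe types are configured per container. The following is a minimal sketch: the /healthz and /ready endpoints are assumptions about what the application exposes, and the timing values are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-probed                      # hypothetical name
spec:
  containers:
  - name: api
    image: myregistry/api:v1.2.0
    ports:
    - containerPort: 8000
    livenessProbe:                      # failing this makes the kubelet restart the container
      httpGet:
        path: /healthz                  # hypothetical liveness endpoint
        port: 8000
      periodSeconds: 10
      failureThreshold: 3               # restart after 3 consecutive failures
    readinessProbe:                     # failing this removes the pod from Service endpoints
      httpGet:
        path: /ready                    # hypothetical readiness endpoint
        port: 8000
      periodSeconds: 5
      failureThreshold: 2
```

A failed readiness probe does not restart anything. It only stops traffic from being routed to the pod until the probe passes again.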

Autoscaling

Kubernetes has three distinct autoscaling mechanisms that work at different levels:

Mechanism	What It Scales	Trigger	Typical Use Case
HPA — Horizontal Pod Autoscaler	Number of pod replicas in a Deployment	CPU utilization, memory utilization, or custom metrics (requests/sec, queue depth)	Web APIs that need more copies under traffic load
VPA — Vertical Pod Autoscaler	CPU/memory `requests` of individual pods	Historical resource usage	Long-running jobs where you are unsure of the right resource requests
Cluster Autoscaler	Number of nodes in the cluster	Pending pods that cannot be scheduled (not enough room)	Cloud clusters where you want nodes to appear and disappear with workload

HPA is the most commonly used. It is a control loop that runs every 15 seconds by default and adjusts replica count using this formula:

desiredReplicas = ceil( currentReplicas × (currentMetricValue ÷ targetMetricValue) )

Example: You have 2 replicas running. The target CPU utilization is 50%. Current average CPU across the 2 pods is 80%. HPA calculates ceil(2 × (80 ÷ 50)) = ceil(3.2) = 4 replicas and scales up to 4.
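
The scenario in this example could be configured with a manifest along these lines. It is a sketch that targets the api-server Deployment from earlier; the maxReplicas value is an assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:                       # the workload whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10                       # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50          # target: 50% of each pod's CPU request, averaged
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the default, written out explicitly here
```

Utilization is computed as a percentage of each pod's CPU request, so this only works when the containers declare requests.cpu.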

Horizontal Pod Autoscaler (HPA)

HPA watches a metric — most commonly CPU utilization — and adjusts the number of pod replicas to keep the metric near the target. Scale-up is fast; scale-down is intentionally slow (default 5 minutes) to avoid thrashing under bursty traffic.

Docker and Kubernetes: How They Fit Together

Docker and Kubernetes are complementary, not competing. A useful analogy: Docker is like a standardized shipping container (a self-contained box with everything inside). Kubernetes is like the port authority that manages a fleet of cargo ships — deciding which ship carries which containers, rerouting when a ship goes down, and calling in more ships when the port gets busy.

The workflow in practice:

You write code and a Dockerfile
Your CI pipeline runs docker build and docker push to push the image to a registry (ECR, GCR, Docker Hub)
Kubernetes pulls the image from the registry and runs it as pods on worker nodes
Kubernetes handles everything after that: scheduling, restarts, scaling, updates

Docker answers: "How do I package my application into a portable unit?" Kubernetes answers: "How do I run that unit reliably at scale across many machines?"

	Docker	Kubernetes
Core job	Build and run individual containers	Orchestrate many containers across many machines
Scope	Single host (or Compose: single machine)	Multi-node cluster
Handles failure?	No — a crashed container stays crashed	Yes — automatically restarts and reschedules
Scales automatically?	No — you manage replica count manually	Yes — HPA adjusts replicas based on metrics
Service discovery	Manual (user-defined networks in Compose)	Built-in DNS for all Services
Rolling updates	Manual — stop old, start new	Automatic — new pods before old pods terminate
Analogy	A cargo container and the crane that loads it	The port authority managing the entire fleet

Kubernetes and AI Workloads

Kubernetes is the de facto platform for running AI inference services and, increasingly, training jobs. AI workloads add specific challenges on top of standard container orchestration.

Why Kubernetes for AI?

AI inference services — APIs that serve a model — are stateless HTTP services. HPA scales them exactly like any other API: when request volume rises, more pods start. When it drops, pods scale down. The standard Kubernetes patterns apply directly.

The harder case is GPU workloads. GPUs are expensive, require special drivers, and behave differently from CPUs in how Kubernetes schedules them.

AI Inference Service on Kubernetes

An AI inference API running on Kubernetes benefits from automatic scaling, health probes, and rolling updates — the same mechanisms that benefit any stateless service. The primary additions are GPU resource requests, a longer readiness probe timeout for model loading, and scaling based on request latency rather than CPU.

Key GPU Scheduling Concepts

GPU resource requests — You request a GPU with nvidia.com/gpu: 1 in the pod's resource spec. The NVIDIA device plugin (deployed by the NVIDIA GPU Operator) advertises available GPUs to the Kubernetes scheduler, which then only places the pod on a node with a free GPU.
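
A GPU-backed inference Deployment might look roughly like this. It assumes the NVIDIA device plugin is already running on the GPU nodes. The image name, the /ready endpoint, and the probe timings are hypothetical and would depend on how long the model takes to load.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api                   # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
      - name: model-server
        image: myregistry/inference:v0.1.0   # hypothetical image
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1           # GPUs are declared under limits, in whole units
        readinessProbe:
          httpGet:
            path: /ready                # hypothetical endpoint that succeeds once the model is loaded
            port: 8000
          initialDelaySeconds: 60       # assumed: give the model time to load before probing
          periodSeconds: 10
          failureThreshold: 30          # tolerate a long warm-up before giving up
```

Extended resources such as nvidia.com/gpu are specified in limits. If a request is also given, it must equal the limit, and GPUs cannot be requested in fractions through this field.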

NVIDIA GPU Operator — A Kubernetes operator that automates the installation of GPU drivers, the NVIDIA Container Toolkit (which allows containers to access GPUs), and the device plugin on every GPU node in the cluster. Without it, you would need to pre-install drivers on each node manually.

Gang scheduling — Distributed training jobs (e.g., training a model across 8 GPUs on 4 nodes) require all pods to start simultaneously, or none should start. Standard Kubernetes schedules pods independently — it might start 7 of 8, and the 8th pod cannot be scheduled because no GPU node has capacity, leaving 7 GPUs idle. Specialized tools like Volcano, Kueue, and the NVIDIA KAI Scheduler add gang scheduling support.
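
With Volcano, one of the tools named above, the all-or-nothing requirement is expressed through minAvailable. The sketch below assumes Volcano is installed in the cluster and that its batch.volcano.sh/v1alpha1 Job resource is available; the job name, image, and GPU layout (4 workers with 2 GPUs each) are hypothetical.

```yaml
apiVersion: batch.volcano.sh/v1alpha1   # Volcano's Job type, not the built-in batch/v1 Job
kind: Job
metadata:
  name: distributed-training            # hypothetical name
spec:
  schedulerName: volcano                # hand scheduling to Volcano instead of kube-scheduler
  minAvailable: 4                       # start only if all 4 worker pods can be placed together
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: myregistry/trainer:v0.1.0   # hypothetical training image
          resources:
            limits:
              nvidia.com/gpu: 2         # 4 workers x 2 GPUs = 8 GPUs in total
```

If fewer than four workers can be placed, none of them start, so no GPUs sit reserved by a job that cannot make progress.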

What AI Agents Get Wrong with Kubernetes

AI agents can generate Kubernetes YAML that is functional — it deploys and runs — but is missing the production requirements that make it actually reliable and secure.

AI-Generated Kubernetes Manifests: Common Gaps

AI agents produce structurally valid Kubernetes YAML but consistently omit resource requests and limits, health probes, security context settings, and replica counts above 1. These omissions produce deployments that pass basic tests but fail under production load or are a security liability.
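
For comparison, here is a sketch of a Deployment with those four gaps filled in. The probe endpoints are hypothetical, and the security settings assume the image runs as a non-root user and does not need to write to its root filesystem.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                           # more than 1, so a single pod failure is not an outage
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      securityContext:
        runAsNonRoot: true              # refuse to start containers that would run as root
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        image: myregistry/api:v1.2.0
        ports:
        - containerPort: 8000
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true  # assumes the app writes only to mounted volumes
          capabilities:
            drop: ["ALL"]
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        livenessProbe:
          httpGet:
            path: /healthz              # hypothetical endpoint
            port: 8000
        readinessProbe:
          httpGet:
            path: /ready                # hypothetical endpoint
            port: 8000
```

A generated manifest that lacks any of these blocks is worth sending back with an explicit request to add them.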

A Practical Mental Model: The Declarative Loop

The deepest shift in thinking that Kubernetes requires is moving from imperative to declarative operations.

Imperative (how most people think initially): "Start 3 containers on these specific servers."

Declarative (the Kubernetes way): "The desired state is: 3 replicas of this container. Make it so, and keep it that way forever."

The difference is subtle but profound. When you apply a Kubernetes manifest, you are not issuing a command — you are updating the cluster's desired state in etcd. The controllers then continuously work to match the actual state of the cluster to that desired state. If a pod crashes, the desired state still says 3 replicas, so the controller creates a replacement. If you scale from 3 to 5, you update the desired state to 5, and the controller creates 2 new pods.

This declarative model is why Kubernetes manifests are stored in version control just like application code. The YAML files in your repository are your infrastructure — checking them into git means the state of your cluster is auditable, reviewable, and recoverable.

Summary

Concept	Key Takeaway
What K8s solves	Manual container management at scale: restarts, scheduling, scaling, discovery, and rolling updates — all automated
Control plane vs. workers	Control plane (apiserver, etcd, scheduler, controller-manager) manages desired state. Worker nodes (kubelet, kube-proxy, container runtime) run pods.
Pod	Smallest deployable unit — one or more containers sharing a network namespace. Ephemeral by design: never restart in place, always replaced.
Deployment	Declarative spec for running N replicas of a container, with rolling updates and rollback built in.
Service	Stable DNS name and virtual IP routing traffic to pods. ClusterIP for internal; LoadBalancer for external; Ingress for multi-service HTTP routing.
Resource requests/limits	Requests tell the scheduler what the pod needs; limits cap consumption. Always set both — they are the foundation of safe scheduling and cost control.
Health probes	Liveness probes restart deadlocked containers. Readiness probes withhold traffic from pods that are not yet ready. Always define both.
HPA	Horizontal Pod Autoscaler adjusts replica count based on CPU, memory, or custom metrics. Pairs with Cluster Autoscaler to grow/shrink the node pool automatically.
Declarative model	Apply desired state; controllers reconcile continuously. Store manifests in version control — the YAML files are your infrastructure.
AI workloads	GPU requests, long readiness probe timeouts for model loading, and metric-based autoscaling (KEDA) are the primary additions for AI inference on K8s.
AI agent gaps	AI generates structurally valid YAML but omits resource limits, health probes, and security contexts. Always explicitly request these, and lint manifests with tools like kube-score.
