Load Balancing & Traffic

Imagine your web application is running on a single server. On a quiet Tuesday morning, it handles requests fine. Then a blog post about your product goes viral. Thousands of users arrive simultaneously. The server is overwhelmed — CPU spikes, requests queue up, response times climb from 50ms to 10 seconds — and eventually the server crashes, taking your entire service down with it.

Load balancing is the solution: a component that sits in front of your servers and distributes incoming requests across multiple backend instances. When one server is busy, the load balancer sends the next request to a less-loaded one. If a server crashes, the load balancer stops sending it traffic. Users don't see any of this — they just get a fast, reliable response.

Not all load balancers work the same way, however. The layer of the network at which they operate determines what information they can see, what routing decisions they can make, and how fast they run. Choosing between L4 and L7 load balancing is one of the most consequential architectural decisions you will make when scaling a system.

The OSI Model: Two Layers That Matter

The OSI model is a conceptual framework that describes how different networking protocols work together. It defines seven layers, but for load balancing, two are most relevant. Knowing which layer a load balancer operates at tells you exactly what information it can see — and therefore what routing decisions it can make:

[Diagram: OSI Model (Simplified)]
  • Layer 4 (Transport) understands TCP and UDP connections. It can see source and destination IP addresses and port numbers — but nothing about what is inside the packets. It is fast precisely because it does not look.
  • Layer 7 (Application) understands HTTP, HTTPS, gRPC, and WebSockets. It can read URL paths, request headers, cookies, and the request body. It can make intelligent routing decisions based on application-level content — but this comes at a small cost in CPU and latency.

Think of it this way: L4 is a mail sorter who reads only the address on the envelope. L7 is a sorter who opens every envelope and reads the letter inside. The L4 sorter is faster but has less information. The L7 sorter can make smarter decisions but takes longer per envelope.

L4 Load Balancing: Fast and Simple

An L4 load balancer operates at the TCP/UDP layer. When a client establishes a TCP connection, the load balancer picks a backend server and forwards the traffic to it — without ever inspecting what is inside the packets. In some implementations it simply rewrites the destination address on each packet, passing traffic through transparently; in others it terminates the client's connection and opens a fresh one to the backend. Either way, the key point is the same: the load balancer never reads the application data — the HTTP request, the database query, the game event — flowing through the connection.

L4 Load Balancing

Routes based on IP address and port only. Blind to application-level content. The load balancer forwards traffic without reading what is inside — acting as a fast, transparent pass-through for any TCP or UDP protocol.

Real-world examples of L4 load balancers: AWS Network Load Balancer (NLB), HAProxy in TCP mode, Google Cloud Network Load Balancer. AWS NLB is capable of handling millions of requests per second with ultra-low latency and is the right choice when you need to expose a non-HTTP service (like a database or game server) or when you need extreme throughput.

L7 Load Balancing: Smart Routing

An L7 load balancer operates at the HTTP/HTTPS layer. It terminates the client's connection, reads the full HTTP request (URL, headers, cookies, body), makes a routing decision based on that content, and then opens a new connection to the appropriate backend server. This two-connection model — client to load balancer, then load balancer to backend — is what enables content-aware routing. The extra step of parsing the request adds a small processing cost (typically under one millisecond), but the routing flexibility it unlocks is almost always worth it for HTTP-based systems.

L7 Load Balancing

Terminates the client's HTTPS connection, reads the full HTTP request, and routes based on URL path, headers, or cookies. Enables advanced traffic control: routing /api/* to one service and /static/* to another, or splitting a percentage of traffic to a new version (a canary deployment) to validate it before full rollout.
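
The routing rules described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how any particular load balancer is configured: the pool names (`api-service`, `static-service`, `web-service`) and the function names are hypothetical, and the canary split assumes a simple random roll per request.

```python
import random
from typing import Optional

# Hypothetical pool names; a real load balancer maps these to backend addresses.
ROUTES = [
    ("/api/", "api-service"),
    ("/static/", "static-service"),
]
DEFAULT_POOL = "web-service"


def choose_pool(path: str) -> str:
    """Return the backend pool for the first matching path prefix."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


def choose_version(canary_fraction: float, roll: Optional[float] = None) -> str:
    """Send roughly `canary_fraction` of requests to the canary version."""
    if roll is None:
        roll = random.random()  # uniform in [0.0, 1.0)
    return "canary" if roll < canary_fraction else "stable"
```

With `canary_fraction=0.05`, about 5% of requests would go to the new version. Neither function is possible at L4, because both depend on data (the URL path) that only an L7 load balancer reads.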

Real-world examples of L7 load balancers: AWS Application Load Balancer (ALB), nginx, HAProxy in HTTP mode, Envoy Proxy, Traefik, Cloudflare Load Balancing. In Kubernetes, an Ingress controller (often nginx or Traefik) is an L7 load balancer that routes external HTTP traffic to internal services based on path and host rules.

Comparing L4 vs. L7 at a Glance

| Dimension | L4 (Transport Layer) | L7 (Application Layer) |
| --- | --- | --- |
| Sees | IP address, port number, TCP/UDP | URL path, HTTP headers, cookies, request body |
| Routing basis | Source/destination IP and port | URL path, host, header, cookie, request method |
| SSL/TLS | Passthrough (encrypted packets forwarded as-is) | Termination (LB decrypts; backends get plain HTTP) |
| Protocol support | Any TCP/UDP protocol | HTTP, HTTPS, HTTP/2, gRPC, WebSockets |
| Speed | Faster: no packet inspection | Slightly slower: full request parsing required |
| Deployment strategies | Not supported | Canary, blue/green, A/B testing via routing rules |
| Examples | AWS NLB, HAProxy TCP mode | AWS ALB, nginx, Envoy, Traefik |
| Best for | Databases, game servers, max throughput | Web apps, microservices, API gateways |

The practical decision rule: if your traffic is HTTP or HTTPS and you are routing between microservices, start with an L7 load balancer — the routing power is almost always worth the negligible overhead. Reserve L4 for non-HTTP protocols or when you are optimizing for extreme throughput and every microsecond matters.

Load Balancing Algorithms

Once you have chosen which layer the load balancer operates at, it still needs to decide which server to send each request to. This is the job of the balancing algorithm.

Load Balancing Algorithms

Different algorithms optimize for different goals: even distribution, server capacity, connection count, or session affinity. The right choice depends on whether your servers are homogeneous, whether requests have significantly different processing costs, and whether users need to be consistently routed to the same server.

| Algorithm | How It Works | Best For | Drawback |
| --- | --- | --- | --- |
| Round Robin | Requests are sent to servers in rotation: S1, S2, S3, S1, S2, S3... | Identical servers, uniform request cost | Ignores server load; slow requests pile up on already-busy servers |
| Weighted Round Robin | Servers get traffic proportional to their assigned weight (e.g., S1 gets 2× the traffic of S2) | Mixed server capacities in the same pool | Weights must be set manually and updated as server capacity or load changes |
| Least Connections | Routes to the server with the fewest currently active connections | Variable request duration (AI inference, streaming, WebSockets) | Requires tracking live connection counts across all servers |
| Least Response Time | Routes to the server with the fastest average response time | Latency-sensitive workloads with heterogeneous backends | Requires response time monitoring with low-latency feedback |
| IP Hash | Hashes the client IP to deterministically pick a server, so the same IP always hits the same server | Legacy apps with server-side session state | Breaks with shared NAT; a server failure redistributes all sessions for affected clients |
| Random | Picks a backend server uniformly at random | Simple, stateless scenarios when servers are truly identical | No adaptation to load; occasional hot spots under statistical variance |
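
Three of the algorithms in the table are small enough to sketch directly. The following is a minimal Python illustration, not a production implementation: the function names are made up for this example, and `ip_hash` assumes a plain modulo over a SHA-256 digest, which is only one of several ways to hash a client address.

```python
import hashlib
import itertools


def round_robin(servers):
    """Return an iterator that yields servers in a fixed rotation."""
    return itertools.cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections.

    `active` maps server name -> current connection count.
    """
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a server deterministically."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Note that with a modulo-based hash, changing the number of servers changes the mapping for many clients, which is one reason IP Hash interacts poorly with server failures.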

Health Checks: Removing Failing Servers Automatically

Every production load balancer continuously monitors whether its backend servers are healthy. When a server fails a health check, the load balancer stops sending it traffic — automatically, without human intervention.

Health checks come in two flavors:

  • L4 health check: The load balancer attempts to establish a TCP connection to the server's port. If the connection succeeds, the server is considered healthy. Fast and simple, but a server can accept TCP connections while its application is crashed or deadlocked.
  • L7 health check: The load balancer sends an HTTP GET /health request and expects a 200 OK response. Slower but more accurate — the application itself confirms it is ready to serve traffic. Most production systems use L7 health checks.
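
An L4 health check amounts to little more than attempting a TCP connection. The sketch below uses Python's standard `socket` module; the function name and the one-second default timeout are assumptions chosen for illustration.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

This check succeeds as soon as the operating system accepts the connection, which is exactly why it cannot detect an application that is deadlocked behind an open port.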

A good /health endpoint should verify not just that the process is running, but that its critical dependencies are reachable — it should check the database connection, cache connection, and any other service the application cannot function without. If any dependency is unavailable, return 503 Service Unavailable so the load balancer removes the instance from the pool.

One important caveat: if a shared dependency (like the database) goes down, all instances will fail their health checks simultaneously. The load balancer will remove the entire pool, making the service completely unavailable. This is technically the correct behavior — the instances genuinely cannot serve requests — but it means your health check design has a direct impact on how failures propagate.

Some teams manage this more precisely by using two distinct probe types. A readiness check determines whether an instance should receive traffic: if it fails, the instance is removed from the load balancer pool, but the process keeps running. A liveness check determines whether the process itself is still alive: if it fails, the orchestrator (such as Kubernetes) restarts the process. For example, if a database connection pool is temporarily exhausted, a failing readiness check gracefully pulls the instance out of rotation until it recovers. A failing liveness check, by contrast, kills and restarts the process — appropriate for deadlocked or unresponsive processes, but too heavy-handed for a transient dependency issue. Separating the two gives you finer control over exactly how your system responds to different failure modes.
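
The difference between the two probes can be sketched as two small functions. This is a simplified illustration under stated assumptions: the `checks` mapping and the dependency names passed to it are hypothetical, and a real service would expose these results through its web framework's HTTP handlers.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, per-dependency report) for a readiness probe."""
    report = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        report[name] = "ok" if ok else "unavailable"
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report


def liveness() -> int:
    """Return 200 whenever the process is able to answer at all."""
    return 200
```

Keeping dependency checks out of `liveness` is the design choice that matters here: a database outage then takes instances out of rotation without also triggering restarts.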

Rate Limiting: Protecting Your System from Overload

Load balancers distribute traffic across servers, but they do not control how much traffic any one client can send. Without limits, a single misbehaving client — or an attacker — can flood your servers with requests, starving legitimate users of capacity. Rate limiting is the mechanism that enforces per-client traffic quotas.

Rate limiting is applied at the edge — typically in the load balancer, API gateway, or a middleware layer — before requests reach your application servers. The most widely used algorithm is the Token Bucket.

Token Bucket Rate Limiting

Each client gets a virtual bucket that holds tokens up to a maximum capacity. Tokens are added at a fixed refill rate (e.g., 10 per second). Each request consumes one token. If the bucket has tokens, the request is allowed. If the bucket is empty, the request is rejected with HTTP 429 Too Many Requests. The bucket capacity controls how large a burst is permitted; the refill rate controls the sustained throughput limit.

How to implement Token Bucket with Redis: Store two values per client key: tokens (current token count) and last_refill (Unix timestamp of last refill). On each request: compute elapsed time since last_refill, add elapsed_seconds × refill_rate tokens (capped at capacity), then check if tokens >= 1. If yes, deduct 1 token, persist both values, and allow the request. If no, reject with 429. Use a Redis Lua script to make this check-and-deduct atomic, preventing race conditions under concurrent load.
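
The refill-then-deduct steps above can be sketched in plain Python. To keep the example self-contained, it stores per-client state in an in-process dictionary that stands in for Redis, so it is not safe for multiple instances and does not show the Lua script. The class name and the injectable `clock` parameter are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Token bucket with per-client state (tokens, last_refill)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        # key -> (tokens, last_refill); an in-memory stand-in for Redis
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        # New clients start with a full bucket.
        tokens, last_refill = self._state.get(key, (self.capacity, now))
        elapsed = max(0.0, now - last_refill)
        tokens = min(self.capacity, tokens + elapsed * self.refill_rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        # Persist the refilled count and timestamp whether or not we allowed.
        self._state[key] = (tokens, now)
        return allowed
```

A caller would translate a `False` result into a 429 response. In a distributed setup, the body of `allow` is the part that would need to run atomically inside Redis.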

HTTP 429 and Retry-After

When a rate limit is exceeded, always return HTTP 429 Too Many Requests — the standard status code for rate limiting. Include a Retry-After header indicating how many seconds the client should wait before retrying, so that well-behaved clients back off automatically rather than immediately hammering the server again.

HTTP/1.1 429 Too Many Requests
Retry-After: 5
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1720000005

One subtlety: if many clients all hit the rate limit at the same moment and receive an identical Retry-After: 5, they may all retry simultaneously 5 seconds later — a phenomenon called the thundering herd. Instead of relieving pressure on your servers, the synchronized retry wave recreates it. The standard mitigation is for clients to add a small random delay (jitter) to their retry time — for example, retry after Retry-After + random(0, 1) seconds. This spreads retries across time and prevents a synchronized flood. Some well-designed APIs return slightly randomized Retry-After values to protect against clients that do not implement jitter themselves.
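
On the client side, jitter can be a one-line addition. This is a minimal sketch; the function name and the default of up to one extra second are assumptions that mirror the example in the text.

```python
import random


def retry_delay(retry_after: float, max_jitter: float = 1.0) -> float:
    """Return the server's Retry-After value plus a random extra delay."""
    return retry_after + random.uniform(0.0, max_jitter)
```

A client that received `Retry-After: 5` would sleep for `retry_delay(5)` seconds, somewhere between 5 and 6, so that many clients limited at the same moment do not all retry together.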

Other Rate Limiting Algorithms

Token Bucket is the most widely used, but understanding the alternatives helps you choose the right tool for each situation.

| Algorithm | How It Works | Allows Bursts? | Best For | Key Drawback |
| --- | --- | --- | --- | --- |
| Token Bucket | Tokens added at fixed rate; each request consumes one token; requests rejected when bucket empty | Yes, up to bucket capacity | APIs, AI inference endpoints, per-user limits | Two parameters to tune; needs Redis for distributed accuracy |
| Leaky Bucket | Requests enter a queue and are processed at a constant fixed rate; requests that exceed the queue capacity are dropped | No, output rate is constant | Network traffic shaping, smoothing output to a downstream service | No burst accommodation; legitimate bursty clients get throttled unfairly |
| Fixed Window Counter | Count requests in fixed time windows (e.g., every 60 seconds); reject when count exceeds limit | Yes, at window boundary | Simple, low-traffic scenarios where approximate enforcement is acceptable | Boundary attack: a client can send 2× the allowed rate by timing requests around window boundaries |
| Sliding Window Counter | Hybrid: uses a weighted count from the previous window to smooth boundary effects | Partial (smoothed) | Most production HTTP APIs; balances accuracy with memory efficiency | Slightly approximate; still not perfectly precise |
| Sliding Window Log | Stores the timestamp of every request; counts requests in the trailing window on each check | No (precise enforcement) | When exact precision is required and traffic volume is low | High memory cost: stores a timestamp per request per client |

The boundary attack problem with Fixed Window is worth understanding: if your window is 12:00:00–12:01:00 with a limit of 100 requests, a client can legally send 100 requests at 12:00:59 and another 100 at 12:01:00, when the next window opens. That is 200 requests in about one second, yet fully within the rules. The Sliding Window Counter solves this by blending the previous window's count based on how far through the current window you are, making boundary exploitation impractical.
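
The blending step can be written as a weighted sum of two counters. The sketch below assumes the common formulation in which the previous window's count is weighted by the fraction of it that still overlaps the trailing window; the function and parameter names are illustrative.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_in_window, window_seconds):
    """Estimate requests in the trailing window from two fixed-window counters."""
    # Fraction of the previous window still covered by the trailing window.
    overlap = 1.0 - (elapsed_in_window / window_seconds)
    return prev_count * overlap + curr_count


def allow_request(prev_count, curr_count, elapsed_in_window, window_seconds, limit):
    """Allow the request only if the weighted estimate is still under the limit."""
    estimate = sliding_window_estimate(
        prev_count, curr_count, elapsed_in_window, window_seconds
    )
    return estimate < limit
```

Applied to the boundary example: at 12:01:00 the previous window holds 100 requests and zero seconds have elapsed, so the estimate is 100 and the second burst is rejected. Thirty seconds later the estimate has decayed to 50, and requests are allowed again.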

The Full Traffic Path: How It All Fits Together

In a production system, these components stack in layers. A request passes through each layer in sequence before reaching your application code. Understanding this path helps you know where each concern belongs: routing logic at the load balancer, rate limiting at the gateway, business logic in the application.

The key insight from this path: rate limiting belongs at the load balancer or API gateway layer, not inside your application code. Placing it in application code means every request that hits the rate limit has already passed through your web framework and consumed application resources. By enforcing limits at the edge, you reject over-limit requests before they touch your application at all.

AI Agents and Load Balancing: What to Watch For

AI agents typically generate application code; they do not configure infrastructure unless asked. When you prompt an agent to build a REST API, it will usually write working route handlers and business logic, but with no awareness that the service will eventually run as multiple instances behind a load balancer. This creates a predictable set of omissions that you, as the architect, need to address explicitly in your prompts, your design documents, or both.

| What AI Does by Default | What You Need to Specify |
| --- | --- |
| Stores session state in process memory (e.g., a JavaScript Map) | "Session state must live in Redis, not in-process memory. We run multiple instances behind a load balancer" |
| Writes application-level rate limiting code that only tracks state per-process | "Rate limiting must use a shared Redis store so all instances enforce the same limit per client" |
| Returns no X-RateLimit headers on API responses | "Return X-RateLimit-Limit and X-RateLimit-Remaining headers on every API response, and a Retry-After header on 429 responses" |
| Health endpoint just returns 200 OK with no dependency checks | "The /health endpoint must verify DB connectivity and Redis connectivity and return 503 if either is unreachable" |
| Hardcodes the number of instances or assumes a single server | "Design for horizontal scaling: no per-instance in-memory state that other instances need, all state externalized to Redis or the database" |

The most important constraint to establish upfront is statelessness. A stateless application server holds no user-specific data in memory. Every request from the same user can be served by any server instance, because all shared state (sessions, rate limit counters, caches) lives in an external store like Redis.

Consider what happens without this: a user logs in, and their session is stored in Server 1's memory. Their next request is routed by the load balancer to Server 2, which has no record of their session, so they appear logged out. This kind of bug is hard to reproduce locally (where you run only one instance) and surfaces only once multiple instances are running in production. Stateless servers avoid this class of problem entirely: you can add or remove instances without affecting users, because no server holds state that another server would be missing.
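
The login scenario is easy to reproduce in a few lines. In this sketch a plain Python dictionary stands in for an external store such as Redis, and the `AppInstance` class, its methods, and the session values are all hypothetical names for illustration.

```python
class AppInstance:
    """One application server; `store` is any dict-like session store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # where this instance keeps session data

    def login(self, session_id, user):
        self.store[session_id] = user

    def current_user(self, session_id):
        return self.store.get(session_id)


# Stateless design: both instances point at the same external store.
shared_store = {}  # stand-in for an external store such as Redis
server1 = AppInstance("server-1", shared_store)
server2 = AppInstance("server-2", shared_store)

server1.login("sess-abc", "alice")
# server2.current_user("sess-abc") now finds the session created on server1.
```

If each instance were instead constructed with its own private dictionary, `server2.current_user("sess-abc")` would return `None`, which is the "appears logged out" bug described above.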

When prompting an agent to build any user-facing feature, include: "This service runs as multiple stateless instances behind a load balancer. All session state, rate limit counters, and shared data must live in Redis. Do not store anything in process memory that needs to be visible to other instances."

Summary

| Concept | The Rule |
| --- | --- |
| L4 load balancing | Routes by IP/port: fast, protocol-agnostic, blind to content. Use for databases, game servers, or maximum throughput scenarios. |
| L7 load balancing | Routes by URL path, headers, cookies: smart, HTTP-aware, enables microservice routing, SSL termination, and canary deployments. Default choice for web applications. |
| Round Robin | Simple default for homogeneous servers with uniform request cost. |
| Least Connections | Better than Round Robin when requests have variable processing time (AI inference, streaming). Servers that finish quickly naturally receive more traffic. |
| IP Hash | Sticky sessions: same client always hits the same server. Useful for legacy stateful apps, but breaks with shared NAT. |
| Health checks | L7 health checks verify the application is functional, not just that the port is open. Check all critical dependencies in your /health endpoint and return 503 if any are unreachable. |
| Token Bucket | Rate limiting algorithm: bucket capacity controls burst size, refill rate controls sustained throughput. Use Redis for distributed accuracy. |
| HTTP 429 | Return 429 Too Many Requests with a Retry-After header when rate limits are exceeded. Clients should add jitter to retry delays to avoid synchronized retry floods. |
| Sliding Window Counter | More accurate than Fixed Window for rate limiting; avoids the boundary-exploit problem where a client can send 2× the allowed rate in about one second. |
| Statelessness | Servers behind a load balancer must hold no in-process user state. Externalize all shared state to Redis. AI agents don't enforce this, so you must specify it explicitly. |

Load balancers and rate limiters are infrastructure — they sit outside your application code and are configured rather than programmed. AI agents will not add them automatically. The engineer's job is to design the traffic layer upfront: how many instances, which balancing algorithm, where rate limits are enforced, and how health checks are defined. Get these right in the spec before prompting, and the agent's generated application code will slot into the architecture cleanly. Skip them, and you will have application code that works perfectly on one server and breaks as soon as you try to run two.
