Coordination: Distributed Locking and Leader Election
In a single-server system, coordination is easy. One process holds a database connection, one scheduler runs cron jobs, and one thread updates a shared counter. The operating system serializes all of this for you — you never have to think about it.
Distributed systems break this assumption. When ten servers run the same application code simultaneously, they can all attempt the same operation at the same time: process the same scheduled job, decrement the same inventory counter, or make a decision about the cluster's state. Unlike threads on a single machine, these servers share no memory and can fail independently — a server can crash, pause, or lose its network connection while the rest of the cluster keeps running. Without explicit coordination, these concurrent and partially-failed operations corrupt each other in ways that are often silent and difficult to diagnose.
This section covers the two coordination primitives that appear in every serious distributed system:
- Distributed Locking — ensuring only one node can act on a shared resource at a time
- Leader Election — ensuring exactly one node is designated as the coordinator for a cluster
Both problems sound simple. Both have surprising failure modes that only surface in production.
Distributed Locking#
Why a Single-Server Solution Doesn't Work#
On a single server, you protect shared resources with mutexes or database row locks. These work because the lock state lives in one place — in memory or in a single database. When you scale to multiple servers, those local locks are invisible to every other machine. Each server thinks it holds exclusive access; in reality, all of them do.
The classic symptom is the double-execution bug. Imagine a cron job that runs every night to send invoices. If two servers both check a shared work queue at the same moment, both may pick up the same job, and every invoice gets sent twice. Or consider inventory management: two users buy the last item simultaneously. Both reads see quantity = 1, both writes set quantity = 0 — but neither prevented the other from completing the purchase. The item is oversold and no error is ever raised.
How Distributed Locks Work#
A distributed lock is a piece of state stored in a shared, external service — not in any individual server's memory. The protocol is straightforward:
- To acquire the lock, a server writes a key to the shared service using an atomic write-if-absent operation.
- If the write succeeds, the server holds the lock. If it fails (the key already exists), another server holds it — wait and retry.
- To release the lock, the server deletes the key.
- Every lock has a TTL (time-to-live): if the server that holds the lock crashes before releasing it, the key expires automatically so the lock doesn't remain held forever.
The critical requirement is that the acquire operation is atomic — there must be no gap between "check if the key exists" and "write the key." If those were two separate steps, two servers could both observe the key as absent and then both write it, each believing they hold the lock. On Redis, this is handled with a single SET key value NX PX ttl command (NX = write only if not exists; PX ttl = expire after milliseconds). Redis executes this as one indivisible operation due to its single-threaded command processing model.
Distributed Lock: Acquire and Release
A distributed lock is implemented by atomically writing a key to a shared store. Only the first server to write the key holds the lock. The TTL ensures automatic cleanup if the lock holder crashes. The token ensures only the original acquirer can release the lock.
The Hidden Danger: Process Pauses and Fencing Tokens#
Here is a subtle but critical failure scenario that TTLs alone cannot prevent. It's called the paused process problem, and it arises whenever a lock holder is delayed long enough for its TTL to expire — even though the process is still running:
- Server 1 acquires a lock with a 10-second TTL and begins a long database write.
- Server 1's Java virtual machine triggers garbage collection and pauses execution for 12 seconds (long GC pauses are a real and well-documented occurrence in production JVM systems).
- The lock's 10-second TTL expires. Server 2 acquires the lock legitimately.
- Server 2 begins its own write to the same database record.
- Server 1's GC pause ends. It resumes where it left off — it still believes it holds the lock — and overwrites Server 2's work.
This results in silent data corruption: no error is thrown, no exception is logged, and Server 2's changes are simply gone.
The correct solution is a fencing token: a strictly monotonically increasing number issued with each lock grant. Every write to the protected resource must include the fencing token, and the resource rejects any write whose token is lower than the highest it has already accepted. This way, a stale lock holder that wakes up late cannot overwrite newer work — its token is too old, and the storage layer refuses it.
etcd provides fencing tokens naturally: etcd maintains a global revision counter that it increments with every write. When a lock is acquired, the client receives the revision at which it took ownership — this revision serves as the fencing token. Because revisions only ever increase, any storage layer can compare two tokens and know which one is newer. ZooKeeper provides the same guarantee through its zxid (ZooKeeper transaction ID) — a globally monotonic identifier that increases with every transaction in the cluster.
Redis does not natively provide fencing tokens. The lock value is typically a random UUID, and random strings have no meaningful ordering — a storage layer cannot tell whether one UUID is "newer" than another. This is why Redis is appropriate for efficiency locks (preventing wasted duplicate work) but not for correctness locks (preventing data corruption in financial or inventory systems).
| Tool | Mechanism | Fencing Token? | Best For |
|---|---|---|---|
| Redis | SET NX PX (single-node) or Redlock (multi-node) | No — random UUID tokens are not ordered | Efficiency locks: distributed cron, rate limiting, cache stampede prevention |
| etcd | Lease + CAS transaction; lock key tied to a time-bounded lease | Yes — creation revision is globally monotonic | Correctness locks: schema migrations, job schedulers, coordinating stateful services |
| ZooKeeper | Ephemeral sequential nodes; lowest sequence number holds lock | Yes — ZooKeeper epoch + node sequence | Correctness locks: same as etcd; also common in Hadoop/Kafka ecosystems |
Implementing a Distributed Lock with etcd#
etcd's lock is built on two primitives: a lease (a TTL-bounded session that etcd tracks) and an atomic compare-and-swap transaction (write a key only if it doesn't exist). If the process holding the lock crashes and stops sending heartbeat keep-alives, etcd expires the lease and automatically deletes the lock key — other waiters are immediately notified via a watch.
# Python: etcd distributed lock using the etcd3 client
import etcd3
client = etcd3.client()
# Acquire the lock — etcd creates an internal lease (TTL=10s) and atomically
# writes the lock key only if it doesn't exist. If the process crashes without
# calling lock.release(), etcd expires the lease and frees the lock automatically.
lock = client.lock('my-resource', ttl=10)
acquired = lock.acquire(timeout=5) # wait up to 5 seconds to take the lock
if acquired:
try:
# The lock's underlying etcd revision serves as the fencing token.
# Pass it to your storage layer with every write so stale holders
# can be detected and rejected.
process_job()
finally:
lock.release() # explicit release is faster than waiting for TTL expiry
In ZooKeeper's model, the lock is acquired by creating an ephemeral sequential node under a shared path (e.g., the client writes /locks/resource- and ZooKeeper appends a unique sequence number, producing /locks/resource-0000000042). The client with the lowest sequence number across all nodes under that path holds the lock. Every other client watches the node with the sequence number just below its own — when that predecessor node is deleted (released or session-expired), the next client is notified and checks whether it now holds the lowest number. This "chain watching" design avoids the thundering herd problem: instead of every waiting client simultaneously storming the lock service the moment the leader releases, each watcher only wakes up when its immediate predecessor releases, spreading out the notifications.
Leader Election#
Why Distributed Systems Need a Leader#
Many coordination tasks require a single authoritative decision-maker. Consider:
- A database cluster where multiple replicas must agree on who is the primary — the single node authorized to accept writes.
- A job scheduler where exactly one node should assign tasks to workers, not all nodes simultaneously.
- A Kafka broker cluster where each partition needs exactly one "partition leader" to sequence incoming messages.
Without a designated leader, distributed systems suffer from split-brain — a condition where multiple nodes simultaneously believe they are in charge, each making independent decisions. The result is conflicting writes, duplicate work, and inconsistent state that can be extremely difficult to repair.
Leader election is the mechanism by which nodes in a cluster agree on exactly one leader. All other nodes become followers that defer to the leader's authority.
Leader Election: From Cluster Start to Steady State
When a cluster starts (or when the current leader dies), nodes run an election to designate a new leader. The leader handles all coordination work; followers replicate the leader's state. If the leader becomes unreachable, a new election begins automatically.
The Raft Algorithm: How Modern Leader Election Works#
Raft is the consensus algorithm behind etcd, Consul, and CockroachDB. It was designed to be understandable — a deliberate contrast to the earlier Paxos algorithm, which is notoriously difficult to reason about correctly.
Raft divides time into numbered terms. Each term begins with an election. Term numbers act as logical clocks: a node that learns a higher term number immediately acknowledges it's out of date and steps down.
Every node is in one of three states:
| State | What It Does | How It Transitions |
|---|---|---|
| Follower | Receives heartbeats and log entries from the leader; votes in elections | Starts as follower; becomes Candidate if no heartbeat is received within the election timeout |
| Candidate | Running for leader; requests votes from all other nodes in the current term | Becomes Leader if it wins a majority vote; reverts to Follower if it loses or hears from a valid leader |
| Leader | Accepts all writes; replicates log entries to followers; sends periodic heartbeats to reset their timers | Steps down to Follower if it discovers a higher term number |
The election timeout is randomly chosen by each follower (typically 150–300ms). When a follower hasn't heard from a leader by its timeout, it assumes the leader is dead, increments the term, transitions to Candidate, and starts an election. Randomization prevents multiple nodes from starting elections simultaneously and splitting votes indefinitely.
Once elected, the leader sends heartbeat messages to all followers every 50ms or so. Followers that receive heartbeats reset their election timers. If the leader dies, followers stop receiving heartbeats, eventually time out, and start a new election.
Log replication is how Raft keeps all nodes consistent: clients send writes only to the leader. The leader appends the entry to its local log, then sends it to all followers in parallel. Once a majority of followers confirm they've written the entry, the leader commits it — applies it to its state machine and acknowledges the client. This guarantees that any newly elected leader will have all committed entries, because it must have won a vote from at least one node in the previous majority.
ZooKeeper: Battle-Tested Coordination for the Hadoop Ecosystem#
ZooKeeper uses a similar but distinct protocol called ZAB (ZooKeeper Atomic Broadcast). Like Raft, ZAB elects a single leader and replicates all state changes through that leader in order. The key difference is that ZAB was designed specifically for high-throughput broadcast of state changes, and it includes an explicit synchronization phase when a new leader is elected: before accepting new writes, the fresh leader ensures all followers have caught up to the same state as the previous leader. This prevents any data from being lost or replayed out of order during leadership transitions.
ZooKeeper's data model is a hierarchical tree of znodes — nodes in a file-system-like structure. An ephemeral znode exists only for the lifetime of the client session that created it: if the client disconnects, ZooKeeper automatically deletes it. This property is the foundation of both distributed locking and leader election in ZooKeeper.
Leader election works by having each candidate create an ephemeral sequential znode under a common path (e.g., a client writes /election/candidate- and ZooKeeper produces /election/candidate-0000000001). The candidate with the lowest sequence number is the leader. Every other candidate watches the node with the next-lower sequence number — when that node disappears, the watcher checks whether it is now the leader. This chain-watching approach prevents the thundering herd problem that would arise if all candidates simultaneously watched a single "leader" node and rushed to claim leadership at the same instant.
| etcd / Raft | ZooKeeper / ZAB | |
|---|---|---|
| Protocol | Raft — log-based consensus with randomized timeouts | ZAB — atomic broadcast; leader-based ordered replication with explicit synchronization phase |
| Lock primitive | Lease key with CAS; fencing via creation revision | Ephemeral sequential node; fencing via epoch + sequence |
| Election primitive | Built-in election API via concurrency package | Ephemeral sequential nodes under a common path |
| Ecosystem | Kubernetes, Consul, CoreDNS, Vitess | Hadoop, Kafka (legacy), HBase, older distributed systems |
| Operational complexity | Simpler — designed to be embedded; used directly by Kubernetes | Requires separate ZooKeeper ensemble; historically complex to operate |
| Fencing tokens | Creation revision number — globally monotonic | zxid + epoch — globally monotonic |
Real-World Example: Kubernetes#
Kubernetes is one of the most widely deployed users of leader election in practice. Its controller manager and scheduler — the components that continuously reconcile the cluster's desired state with its actual state — each run as multiple replicas for high availability. But if all replicas actively reconcile simultaneously, they create conflicting decisions: one replica might evict a pod while another simultaneously schedules a replacement for the same pod.
Kubernetes solves this with leader election backed by etcd. Each replica races to create and hold a Lease object in the Kubernetes API (which persists its state in etcd). Only the replica that holds the lease is the active leader; it performs all reconciliation work. The others passively watch the lease and wait. If the active leader crashes or becomes unresponsive, its lease is not renewed and expires within seconds — another replica acquires it and takes over, typically with no human involvement and no downtime visible to end users.
# You can inspect the current leader of the controller manager:
kubectl -n kube-system get lease kube-controller-manager -o yaml
# Output shows the holder identity and renewTime:
# spec:
# holderIdentity: master-node-1_<uuid>
# leaseDurationSeconds: 15
# renewTime: "2026-03-12T10:23:44Z"
Coordination in AI Agent Systems#
The same coordination problems that appear in database clusters surface in multi-agent AI systems — just at a different level of abstraction.
A typical multi-agent coding assistant has an orchestrator (analogous to a Raft leader) that decomposes tasks and delegates to specialized worker agents. Without coordination:
- Two agents may concurrently read and patch the same file, with one agent's changes silently overwriting the other's.
- Multiple agents may simultaneously call a rate-limited API (e.g., an LLM provider), exhausting the quota and triggering cascading errors.
- A stale agent — one that paused due to a rate limit or network timeout — may resume and submit results based on an outdated view of the shared state, corrupting the work done while it was paused.
Coordination in Multi-Agent Systems
Multi-agent AI workflows face the same race conditions and split-brain risks as distributed databases. The patterns — orchestrator leadership, shared state locking, fencing via version numbers — are direct applications of the distributed coordination principles covered in this section.
A design principle from distributed systems applies directly here: agents should be designed as idempotent operations wherever possible. If an agent's action is accidentally executed twice — due to a lock failure, a retry, or a crashed orchestrator re-assigning the same task — the end result should be identical to running it once. This is the same idempotency principle that prevents double-charges in payment systems: a payment marked as "processed" is ignored if submitted again. Build idempotency into agent tool implementations from the start; retrofitting it into a live system is costly and error-prone.
Summary#
| Concept | Core Problem | Key Mechanism | Main Risk |
|---|---|---|---|
| Distributed Lock (Redis) | Race conditions on shared resources | SET NX PX with unique token and TTL | No fencing tokens — paused processes can corrupt data after TTL expiry |
| Distributed Lock (etcd) | Race conditions with crash-safe recovery | Lease + CAS; lock key tied to TTL-bounded session | Requires etcd cluster; lock acquisition adds latency to every critical path |
| Distributed Lock (ZooKeeper) | Race conditions with FIFO ordering | Ephemeral sequential nodes; lowest sequence wins | More operationally complex; thundering herd avoided by sequential watching |
| Fencing Tokens | Paused lock holders corrupting data post-expiry | Monotonic token included in every write; storage rejects stale tokens | Requires storage layer cooperation to enforce — not automatic |
| Leader Election (Raft) | Split-brain and duplicate coordination work | Randomized timeouts + majority voting + term numbers | Brief unavailability during elections; election storms on flapping networks |
| Leader Election (ZAB) | Ordered state changes + split-brain prevention | Ephemeral sequential nodes + atomic broadcast | Minority partition goes read-only; odd cluster sizes required |
| Agent Coordination | Conflicting concurrent agent actions | Orchestrator pattern + versioned shared state + idempotent tools | Orchestrator crashes stall pipeline; rate-limit pauses mimic GC pauses |
The foundational insight of this section: coordination problems do not get easier as systems scale — they get more expensive to get wrong. A double-executed invoice at 100 users is embarrassing. At 10 million users, it's a regulatory incident. Design your coordination strategy deliberately: use Redis for efficiency locks where occasional duplicates are tolerable and harmless, use etcd or ZooKeeper with fencing tokens for correctness locks where data integrity is non-negotiable, and apply the same discipline to multi-agent AI systems as you would to any distributed service. The patterns are the same; only the vocabulary changes.
Sources:
- Distributed Locking: A Practical Guide — architecture-weekly.com
- How to do distributed locking — Martin Kleppmann
- Distributed Locks with Redis — Redis Documentation
- Distributed Locking Best Practices: Redis, Zookeeper, Etcd Explained — scalewithchintan.com
- Leader Election & Fencing in Real Systems — codefarm0.medium.com
- Leader Election in Distributed Systems: A Comprehensive Guide — medium.com
- 5 Best Leader Election Algorithms for System Design — designgurus.io
- What Is etcd? — IBM Think