Database Replication

In the previous section, we established that databases are much harder to scale horizontally than application servers — and that the first step before sharding is almost always to add read replicas. This section explains exactly what that means, how it works, and what can go wrong.

Database replication is the process of keeping identical copies of your data on multiple database servers. The primary goal is twofold: improve availability (so your system survives a server failure) and improve read capacity (so read queries can be spread across multiple machines instead of hammering a single one).

It is, by a wide margin, the most overlooked database scaling technique — most developers skip it entirely and reach straight for sharding without ever trying replication first. In practice, a well-configured primary with a few read replicas can handle enormous traffic. The vast majority of production systems never need to go further.

The Primary/Replica Setup#

The dominant model for relational databases is called single-leader replication, also known as primary/replica or leader/follower. The rules are straightforward:

There is exactly one primary (leader). All writes — every INSERT, UPDATE, DELETE — must go to the primary.
There are one or more replicas (followers). Each replica receives a continuous stream of changes from the primary and applies them in the same order. Replicas are read-only from the application's perspective.
Read queries can be sent to any replica (or the primary). Write queries can only go to the primary.

Primary/Replica Architecture

The primary receives all writes and streams changes to replicas. The application routes writes to the primary and reads to replicas. This is the standard starting point for database scaling.

Rendering diagram...

How Replication Actually Works Under the Hood#

Databases use a low-level change log — called the Write-Ahead Log (WAL) in PostgreSQL and the binary log (binlog) in MySQL — as the foundation for replication.

Every change to the database (every INSERT, UPDATE, DELETE, even schema changes) is first written to this sequential log before it modifies any actual data pages. This log is the single source of truth for what happened and in what order. Replication works by shipping this log to replicas, which then replay it in the exact same order.

When you add a new replica, the setup follows four steps:

Take a consistent snapshot of the primary's data at a known log position (without locking the database in production systems).
Copy the snapshot to the new replica.
The replica connects to the primary and requests all log entries that occurred after the snapshot position.
The replica catches up on the backlog, then continues receiving the live log stream in real time.

In PostgreSQL, each position in the log is identified by a Log Sequence Number (LSN) — a monotonically increasing byte offset. In MySQL, the equivalent is a binlog coordinate (file name + byte offset) or, in modern setups, a GTID (Global Transaction Identifier) that is independent of any specific file. These markers let replicas know exactly where they are in the stream and resume correctly after network interruptions.

Synchronous vs. Asynchronous Replication#

The most important configuration choice in any replication setup is whether replication is synchronous or asynchronous. This choice directly controls the trade-off between write latency and data durability. The diagrams below compare both modes for a single write request:

Asynchronous

Rendering diagram...

Synchronous

Rendering diagram...

	Asynchronous (default)	Synchronous
Write latency	Low — primary acks the client immediately	Higher — primary waits for at least one replica to confirm
Data durability	Possible data loss if primary crashes before the replica receives the change	Zero data loss — a confirmed replica is always available
Availability	High — a slow or unavailable replica never blocks the primary	Lower — if the synchronous replica goes down, writes on the primary block until it recovers
Default in	PostgreSQL, MySQL, most cloud-managed databases	Must be explicitly configured in all major databases
Best for	Most web applications where occasional data loss of milliseconds is acceptable	Financial transactions, inventory counts, anything where data loss has serious consequences

The practical default is asynchronous replication. For most web applications, the risk of losing a few milliseconds of writes during a catastrophic primary failure is an acceptable trade-off for the performance and simplicity benefits. The key phrase is "catastrophic primary failure" — not every request, just the rare case of a server crash with unsynced writes.

Semi-synchronous replication (MySQL's middle ground) waits for exactly one replica to acknowledge receiving the change — but not fully applying it. This costs exactly one network round trip per write and eliminates the risk of complete data loss, without requiring all replicas to be fully synchronous. It is a reasonable default for MySQL production setups that need stronger durability guarantees without the full write-latency cost of synchronous replication.

PostgreSQL's synchronous_commit setting exposes even finer control, letting you configure per-transaction what "confirmed" means — from local (only local disk flush) to remote_apply (replica has applied the change). This allows critical transactions (e.g., a payment confirmation) to use a stricter setting while routine traffic uses the fast default.

Replication Lag: The Invisible Tax#

With asynchronous replication, replicas are always slightly behind the primary. This delay — from when a write commits on the primary to when it becomes visible on a replica — is called replication lag.

In a healthy same-region setup, lag is typically measured in milliseconds. Under stress (high write volume, large batch operations, resource-starved replicas, or cross-region distance), lag can grow to seconds or even minutes.

Why Lag Grows#

Network distance: A replica in a different region is always farther away. Cross-region replication inherently has higher latency than co-located nodes.
Write volume: If the primary generates writes faster than the replica can apply them, a backlog accumulates.
Large transactions: A single UPDATE that modifies 10 million rows produces a large burst of replication work that causes a temporary lag spike.
Replica under load: A replica serving heavy read traffic is also trying to apply the replication stream. Read-heavy replicas can fall behind because they have fewer CPU and I/O resources available for applying log events.

Three Consistency Problems Caused by Lag#

Replication lag isn't just a performance number — it creates correctness problems in your application if you're not careful.

Replication Lag Consistency Anomalies

Three classes of consistency bugs that appear when reads are routed to lagged replicas. Each has a specific mitigation. Understanding these anomalies is essential before routing production traffic to replicas.

Rendering diagram...

Read-Your-Writes (also called read-after-write consistency) is the most commonly encountered problem. A user writes data (e.g., updates their profile photo), then immediately reads from a lagged replica that hasn't yet received the change — they see the old photo. The fix: after a user performs a write, route their reads for that specific resource to the primary for a short window (typically 1–5 seconds), then fall back to replicas. Alternatively, you can always read the current user's own profile from the primary and everything else from replicas.

Monotonic Reads occur when successive reads from the same user land on different replicas with different lag levels. The user reads a post, sees 47 comments. Refreshes, gets routed to a more-lagged replica, sees 43 comments. The comment count appeared to decrease — time ran backwards. The fix: pin each user session to a specific replica using consistent hashing or a session-level header, so all reads in a session go to the same node.

Consistent Prefix Reads happen when writes arrive at a replica out of causal order. Imagine a forum thread where User A posts a question, and User B posts a reply that references it. If the replica receives User B's reply before User A's original question, a reader sees the reply without the post it responds to — a causally impossible sequence. This is most common in distributed databases with multiple leaders or partition-level logging. The fix is to ensure causally related writes always route through the same node so their order is preserved.

Monitoring Replication Lag#

Always instrument lag before routing significant traffic to replicas.

PostgreSQL:

-- On the primary: see each replica and its current lag
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On a replica: lag in seconds since last replayed transaction
SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds;

MySQL:

-- On a replica: check Seconds_Behind_Source (MySQL 8.0+)
SHOW REPLICA STATUS\G

Set up an alert when lag_seconds exceeds your tolerance threshold (commonly 5–30 seconds for web apps, 0 for financial systems using synchronous replication).

Failover: Surviving a Primary Failure#

The second major benefit of replication (beyond read scaling) is high availability — the ability to survive a primary database failure with minimal downtime. When the primary goes down, a replica can be promoted to take its place. This process is called failover.

Automated Failover

When the primary fails, a monitoring agent detects the failure, elects the most up-to-date replica as the new primary, and redirects the application's connection. Well-managed failover completes in under two minutes in most cloud environments.

Rendering diagram...

The Split-Brain Problem#

The most dangerous failure scenario in replicated databases is split-brain: a situation where both the old primary and the newly promoted replica each believe they are the current leader and accept writes simultaneously. The result is two diverging databases with no way to automatically reconcile them.

Split-brain typically happens during a network partition: the primary becomes unreachable from the cluster manager (but isn't actually dead), the cluster manager promotes a replica, and then the network recovers — leaving two active primaries.

Preventing it requires one of two mechanisms:

Quorum-based election: The cluster manager requires a majority (n/2 + 1) of nodes to agree before promoting a new primary. A node isolated by a network partition can't achieve quorum alone and can't promote itself.
STONITH ("Shoot The Other Node In The Head"): Before the new primary starts accepting writes, the cluster management system forcibly powers off or network-isolates the old primary, guaranteeing only one can run.

In practice, use a managed database service (AWS RDS Multi-AZ, Aurora, Google Cloud SQL HA) unless you have a strong operational reason to self-manage. These services implement all of the above correctly and have been tested at scale. Failover in RDS Multi-AZ completes in approximately 60–120 seconds; Aurora completes in under 30 seconds.

Replication Topologies: Beyond the Basics#

Everything above describes the most common pattern: one primary, one or more replicas, all replicas connect to the primary. This is single-leader replication. There are two other topologies worth knowing about.

Topology	Who Can Accept Writes	Consistency Model	Examples	Complexity
Single-Leader (Primary/Replica)	One node only (the primary)	Strong consistency — all writes are ordered through one node; no write conflicts	PostgreSQL, MySQL, Oracle, MongoDB	Low
Multi-Leader (Active-Active)	Any leader node	Eventual consistency — concurrent writes to different leaders can conflict and must be resolved	MySQL Galera Cluster, MySQL Group Replication, CouchDB	High
Leaderless (Dynamo-style)	Any node	Tunable via quorum (w + r > n for strong; otherwise eventual)	Cassandra, DynamoDB, Riak	High

Single-leader is what you use for PostgreSQL, MySQL, and most relational databases. It is simple, correct, and avoids write conflicts entirely because only one node can receive writes.

Multi-leader allows writes to be accepted at multiple nodes simultaneously — useful for geographically distributed systems where routing all writes to a single region would add unacceptable latency. The trade-off is that write conflicts are now possible (two leaders modify the same row at the same time) and must be resolved with strategies like Last-Write-Wins or application-defined merge logic. Conflict resolution is genuinely hard to get right.

Leaderless systems (like Cassandra and DynamoDB) eliminate the concept of a primary entirely — any node can accept any read or write. Consistency is maintained through quorum: with n total replicas, a write confirmed by w nodes and a read that consults r nodes are guaranteed to share at least one node in common — meaning the read always sees the latest write — as long as w + r > n. A common configuration is n=3, w=2, r=2 — tolerating one node failure for both reads and writes.

For most developers building standard web applications on relational databases, single-leader is the right choice and the topology you'll encounter most often. Multi-leader and leaderless are patterns to understand for distributed systems interviews and for when you're evaluating NoSQL databases.

What Replication Cannot Fix#

Replication is a powerful tool, but it has hard limits. Understanding what it cannot do is as important as knowing what it can.

Problem	Does Replication Help?	Why
High read volume	Yes — route reads to replicas	Each replica can serve a full set of read queries; adding more replicas scales read capacity proportionally
Primary unavailability / crashes	Yes — promote a replica	A replica can be elected as the new primary within minutes
High write throughput	No	All writes still go to one primary; replicas receive all those writes too (they don't reduce primary write load)
Dataset too large for one machine	No	Every replica holds the full dataset — replication multiplies storage costs, it doesn't partition data
High write volume with concurrent reads	Partial — reads can be offloaded to replicas, but write throughput is unchanged	Routing reads to replicas reduces pressure on the primary, but all writes still flow through it — the write bottleneck remains

The two problems replication cannot solve — write throughput and dataset size — are exactly what sharding (horizontal partitioning) addresses. Sharding splits the dataset across multiple independent nodes, each owning a subset of the data, so both writes and storage are distributed.

The right sequence before reaching for sharding:

Rendering diagram...

Most production systems never reach Phase 3. A well-tuned primary with read replicas can handle hundreds of millions of users for read-heavy applications. Sharding introduces significant operational complexity — query routing, cross-shard joins, hot-key problems — and should be treated as a last resort reached only when you have genuinely exhausted read replica scaling and query optimization.

What AI Agents Get Wrong About Replication#

When you ask an AI agent to generate database access code, it will almost always produce code that sends every query — reads and writes alike — to a single database connection. This is correct for development, but misses the production architecture.

AI-Generated Database Code and Replication

AI agents generate functionally correct database code for a single connection. In a replicated setup, this means all reads still hammer the primary — defeating the entire purpose of having replicas. The fix is simple but requires explicit architectural guidance.

Rendering diagram...

A practical note on Prisma: Prisma supports read replicas via the @prisma/extension-read-replicas package, which routes read operations to a replica URL while sending writes to the primary. Adding it requires a few lines of configuration. The key step after enabling it is to audit any code path where a user writes and then immediately reads their own write — those reads may need to be pinned to the primary. Ask the AI to help you find those patterns after enabling the extension.

Summary#

Concept	Key Point
What replication is	Maintaining identical copies of your data on multiple servers — one primary for writes, one or more replicas for reads and failover
Sync vs. async	Async (default): fast writes, small risk of data loss on crash. Sync: slower writes, zero data loss. Most applications use async and accept the trade-off.
Replication lag	Replicas are always slightly behind the primary. Lag causes three classes of bugs: read-your-writes, monotonic reads, and consistent prefix reads. Understand and mitigate before routing production reads to replicas.
Read replicas	Route SELECT queries to replicas to multiply read capacity without scaling the primary. Most systems with a 80/20 read/write ratio benefit significantly.
Failover	When the primary fails, a replica is promoted. Use managed database HA (RDS Multi-AZ, Aurora) unless you have a reason to self-manage. Test your failover — an untested procedure is an unknown one.
Split-brain	Two nodes both believing they are primary — the most dangerous failover failure. Prevented by quorum-based election and STONITH fencing.
Replication limits	Replication scales reads and improves availability. It does not scale write throughput or reduce dataset size. Those problems require sharding.
AI agent gap	AI generates code for a single database connection. In a replicated setup, explicitly instruct the AI to implement read/write splitting and handle the read-after-write consistency case.

Replication is the most impactful, least complex database scaling technique available. Before adding architectural complexity — caching layers, sharding, NoSQL migrations — add a read replica and route your reads. The next section covers what to do when you've done that, and it still isn't enough: sharding, partitioning, and the hot-key problem.

Sources:

PreviousScaling Up vs. Scaling Out

NextScaling Data

Database Replication

Primary/Replica Architecture

Replication Lag Consistency Anomalies

Automated Failover

AI-Generated Database Code and Replication

Arch Advisor