Scaling Data: Sharding, Partitioning, and the Hot Key Problem
In the previous section, we established that replication solves two problems: spreading read load and surviving primary failures. But replication has hard limits — all writes still flow through one primary, and every replica stores the full dataset. When you hit those limits, you need sharding: splitting the data itself across multiple independent database servers, so that both the write load and the storage are divided among them.
This section explains what sharding is, how to choose a sharding strategy, what consistent hashing solves, and why the hot key problem is one of the most dangerous traps waiting for you on the other side.
Partitioning vs. Sharding: Clearing Up the Terminology#
These two terms are used interchangeably in most engineering discussions, but there is a precise distinction worth knowing.
Partitioning is the general concept of dividing a dataset into smaller segments. It can happen within a single database server — PostgreSQL's native table partitioning, for example, splits one large table into smaller physical sub-tables on the same machine. This improves query performance and maintenance (you can drop an entire partition to archive old data instantly) without adding any network complexity.
Sharding is horizontal partitioning across multiple independent database servers. Each shard is a fully self-contained database instance that owns a distinct subset of the data. No single server holds the complete dataset — it is genuinely distributed across the fleet.
| Partitioning (within one server) | Sharding (across servers) | |
|---|---|---|
| What it splits | One table into sub-tables on the same machine | The dataset across multiple independent database servers |
| Primary benefit | Faster queries, easier archiving, better index efficiency | Higher write throughput, unlimited horizontal storage growth |
| Complexity | Low — the database handles it transparently | High — your application must route queries to the correct shard |
| Failure scope | Server failure still loses all data (use replication) | Server failure only loses one shard's data |
| When to use | Tables with hundreds of millions of rows on a single server | Write bottleneck or dataset too large for any single server |
In practice, "sharding" is what most engineers mean when they say their system is "partitioned across nodes." The rest of this section focuses on sharding — distributed partitioning across multiple servers — because that is where the hard engineering problems live.
Horizontal vs. Vertical Partitioning#
Before choosing a sharding strategy, it helps to understand the two axes along which data can be split:
Vertical partitioning splits a table by columns — moving infrequently accessed or bulky columns (like profile bios or preference JSON blobs) into separate tables or even separate databases. This is essentially an extension of normalization. Your application still talks to one server; you have just reorganized what lives on it. It is a useful, low-risk first step, but it does not help when the table itself has too many rows.
Horizontal partitioning (sharding) splits a table by rows — each shard holds a different subset of records, identified by a shard key (also called a partition key). This is the technique that scales write throughput and storage capacity, and it is the one that introduces serious complexity into your system.
Sharding Strategies#
The central question in sharding is: given a shard key (e.g., user_id), which shard does this row live on? There are three main strategies.
Range-Based Sharding
Data is divided into contiguous ranges of the shard key value. Shard 1 holds user IDs 1–1,000,000, Shard 2 holds 1,000,001–2,000,000, and so on. Simple to understand, but creates uneven load over time.
Hash-Based Sharding
A hash function is applied to the shard key, and the result (modulo the number of shards) determines the target shard. This distributes data evenly but makes range queries expensive and resharding disruptive.
Directory-based sharding is a third approach: a separate lookup service (a database or key-value store) explicitly maps each shard key to its shard. When the application receives a query for user_id=1234, it first consults the directory — "which shard holds user 1234?" — and is told "Shard 3." This is the most flexible approach: you can relocate individual keys between shards without any key remapping (just update the directory entry). The downside is that the directory itself becomes a bottleneck and a single point of failure. It is most commonly used in systems that need fine-grained control over data placement, such as multi-tenant SaaS platforms that want to co-locate all data for one customer on one specific shard.
Consistent Hashing: Solving the Resharding Problem#
Naive hash sharding (hash(key) % N) has a critical flaw: when you add or remove a shard (changing N), almost every key gets remapped to a different shard. In a system with 4 shards and 1 million keys, adding a 5th shard with % 5 causes roughly 80% of all keys to point to a different shard than before. That means a massive, disruptive data migration just to add one node.
Consistent hashing solves this. The idea:
- Both shard nodes and data keys are hashed onto the same circular space (the "ring"), with positions ranging from 0 to 2³². Think of it like a clock face, but with billions of tick marks instead of 12.
- To find a key's shard, start at the key's position on the ring and move clockwise until you reach the first shard node. That shard owns the key. In the diagram above, Key D (hash=100) wraps around past the end of the ring and lands on Shard 1 (hash=20) — the first node it encounters going clockwise.
- When you add a shard, it is inserted at a position on the ring and takes over only the keys that fall between it and the previous shard. All other shards are completely unaffected.
- When you remove a shard, its keys are inherited by the next shard clockwise. Again, only one shard's worth of data needs to move.
With consistent hashing, adding or removing a shard migrates only about 1/N of the total data — a dramatic improvement over the near-total remapping that naive modulo hashing requires.
Virtual nodes (vnodes) extend consistent hashing to fix the uneven load problem. In a basic consistent hashing ring, each physical shard occupies a single position, which can leave one shard responsible for a disproportionately large arc of the ring. Vnodes give each physical shard multiple positions on the ring — typically 100–200 virtual positions per shard. The result is a much more even distribution of keys, and when a shard is added or removed, its load is drawn from all existing shards rather than from just one neighbor. Cassandra and DynamoDB both rely on virtual nodes internally.
| Approach | How Keys Are Distributed | Adding a Shard Remaps... |
|---|---|---|
Naive modulo hash(k) % N | Even (if hash is good) | ~(N−1)/N of all keys — almost everything moves |
| Consistent hashing | Even (if ring positions are spread) | ~1/N of all keys — only one neighbor's keys move |
| Consistent hashing + vnodes | Very even (multiple ring positions per shard) | ~1/N of all keys, drawn evenly from all shards |
The Hot Key Problem#
You've sharded your data. Traffic is distributed across shards. Everything looks balanced in your monitoring dashboards — until a celebrity posts on your platform, a product goes viral, or a daily cron job hammers one particular record. Suddenly, one shard is receiving 90% of the traffic while the other seven sit nearly idle. This is the hot key problem (also called a hotspot).
The fundamental insight here is that sharding distributes data, but it cannot distribute popularity. You can split a billion rows across a hundred shards perfectly evenly, and still have one shard melt down because a single record within it is overwhelmingly more popular than everything else.
The Hot Key Problem
A hot key occurs when a disproportionate amount of reads or writes target a single shard key — overloading that shard regardless of how many shards exist. The famous example: a celebrity with 50 million followers posts something. Every follower's request hits the one shard that holds that celebrity's data.
How Hot Keys Happen#
Popularity skew is the most common cause. In any large system, a small fraction of items receives the vast majority of traffic — this is Zipf's law in action. In practice, the top 0.01% of users, products, or posts can generate more traffic than the remaining 99.99% combined. No amount of sharding can fix this if a single key is genuinely that popular.
Temporal clustering is another cause. If your shard key is a timestamp or date, all writes for "today" land on the most recent shard, and all reads of recent data also hit that shard. One shard will be perpetually hot as long as "now" exists as a concept in your dataset.
Sequential keys cause write hotspots. Auto-incrementing IDs mean every new row is inserted into the last shard. Across millions of writes per day, this shard receives essentially all insert traffic while every other shard handles only reads of historical data. This is the single most common anti-pattern for shard keys.
Four Ways to Fix a Hot Key#
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Add a cache | Put Redis or a CDN in front of the hot key. Most reads are served from cache without touching the database at all. | Read-heavy hot keys with data that can tolerate brief staleness (product pages, user profiles, post counts) | Introduces cache invalidation complexity — you must update or expire the cache whenever the underlying data changes |
| Key splitting (write fan-out) | Append a random suffix to the shard key: product:123 becomes product:123:0 through product:123:9. Writes are distributed across all 10 sub-keys. On read, query all 10 sub-keys and merge the results (e.g., sum the counts). | Write-heavy keys like view counters, like counts, or inventory adjustments | Reads now require querying multiple sub-keys and aggregating — increases read complexity and latency |
| Dedicated shard replicas | Add read replicas specifically for the hot shard (not globally). Route that shard's read traffic to these dedicated replicas. | Read-heavy hotspots where the data changes too frequently to cache effectively | Per-shard replicas add infrastructure cost and operational overhead |
| Move the hot key | Detect the hot key and explicitly migrate it to its own isolated shard or a dedicated cache cluster, separate from the regular sharding scheme. | Permanently popular keys that need long-term isolation (e.g., a platform's most active accounts or top-selling products) | Requires a directory service or special-cased routing logic; adds ongoing operational complexity |
The typical production playbook: start with caching (cheapest and fastest to implement — a cache hit never touches the database). If the key is write-heavy and can't be cached, use key splitting. If the key must be consistent and is updated frequently, add dedicated replicas for that shard. As a last resort, isolate the key to its own infrastructure.
Cross-Shard Challenges#
Sharding distributes data effectively, but most of the complexity it introduces comes from operations that need to touch data on multiple shards at once. Each of these is a problem that does not exist in a single-database setup.
Cross-Shard Joins#
In a single-database setup, a JOIN is a single SQL query that the database engine executes internally and efficiently. When data is sharded across five servers, no single query engine can see all five at once. Cross-shard joins must be assembled in the application layer: fetch the relevant data from each shard independently, then merge the results in your application code.
// Pseudocode: application-level cross-shard join
const userIds = [101, 205, 388]; // these IDs live on different shards
// Fan out — query each shard in parallel
const results = await Promise.all(
userIds.map(id => queryShardFor(id))
);
// Merge results in application code
const users = results.flat();
This works, but it is slower and more error-prone than a database JOIN. The standard mitigation is to design your shard key so that related data lives on the same shard. If users and their orders both use user_id as the shard key, all of a user's orders will reside on the same shard as the user — no cross-shard join needed for the most common access pattern.
Scatter-Gather Queries#
Some queries cannot be targeted to a specific shard. "Find all users who signed up in the last 7 days" has no shard key to route on — the answer could be spread across every shard. These queries must be sent to all shards (scatter), and the partial results must be collected and merged (gather). This is called a scatter-gather query.
Scatter-gather is expensive in two ways. First, its total latency equals the slowest shard's response time, not the average — one slow shard holds up the entire query. Second, the application must hold and sort the full merged result set in memory. The standard mitigation is to keep scatter-gather queries off your hot path: run them as background jobs, route them to a dedicated analytics replica, or maintain a secondary index that is not sharded.
Distributed Transactions#
ACID transactions — where a sequence of writes either all succeed or all fail — are straightforward within a single database. Across shards, they require a distributed transaction protocol, most commonly two-phase commit (2PC):
- Phase 1 (Prepare): A coordinator node sends a "can you commit?" message to all shards involved in the transaction. Each shard locks the affected rows and responds with "yes" or "no."
- Phase 2 (Commit/Abort): If every shard voted "yes," the coordinator sends a commit message to all of them. If any shard voted "no," the coordinator sends an abort message and all shards release their locks.
2PC is correct but slow — it requires multiple network round trips and holds locks across shards for the entire duration of the protocol. It is also fragile: if the coordinator crashes after Phase 1 but before Phase 2, the participating shards are stuck holding locks indefinitely, waiting for a commit or abort message that never arrives. Recovering from this state requires manual intervention or a timeout-based fallback.
For these reasons, most sharded systems are designed to avoid cross-shard transactions entirely by ensuring that any operation requiring atomicity touches only one shard. This is a fundamental constraint that should shape your shard key design from the very beginning — before any code is written.
Resharding: Adding Capacity Over Time#
Eventually, your shards will fill up or become overloaded, and you will need to add more. This process is called resharding (or rebalancing). It is one of the most operationally risky operations in a distributed database, because it involves moving large amounts of data between nodes on a live, production system.
Zero-Downtime Resharding
Adding shards to a live system requires moving data between nodes without dropping any in-flight requests. The standard approach is a double-write migration: keep the old shards serving traffic while copying data to the new layout in the background, then cut over once the new shards are fully verified.
The best way to minimize resharding pain is to over-provision logical shards early. Instead of starting with 4 physical shards and needing to reshard to 8 later, start with 64 logical shards distributed across 4 physical servers. When you need more capacity, add physical servers and move some of the existing logical shards onto the new hardware — no key remapping is required, just data movement. Because the key-to-shard mapping doesn't change (only the shard-to-server mapping does), the migration is far simpler and less risky. This is the approach used by Vitess (the MySQL sharding layer developed at YouTube) and is one of the core reasons DynamoDB relies on virtual nodes.
Choosing Your Shard Key: The Decision That Matters Most#
The shard key — the field used to determine which shard a row belongs to — is the most consequential design decision in a sharded system. A poor choice of shard key can create permanent hotspots, force every query to scatter across all shards, or make cross-shard transactions unavoidable. And unlike most technical decisions, it is very difficult and expensive to change after the system is live.
| Property | What to Look For | Anti-Patterns to Avoid |
|---|---|---|
| High cardinality | Enough distinct values to spread data across all shards (millions of user IDs, order IDs, etc.) | Boolean fields, status enums, or fields with few distinct values — they can create at most a handful of possible shards |
| Even distribution | Values are spread uniformly across their range — not clustered at one end | Auto-incrementing IDs or current timestamps — new writes always go to the last shard |
| Query alignment | Most of your common queries include the shard key, so you can route to one shard without scatter-gather | Sharding on user_id when most queries filter by product_id — almost every query becomes a scatter-gather |
| Co-location | Related data that's frequently accessed together shares the same shard key, so joins stay within one shard | Sharding users by user_id but sharding orders by order_id — user-with-orders queries always cross shard boundaries |
The most effective shard keys are typically entity IDs (user ID, tenant ID, organization ID): they have high cardinality, distribute evenly when hashed, and co-locate all data for one entity on one shard. In multi-tenant SaaS applications, sharding by tenant_id is a particularly strong pattern — it ensures all data for one customer stays on one shard, keeping queries efficient and data isolation clean.
What AI Agents Get Wrong About Sharding#
When you describe a scaling problem to an AI agent, it will often jump straight to recommending sharding as the solution — skipping replication, caching, and query optimization entirely. And when it generates sharded data access code, it consistently makes the same mistakes.
AI-Generated Code and Sharding
AI agents tend to suggest sharding too early and implement it incorrectly. The two most common gaps: generating a shard key without analyzing your actual query patterns, and writing data access code that ignores cross-shard consistency and transaction boundaries.
Summary#
| Concept | Key Point |
|---|---|
| Partitioning vs. sharding | Partitioning splits data within one server (logical); sharding splits data across multiple independent servers (distributed). Sharding is what you reach for when replication hits its limits. |
| Horizontal vs. vertical partitioning | Vertical splits by columns (moves wide, infrequently-accessed columns to separate tables); horizontal splits by rows (sharding). Only horizontal partitioning scales write throughput and storage. |
| Range sharding | Efficient for range queries; creates hotspots on sequential or time-based keys. Best suited for time-series data with natural archival patterns. |
| Hash sharding | Distributes data evenly; breaks range queries; naive modulo resharding is extremely disruptive. Use consistent hashing to limit data movement when shard counts change. |
| Consistent hashing | Maps keys and shards onto the same ring. Adding a shard moves only ~1/N of keys. Virtual nodes give each physical shard multiple ring positions for more even distribution. |
| Hot key problem | One shard receives disproportionate traffic regardless of overall data distribution. Caused by popularity skew, temporal clustering, or sequential keys. Fix with caching, key splitting, or dedicated replicas for the hot shard. |
| Cross-shard operations | Joins, transactions, and range queries that span shards must be assembled in application code. This is slow and complex. Design your shard key so common operations stay within one shard. |
| Resharding | Adding shards to a live system is operationally risky. Use double-write migration for zero downtime. Consistent hashing minimizes data movement. Over-provisioning logical shards early avoids frequent resharding. |
| Shard key design | The most consequential sharding decision. Requires high cardinality, even distribution, query alignment, and data co-location. Don't let an AI choose it — it requires deep knowledge of your access patterns. |
| AI agent gap | AI recommends sharding too early and chooses naive shard keys. Profile first, exhaust replication and caching, then specify the shard key yourself and use AI to implement the routing logic. |
Sharding is powerful, but it is also the point at which a system's complexity jumps by an order of magnitude. Every operation that was once trivial — a transaction, a join, a range query — now requires explicit thought about shard boundaries. The engineers who succeed with sharded systems are not the ones who shard earliest; they are the ones who stayed on a single, well-optimized database the longest, and only reached for sharding when every other option had been genuinely exhausted.
Sources:
- Understanding Database Sharding — DigitalOcean
- Database Sharding — ByteByteGo System Design 101
- A Comprehensive Guide to Database Partitioning — freeCodeCamp
- Consistent Hashing Explained — Tom White
- Consistent Hashing and Random Trees — Karger et al.
- Chapter 6. Partitioning — Designing Data-Intensive Applications (Shichao's Notes)
- How Vitess Handles Resharding — Vitess Docs
- DynamoDB's Approach to Table Partitioning — AWS
- Hot Partition Problem and Solutions — Cloudflare Blog
- Sharding Pinterest: How We Scaled Our MySQL Fleet — Pinterest Engineering