Distributed File Storage (S3-like Object Storage)

Object storage lets you upload any file — a 2 KB JSON config, a 500 MB video, a 5 GB database backup — and retrieve it later using a simple key. You give the system a name (the key) and it gives you back the bytes, no matter how big the file is or how many files the system already holds. Amazon S3, Google Cloud Storage, and Azure Blob Storage all work this way.

Think of it like a planet-sized filing cabinet. You label each folder (the key), drop your file inside, and the cabinet stores it across thousands of drawers in dozens of buildings so that even if an entire building burns down, your file survives intact. You never need to know which drawer holds your file — you just ask for it by name and the system figures out where the pieces are.

What makes this deceptively hard is the combination of extreme durability (S3 promises 99.999999999% — eleven nines — meaning you would statistically lose one object out of 10 million over a 10,000-year period), massive scale (trillions of objects, exabytes of data), and the fundamental separation between metadata and data (where a file is vs. what a file contains are completely different problems at scale).

This case study follows the same eight-step framework:

  • Clarify constraints (Steps 1-2) — What does the system do, and how much data must it handle?
  • High-level design (Step 3) — What are the major components and how do they connect?
  • Deep dives (Steps 4-7) — How do the trickiest parts actually work?
  • Trade-offs (Step 8) — What did we give up, and when would we choose differently?

Step 1: Clarify Requirements#

Functional Requirements#

These describe what the system does.

FeatureDescriptionPriority
Upload objectUpload any file up to 5 GB using a bucket name and object key; the system stores it durablyCore
Download objectRetrieve a complete file by its bucket + key; support range reads (download bytes 1000-2000 of a file)Core
Delete objectRemove an object by key; storage is eventually reclaimedCore
List objectsList objects in a bucket with prefix filtering and pagination (like browsing a folder hierarchy)Core
Multipart uploadUpload large files (>100 MB) in parallel parts that are assembled server-side on completionCore
VersioningOptionally keep all previous versions of an object; retrieve or restore any version by version IDOptional
Lifecycle policiesAutomatically transition objects to cheaper storage tiers or delete them after N daysOptional
Pre-signed URLsGenerate time-limited URLs that allow unauthenticated users to upload or download a specific objectOptional

Non-Functional Requirements#

These describe how well the system works.

PropertyRequirementWhy It Matters
Durability99.999999999% (eleven 9s) — statistically lose 1 object per 10 million stored over 10,000 yearsData loss is permanent and catastrophic — customers trust the system with irreplaceable data like backups, medical images, and financial records
Availability99.99% (four 9s) — under 53 minutes of downtime per yearApplications depend on storage being reachable; even brief outages cascade to every service that reads or writes objects
LatencyTime to first byte under 100ms for downloads from the same region; under 200ms globally via CDN edge cachesObject storage backs websites, mobile apps, and data pipelines — users feel delays above 200ms
ScalabilityHandle billions of objects per bucket; support exabytes of total data across the systemCustomers range from a startup with 1,000 objects to an enterprise with 100 billion — the system must handle both without architectural changes
ThroughputSupport 100K+ upload requests/sec and 500K+ download requests/sec at peakBurst traffic patterns — a data pipeline ingesting millions of sensor readings, or a CDN pulling assets during a viral event — must not degrade service
ConsistencyRead-after-write consistency for new object PUTs; eventual consistency acceptable for overwrites and deletesA process that uploads a file and immediately reads it back must see the new version — not a stale or missing response

The single most important non-functional requirement is durability. Users can tolerate brief unavailability — they retry and it works a minute later. They cannot tolerate data loss. Every architectural decision in this system prioritizes durability above all else. Availability is second. Latency is third.

Step 2: Back-of-the-Envelope Estimation#

MetricCalculationResult
Objects storedAssumed medium-large deployment500 million objects
New uploads per dayAssumed10 TB/day
Average object size10 TB/day ÷ ~20M new objects/day~500 KB (median much smaller; large files skew the average)
Upload QPS (average)20M objects/day ÷ 86,400 seconds~230 uploads/second
Upload QPS (peak, 5x)230 × 5 (burst from data pipelines)~1,150 uploads/second
Download QPS (average)Reads dominate; assume 5:1 read-to-write ratio~1,150 downloads/second
Download QPS (peak, 5x)1,150 × 5~5,750 downloads/second
Total storage (Year 1)10 TB/day × 365 days~3.6 PB raw data
With erasure coding overhead (1.5x)3.6 PB × 1.5~5.4 PB total disk
Metadata per object~500 bytes (bucket ID, key, size, checksum, location pointers, timestamps)
Total metadata500M objects × 500 bytes~250 GB — fits comfortably in a distributed database
Network bandwidth (uploads)10 TB/day ÷ 86,400 seconds~120 MB/s sustained ingress
Network bandwidth (downloads, peak)5× upload bandwidth for peak reads~600 MB/s egress at peak

Key insight from the math: Metadata is small (250 GB) but is accessed on every single request — every upload, download, list, and delete starts by looking up metadata. The data itself is enormous (petabytes) but each object is accessed infrequently. This asymmetry is the reason the architecture must separate metadata from data into two distinct services optimized for their different access patterns.

Why erasure coding uses 1.5x, not 3x: Traditional replication stores three full copies of every byte — a 3x overhead. Erasure coding (which we will deep dive into in Step 4) achieves the same or better durability with only ~1.5x overhead by storing computed "parity" fragments instead of full copies. At petabyte scale, this difference saves millions of dollars in storage hardware per year.

Step 3: High-Level Design#

The foundational principle: separate the metadata path from the data path. When a client uploads a file, the metadata (where the file lives, its size, checksum) goes to one service, and the actual bytes go to another. This separation is why object storage scales to exabytes — the metadata service handles billions of small, fast lookups while the data service handles petabytes of large, sequential reads and writes independently.

Rendering diagram...

What each component does:

  • API Gateway — Authenticates requests (signature-based auth like AWS SigV4), enforces rate limits per account, and routes to the correct internal service. Also terminates TLS and handles request validation (is the bucket name valid? is the object key within length limits?).
  • Metadata Service — The brain of the system. For every object, it stores the mapping from (bucket_id, object_key) to the physical storage locations where the object's data fragments live. It handles LIST operations entirely on its own (no data service involved) and coordinates with the data service during uploads and downloads to locate data.
  • Metadata DB — A distributed key-value store (DynamoDB, Cassandra, or a sharded PostgreSQL cluster) that stores one row per object version. The partition key is bucket_id + object_key, giving even distribution and fast point lookups. At 500 million objects and ~500 bytes per row, the total dataset is ~250 GB — comfortably handled by any distributed database.
  • Data Service — Manages the actual reading and writing of file bytes to storage nodes. On upload, it splits the file into chunks, applies erasure coding, and distributes the fragments across storage nodes in different failure domains. On download, it reads the required fragments and reassembles the file.
  • Storage Nodes — Commodity servers with large local disk arrays (HDDs for capacity, SSDs for hot data). Each node manages its local disk space, reports health metrics, and serves read/write requests from the data service. Nodes are organized into failure domains — different racks, power supplies, and network switches — so that a single hardware failure cannot destroy all fragments of an object.

API Design#

EndpointMethodRequestResponse
PUT /{bucket}/{key}PUTObject bytes in body; metadata in headers (Content-Type, x-checksum-sha256)200 OK with ETag (MD5 or SHA-256 of stored object)
GET /{bucket}/{key}GETOptional Range header for partial reads200 OK with object bytes; 206 Partial Content for range requests
DELETE /{bucket}/{key}DELETEOptional x-version-id header to delete a specific version204 No Content
GET /{bucket}?prefix=photos/&marker=photo_099.jpg&max-keys=1000GETQuery params for prefix filter, pagination cursor, and page size200 OK with XML/JSON listing of matching keys, sizes, and timestamps
POST /{bucket}/{key}?uploadsPOSTInitiates a multipart upload200 OK with upload_id for subsequent part uploads
PUT /{bucket}/{key}?partNumber=3&uploadId=abcPUTOne part's bytes in body200 OK with ETag for that part
POST /{bucket}/{key}?uploadId=abcPOSTList of part numbers and ETags to finalize200 OK — object is now readable

Why PUT and not POST for uploads? PUT is idempotent — uploading the same object to the same key twice produces the same result (the second upload overwrites the first). This matches the semantics of object storage: the key is the identity, and the operation is "place this content at this key." POST is used only for operations that create server-side state with a new identifier (like initiating a multipart upload, which generates a unique upload_id).

Metadata Schema#

Object Metadata Record:
  bucket_id       UUID          -- the bucket this object belongs to
  object_key      VARCHAR(1024) -- the full key path (e.g. "photos/2026/may/sunset.jpg")
  version_id      UUID          -- unique per version (for versioned buckets)
  size_bytes      BIGINT        -- total object size
  checksum_sha256 VARCHAR(64)   -- integrity check; verified on every read
  content_type    VARCHAR(128)  -- MIME type (e.g. "image/jpeg")
  created_at      TIMESTAMPTZ
  storage_class   VARCHAR(16)   -- STANDARD, INFREQUENT, ARCHIVE
  fragment_map    JSONB         -- ordered list of {node_id, disk_id, offset, length}
                                --   for each erasure-coded fragment
  is_delete_marker BOOLEAN      -- for versioned buckets: marks a logical delete

  Primary Key: (bucket_id, object_key, version_id)
  Partition Key: bucket_id + object_key
  Sort Key: version_id (latest version served by default)

The fragment_map is the critical field — it tells the data service exactly which storage nodes hold which fragments of this object. A typical 1 MB object stored with 6+3 erasure coding has 9 entries in the fragment map, each pointing to a different storage node in a different failure domain.

Upload and Download Flows#

Rendering diagram...

Why data is written before metadata: The upload flow writes object bytes to storage nodes first, then writes the metadata record. If the process crashes after writing data but before writing metadata, the data fragments become orphans — unreferenced bytes on disk. This is harmless: the garbage collector (covered in Step 7) eventually reclaims them. The alternative — writing metadata first — is dangerous: the metadata would point to data that does not exist, causing download failures. Dangling data is safe; dangling metadata is broken.

Step 4: Deep Dive — Data Partitioning and Durability#

How do you store a file so that it survives disk failures, server failures, and even entire data center outages? This is the core durability problem, and there are two competing approaches.

Approach A: Triple Replication#

The simplest strategy: store three complete copies of every object on three different storage nodes in three different failure domains (different racks, different power supplies).

Rendering diagram...

This is how the Google File System (GFS) originally worked. It is simple to understand, simple to implement, and fast to read (any replica can serve the full file). But it costs 3x the raw storage — every petabyte of data requires three petabytes of disk.

Approach B: Erasure Coding#

Erasure coding is a mathematical technique borrowed from satellite communications. Instead of storing full copies, you split the original data into k data fragments and compute m parity (redundancy) fragments, producing k + m total fragments. You can reconstruct the original data from any k of the k + m fragments — meaning up to m fragments can be lost (due to disk failures, node crashes, or entire rack outages) and you still recover everything.

Think of it like a jigsaw puzzle with extra pieces. If you have a 6-piece puzzle and you create 3 additional "backup" pieces calculated from the original 6, you now have 9 pieces total. Even if you lose any 3 of those 9 pieces, the remaining 6 contain enough information to reconstruct the full picture.

Rendering diagram...

With a 6+3 scheme, the storage overhead is (6+3)/6 = 1.5x — half the cost of triple replication, with the ability to survive 3 simultaneous fragment losses.

Triple Replication vs. Erasure Coding

This is the most consequential durability decision in the entire system. Replication is simpler and faster for reads; erasure coding is dramatically cheaper and can achieve higher durability at scale. Almost every production object storage system at scale uses erasure coding.

Rendering diagram...

Step 5: Deep Dive — Metadata Design#

The metadata service is the brain of the entire system. Every operation — upload, download, delete, list — starts with a metadata lookup. If the metadata service is slow, the entire system is slow. If it is down, the entire system is down. If it loses data, objects become permanently inaccessible even though the data still exists on storage nodes.

Rendering diagram...

Why consistent hashing? With 500 million objects across potentially hundreds of metadata nodes, you need a way to determine which node stores the metadata for a given key without consulting a central directory. Consistent hashing maps each (bucket_id, object_key) to a position on a virtual ring, and the nearest metadata node clockwise on the ring owns that key. When a node is added or removed, only the keys adjacent to it on the ring are redistributed — roughly 1/N of all keys, rather than reshuffling everything. This makes scaling the metadata tier smooth and predictable.

LIST Operation: The Hidden Performance Challenge#

The LIST operation — "give me all objects in this bucket with prefix photos/2026/" — is fundamentally different from point lookups. A point lookup (GET /bucket/key) hits one partition. A LIST must scan across potentially all partitions because objects with the same prefix may hash to different nodes.

Rendering diagram...

Production systems optimize LIST by maintaining a secondary index sorted by prefix within each bucket, so that prefix-filtered listing requires scanning only the relevant range rather than all keys. DynamoDB handles this naturally with a sort key; Cassandra achieves it with clustering columns.

Metadata Store: SQL vs. NoSQL at Scale

The metadata store must handle billions of rows with sub-millisecond point lookups and support efficient prefix scans for LIST operations. The choice between SQL and NoSQL is driven by the scale of objects and the access pattern.

Rendering diagram...

Step 6: Deep Dive — The Upload Path and Multipart Upload#

The upload path must handle everything from a 1 KB config file to a 5 GB database backup, and it must guarantee that once the system acknowledges the upload, the data is durable — even if a server crashes one second later.

Small Object Upload (Single PUT)#

Rendering diagram...

Write-ahead logging (WAL): Before the data service confirms fragment writes to the metadata service, it logs the operation to a write-ahead log — a sequential, append-only file on stable storage. If the data service crashes mid-write, it reads the WAL on recovery to determine which fragments were successfully written and which need to be retried or cleaned up. This is the same technique databases use to guarantee durability across crashes.

Large Object Upload (Multipart)#

For files larger than 100 MB, a single PUT is risky — a network hiccup 4 GB into a 5 GB upload means starting over from zero. Multipart upload solves this by splitting the work into independently uploadable parts.

Rendering diagram...

Why multipart upload matters:

  • Parallelism — Upload 10 parts simultaneously to different storage nodes. Multiple TCP connections overcome single-connection bottlenecks (TCP slow start, per-connection server-side rate limits) and can better utilize available bandwidth. If each storage node throttles individual connections, parallel parts to different nodes bypass that limit.
  • Resilience — If part 7 of 10 fails, retry only part 7. Without multipart, a failure at 4.5 GB into a 5 GB upload means re-uploading the entire 5 GB.
  • Resumability — Parts already uploaded survive client crashes. The client can resume by querying which parts are already complete and uploading only the missing ones.

Single PUT vs. Multipart Upload

The upload strategy depends on object size. Small objects use a single PUT for simplicity; large objects use multipart for resilience and parallelism. The threshold is typically 100 MB — below that, the overhead of multipart (initiate, complete, track parts) is not justified.

Rendering diagram...

Step 7: Deep Dive — Consistency and Garbage Collection#

Object storage must answer a deceptively tricky question: when a client uploads a file and immediately reads it back, do they see the new file or the old one (or nothing at all)? And when a client deletes a file, when is the storage actually freed?

Consistency Model#

Rendering diagram...

Why the difference? A new object PUT creates a new metadata record. The system returns 200 OK only after the metadata is durably written to a quorum of metadata nodes. Any subsequent read will find this record — it either exists or it does not.

An overwrite updates an existing record. The old version's metadata may still be cached or replicated on some nodes. Until all replicas converge, different reads may see different versions. The convergence window is typically sub-second, but the system does not guarantee it.

Note on S3's evolution: Amazon S3 originally had eventual consistency for all operations, including new object PUTs. In December 2020, S3 upgraded to strong read-after-write consistency for all operations at no additional cost — a major engineering achievement that required rearchitecting the metadata tier. Our design follows the simpler approach of strong consistency for new PUTs and eventual consistency for overwrites, which is easier to implement and sufficient for most use cases.

Garbage Collection#

Garbage collection in a distributed storage system is more complex than memory garbage collection in a programming language, but the core idea is the same: find data that is no longer referenced and reclaim the space.

Rendering diagram...

The 72-hour grace period is critical. Without it, the garbage collector could delete fragments that belong to an upload that is still in progress. Imagine: a client starts a large multipart upload, the data service writes fragments to disk, but the metadata service has not yet recorded them. If the garbage collector runs during this window, it sees unreferenced fragments and deletes them — destroying an active upload. The grace period ensures that any in-flight operation has ample time to complete before its fragments become eligible for deletion.

Consistency and Garbage Collection Trade-offs

Strong consistency everywhere would simplify the programming model but would require synchronous coordination across all metadata replicas on every write — killing throughput and availability. Aggressive garbage collection would reclaim space quickly but risks deleting data that is still being referenced by in-flight operations. The design balances both with targeted consistency guarantees and a conservative GC grace period.

Rendering diagram...

Step 8: Trade-offs & Production Reality#

Every design decision involves a trade-off. Here is a consolidated view of the choices made in this design and when you would choose differently.

DecisionChoiceWhat We Gave UpChoose Differently When...
Durability strategyErasure coding (6+3) for most dataHigher read latency (must fetch k fragments and decode) and CPU overhead compared to simple replicationUse triple replication for hot, frequently-read small objects where read latency is critical; use a hybrid approach with lifecycle tiering
Metadata storeDynamoDB / Cassandra (NoSQL)No cross-partition transactions; eventual consistency for some operationsUse sharded PostgreSQL if your team has deep SQL expertise and the object count stays under 1 billion
Consistency modelStrong read-after-write for new PUTs; eventual for overwrites/deletesOverwrites may briefly serve stale dataInvest in strong consistency for all operations (like S3 did in 2020) if your use case cannot tolerate any staleness, at the cost of higher metadata coordination overhead
Upload strategySingle PUT < 100 MB; multipart aboveMultipart adds server-side state and complexity for large uploadsLower the threshold to 10 MB if clients are on unreliable networks with frequent disconnections
GC grace period72 hours before deleting unreferenced fragmentsDead data occupies disk for up to 3 days after deletionShorten to 24 hours if storage cost pressure is extreme, but only after verifying no uploads take longer than 24 hours to complete
Metadata-data separationTwo distinct services with different scaling propertiesAdded operational complexity — two services to deploy, monitor, and scaleA monolithic design is simpler for very small deployments (< 10 million objects) but hits a scaling ceiling quickly

What This Design Intentionally Excludes#

Production object storage systems include additional layers that are out of scope for this case study:

  • Cross-region replication — automatically copying objects to a geographically distant region for disaster recovery. Adds a replication pipeline, conflict resolution for concurrent writes to the same key in different regions, and bandwidth cost management.
  • Storage tiering and lifecycle management — automatically moving objects to cheaper storage classes (Infrequent Access, Glacier/Archive) based on access patterns. Requires access-frequency tracking, a policy engine, and a background data migration system.
  • Server-side encryption — encrypting objects at rest with customer-managed keys (SSE-KMS) or server-managed keys (SSE-S3). Adds key management infrastructure and per-request encryption/decryption overhead.
  • Event notifications — triggering Lambda functions or SQS messages when objects are created or deleted. Requires an event bus integrated with the metadata service.
  • Access control policies — bucket policies, IAM roles, and access control lists (ACLs) that govern who can read/write which objects. Adds a policy evaluation engine on the critical request path.

Each of these adds significant systems without fundamentally changing the core metadata-data separation architecture described here.

Summary#

ComponentDesign DecisionKey Reasoning
Data durabilityErasure coding (6+3) across independent failure domains1.5x storage overhead vs. 3x for replication; survives up to 3 simultaneous fragment losses; cost savings of millions of dollars per year at petabyte scale
Metadata storeDynamoDB / Cassandra with consistent hashingAutomatic partitioning handles billions of objects; composite key (bucket_id, object_key) enables efficient point lookups and prefix-filtered LIST operations
Metadata-data separationTwo distinct services with independent scalingMetadata is small and accessed on every request; data is large and accessed infrequently — optimizing both with a single architecture is impossible
Upload pathSingle PUT for small objects; multipart for large objects (> 100 MB)Multipart enables parallel upload, per-part retry, and resumability — a 5 GB upload over an unreliable connection becomes practical instead of impractical
Write orderingData written before metadata; garbage collector handles orphansDangling data is safe and recoverable; dangling metadata causes download failures — the system fails toward the safe side
ConsistencyRead-after-write for new PUTs; eventual for overwrites and deletesNew objects must be immediately readable after acknowledgment; overwrites and deletes converge within sub-second windows
Garbage collectionPeriodic scan with 72-hour grace periodConservative grace period prevents premature deletion of in-flight upload fragments; dead data cost is negligible relative to total storage
Fragment placementSpread across racks, power domains, and network switchesA single hardware failure — rack power loss, switch failure, disk death — must never destroy multiple fragments of the same object

Distributed file storage is one of the most foundational systems in cloud computing — nearly every other cloud service depends on it. The design process surfaces decisions that reappear across all distributed systems: how to achieve durability without prohibitive cost (erasure coding), how to separate concerns that scale differently (metadata vs. data), how to handle partial failures gracefully (write-ahead logging, garbage collection), and how to choose the right consistency guarantee for each operation. Master these patterns here, and you will recognize them in every distributed system you encounter.

Sources: