Storage: Choosing Where Your Data Lives

Every application stores data. The question is not whether to store it, but where, how, and why — because every storage technology makes a different set of trade-offs between consistency, scalability, query flexibility, and cost. Choosing the wrong one for the wrong workload doesn't break your app today, but it creates a ceiling on what your app can do tomorrow.

There are three fundamental categories of storage that appear in virtually every production system:

Relational databases (SQL): Structured data with relationships, transactions, and arbitrary queries — your source of truth.
NoSQL databases: A broad family — document databases (MongoDB) offer flexible schemas and rich queries; key-value and wide-column stores (DynamoDB, Cassandra) optimize for extreme write throughput on known access patterns.
Object storage: Cheap, durable, effectively unlimited storage for binary files — images, videos, backups, and AI model artifacts.

AI agents will generate database code readily, but they make specific, consistent choices that may not match your needs. They usually default to PostgreSQL for everything, store file bytes in the database instead of object storage, and rarely recommend NoSQL unless you explicitly ask. Understanding these three categories — and when each applies — is what allows you to set the right constraints before prompting, so the AI builds the data layer you actually need.

Relational Databases (SQL)#

A relational database organizes data into tables of rows and columns, with explicit relationships between tables enforced through foreign keys. The defining feature is ACID compliance — a set of guarantees that make relational databases the default choice for any data where correctness matters more than raw speed.

Relational Database Architecture

A relational database stores data in structured tables connected by foreign keys. The engine enforces constraints, runs complex JOIN queries across tables, and wraps multi-step writes in transactions that either fully commit or fully roll back.

Rendering diagram...

ACID: The Four Guarantees That Make SQL Trustworthy#

ACID is not a single feature — it is four separate guarantees that relational databases provide together. Understanding each one explains why SQL is the default for data where correctness matters.

Property	What It Guarantees	Example: Bank Transfer
Atomicity	All operations in a transaction succeed, or none of them do. No partial writes.	Debit from account A and credit to account B either both happen or neither does. A server crash mid-transfer leaves both accounts untouched.
Consistency	Every write moves the database from one valid state to another, honoring all constraints (foreign keys, NOT NULL, CHECK).	If account A has $100 and you try to debit $200, the CHECK constraint rejects the write. The database stays valid.
Isolation	Concurrent transactions don't interfere with each other. Each sees a consistent snapshot of the database.	Two simultaneous transfers from the same account can't both read the same balance, spend it twice, and corrupt the total. Each sees its own snapshot.
Durability	Once a transaction commits, it survives crashes. Data is written to disk before commit is confirmed.	After 'Transfer successful' appears, a power outage won't undo it. The committed change is in the write-ahead log on disk.

PostgreSQL implements isolation through MVCC (Multi-Version Concurrency Control): instead of locking rows, the database retains multiple historical versions of each row. When a transaction starts, it receives a consistent snapshot of the database as it existed at that moment, and reads from that snapshot for its entire lifetime — even while other transactions are writing new versions of the same rows. Readers never block writers, and writers never block readers. This is why PostgreSQL handles high-concurrency workloads efficiently without the lock contention you would see in a naive locking approach.

PostgreSQL vs MySQL#

Both are mature, production-proven relational databases. In practice, most new projects should start with PostgreSQL — the industry has largely converged on this recommendation. But the choice matters if you are extending an existing system.

	PostgreSQL	MySQL
ACID compliance	Full ACID by default via MVCC	Full ACID with InnoDB engine (default since 5.5)
Concurrency model	MVCC — readers never block writers	Gap locking under REPEATABLE READ — more contention under heavy concurrent writes
Index types	B-tree, GIN, GiST, BRIN, partial, expression indexes	B-tree, full-text; no partial or expression indexes
JSON support	Native JSONB — binary, indexable, queryable with operators	JSON — text-based; less efficient to query
Advanced features	Arrays, ranges, UUID, PostGIS, full-text search, pgvector (AI embeddings)	Simpler type system; fewer built-in extensions
Ecosystem	Spotify, Discord, Instagram, Uber, Reddit	WordPress, many PHP/LAMP stack apps
AI agent default	Yes — almost always recommended	Only when the existing system uses MySQL

The practical rule: Start with PostgreSQL. Its richer feature set, better MVCC implementation, and support for pgvector (which lets you run vector similarity search in the same database as your relational data, avoiding a separate vector DB) make it the correct default in 2026. Switch to MySQL only when you are extending an existing MySQL system or must integrate with a MySQL-specific ecosystem.

Indexes: How Queries Stay Fast#

An index is a separate data structure that the database maintains alongside your table to make specific queries faster. Without an index, finding a user by email means reading every row in the users table — a full table scan that takes O(n) time. With a B-tree index on email, the database jumps directly to the matching row in O(log n) time.

Without index: full table scan

Rendering diagram...

With B-tree index: direct lookup

Rendering diagram...

Indexes come at a cost: every INSERT, UPDATE, or DELETE must also update the index. Read-heavy tables benefit greatly from indexes; write-heavy tables pay a higher overhead for each index maintained.

The most common AI-generated index mistake: AI agents create foreign key columns (user_id, order_id) but frequently omit the index on them. When you run SELECT * FROM orders WHERE user_id = 42, PostgreSQL performs a full scan of the orders table unless there is an index on user_id. Always add indexes on every foreign key column and every column used in WHERE or ORDER BY clauses.

NoSQL Databases#

NoSQL databases — "Not Only SQL" — is a broad family of databases that trade the relational model's query flexibility for horizontal scalability and predictable low latency. The umbrella covers meaningfully different database types: key-value and wide-column stores (DynamoDB, Cassandra) that optimize for extreme write throughput on known access patterns, and document databases (MongoDB) that offer flexible schemas and richer queries while still scaling horizontally. What they share is the absence of fixed schemas, SQL joins, and the single-primary write bottleneck of relational databases.

The key-value and wide-column stores (DynamoDB, Cassandra) split data into partitions — logical chunks of records that live on the same physical node. A partition key determines which node holds each record. Add more nodes, get more partitions, and write capacity grows proportionally.

The critical insight that beginners often miss: NoSQL does not eliminate the need for data modeling. For key-value stores it shifts the burden from schema design to access pattern design — you must define every query before designing the model. Document databases like MongoDB relax this constraint with secondary indexes and aggregation pipelines, but schema design decisions (what to embed vs. reference) still have significant performance consequences.

NoSQL: Horizontal Scaling via Consistent Hashing

NoSQL databases distribute data across many nodes using a partition key. Each write goes to exactly the node responsible for that key's partition. Adding nodes increases capacity proportionally — without a single write bottleneck.

Rendering diagram...

DynamoDB vs Cassandra#

The two dominant NoSQL databases in production today take different approaches to the same problem. DynamoDB is fully managed by AWS — you provision capacity, and AWS handles everything else. Cassandra is open-source and self-managed (or available through managed services like DataStax Astra) — you own the cluster, but you also own the control.

	DynamoDB	Cassandra
Deployment	Fully managed (AWS only) — zero operational overhead	Self-managed cluster or managed service — significant ops burden
Data model	Key-value + document (items up to 400 KB)	Wide-column: partition key + multiple clustering columns
Partition key	Single attribute only	Can be composite (multiple columns combined)
Sort/Clustering key	Single attribute, one sort order	Multiple clustering columns, explicit sort directions per column
Consistency	Eventually consistent (default) or strongly consistent (request-level option)	Tunable per operation: ONE, QUORUM, ALL, LOCAL_QUORUM
Scaling	Automatic — AWS repartitions transparently	Manual — add nodes, data rebalances over time
Vendor lock-in	High — AWS-only, proprietary API	None — open source, runs anywhere
Write throughput	Excellent — auto-scaled	Excellent — LSM-tree (Log-Structured Merge-tree) storage buffers writes in memory and flushes to disk sequentially, making individual writes extremely fast
Best for	AWS-native teams, fast delivery, serverless architectures	Multi-cloud, extreme write scale, time-series data, when control matters
Real-world users	Netflix (A/B testing), Lyft, Airbnb, gaming companies	Instagram (feeds, fraud detection), Netflix (viewing history), IoT platforms

The partition key is the most important design decision in any NoSQL data model. It determines which physical node stores the record, and every query must include the partition key. If your query needs to filter by a field that isn't the partition key, you need a Global Secondary Index (DynamoDB) or a separate table with a different partition key (Cassandra). Getting the partition key wrong causes the hot partition problem: if the chosen field has very few unique values (for example, a status field with only "active" or "inactive"), the vast majority of writes land on the same one or two nodes while the rest sit idle — those nodes become a bottleneck, and the horizontal scaling advantage disappears entirely.

A common pattern in DynamoDB for e-commerce:

Partition key: USER#<userId>
Sort key: ORDER#<timestamp>

This co-locates all orders for a user on one partition and returns them in chronological order with a single query — no cross-partition scan needed.

MongoDB: Document Database#

MongoDB occupies a distinct position in the NoSQL landscape. While DynamoDB and Cassandra optimize for predictable key-based access at extreme write scale, MongoDB optimizes for developer flexibility and query richness — it gives you a document model with no fixed schema, secondary indexes on any field, and a rich aggregation pipeline for analytics, all without the rigid access-pattern-first discipline that DynamoDB requires. It sits between a relational database and a pure key-value store.

MongoDB stores data as BSON documents (Binary JSON) — field-value pairs with support for nested objects, arrays, and typed fields like Decimal128, ObjectId, and Date. Documents live in collections (analogous to tables) with no enforced schema: two documents in the same collection can have completely different fields. The document model maps directly to the objects in application code, eliminating the impedance mismatch between object-oriented code and relational tables.

MongoDB: Document Model with Auto-Sharding

MongoDB stores hierarchical BSON documents in collections. Related data is embedded directly in the document (avoiding joins). For horizontal scale, a mongos router distributes writes across shards using a shard key — each shard is itself a three-node replica set for high availability.

Rendering diagram...

DynamoDB vs Cassandra vs MongoDB#

The three dominant NoSQL databases cover different regions of the trade-off space. Choosing between them comes down to the shape of your data, the predictability of your access patterns, and the operational model your team can support.

	DynamoDB	Cassandra	MongoDB
Data model	Key-value + limited document (400 KB max)	Wide-column (partition key + clustering columns)	Flexible document — BSON, 16 MB max, nested objects and arrays
Schema	Schemaless per item	Schema-constrained at the column-family level	Fully schemaless per document — different documents in one collection can have different fields
Query model	Primary key + GSI only; no joins	CQL (SQL-like) but partition key required; no joins	Rich MQL — secondary indexes on any field, aggregation pipeline, $lookup, geospatial, full-text
ACID transactions	Single-item atomic; limited multi-item (up to 25 items)	No multi-row transactions; eventual consistency by default	Single-document always; multi-document ACID (v4.0+ replica sets, v4.2+ sharded)
Write throughput	Extremely high — auto-scaled, millions/sec	Extremely high — LSM-tree, optimized for sequential disk writes	High — competitive, but not the primary selling point
Horizontal scaling	Fully managed — AWS auto-repartitions	Masterless ring — add nodes, data rebalances	Auto-sharding via mongos — each shard is a replica set
Deployment	Fully managed, AWS-only	Self-managed or managed service (DataStax Astra)	Self-managed or MongoDB Atlas (AWS, GCP, Azure)
Vendor lock-in	High — AWS-only, proprietary API	None — Apache open source	Low — open source; Atlas is managed but not cloud-locked
Vector / AI search	No native support	No native support	Atlas Vector Search — embeddings stored alongside documents, hybrid semantic + metadata queries
Best for	AWS-native, serverless, key lookups at any scale, zero-ops teams	Massive write scale, time-series, multi-datacenter active-active	Flexible schemas, rapid iteration, hierarchical data, rich queries without fixed access patterns
Real-world users	Netflix (A/B testing), Lyft, Airbnb, gaming	Instagram, Netflix (viewing history), IoT platforms	eBay, Walmart, Tinder, The New York Times, Expedia, Uber

The practical rule: If your team is on AWS and all access patterns are known and key-based, DynamoDB's zero-ops model and extreme scale make it the right choice. If you need massive write scale across multiple data centers without cloud lock-in, Cassandra. If your schema is evolving, your data is hierarchical, or you need rich queries beyond key-based lookups — without moving to a full relational database — MongoDB is the strongest fit.

With SQL, you design the schema first and write queries later. With NoSQL, you invert this: design your queries first, then build the schema around them.

SQL: Schema first, queries follow

Rendering diagram...

NoSQL: Access patterns first, schema follows

Rendering diagram...

This is why AI agents generate poor NoSQL schemas without guidance: they apply relational thinking (normalize the schema first, write queries later) to a system that requires the opposite approach. Before prompting an AI agent to generate DynamoDB or Cassandra code, write out every query your application needs to run — including the fields you filter by and the sort order — and include that list explicitly in your prompt.

Object Storage#

Object storage manages data as self-contained blobs — images, videos, audio files, database backups, log archives, AI model weights, training datasets — in a flat namespace indexed by a unique key. It is the cheapest, most durable, and most scalable way to store files, and it is the correct choice for any data that does not need to be queried by its contents.

Object Storage: Direct Upload with Presigned URLs

The canonical pattern for file uploads: the client gets a time-limited presigned URL from your backend, then uploads directly to object storage. Your application server never touches the file bytes — reducing load, cost, and latency. The object is then served to end users via a CDN.

Rendering diagram...

Object Storage vs Block Storage vs File Storage#

Beginners often confuse the three storage types. They solve fundamentally different problems.

	Object Storage (S3, GCS)	Block Storage (EBS, GCE Persistent Disk)	File Storage (EFS, NFS)
Abstraction	Key-value blobs (files accessed via HTTP API)	Raw disk volumes mounted to a VM	Hierarchical filesystem mounted over network
Access method	HTTP GET/PUT requests	Attached as a disk — read/write like a local drive	Mounted as a network drive — standard file operations
Best for	Images, videos, backups, model artifacts, archives	Database volumes, OS disks, anything needing byte-level writes	Shared filesystems across multiple servers
Mutability	Immutable — replace the whole object	Fully mutable — byte-level reads and writes	Fully mutable
Scale	Effectively unlimited	Limited by volume size (up to a few tens of TB)	Limited
Cost (relative)	Cheapest — $0.02–0.023/GB/month	Most expensive — $0.08–0.10/GB/month	Mid-range
CDN integration	Native — built-in CloudFront/CDN origin support	Not applicable	Not applicable

The database BLOB antipattern: A very common AI-generated mistake is storing file bytes as BLOB (Binary Large OBject) columns in a relational database. This works for small prototypes but causes real production problems: the database swells in size (affecting backup times, restore times, and the amount of RAM needed for the buffer cache), files cannot be served directly via CDN, and every file download passes through your application server and database — making both the bottleneck. The correct pattern is always: store the file bytes in object storage, and store only the object key — a short string like uploads/users/42/avatar.jpg — in the database column. The database row stays small, and the file can be served directly from S3 through a CDN without touching your application server at all.

S3 vs Google Cloud Storage (GCS)#

Both offer the same 11-nines durability guarantee and very similar feature sets. The choice is primarily driven by your cloud provider.

	Amazon S3	Google Cloud Storage (GCS)
Standard storage price	~$0.023/GB/month (first 50 TB, then lower tiers)	~$0.020/GB/month — flat rate, slightly cheaper
Archive storage	Glacier Deep Archive: ~$0.001/GB/month (12–48 hours to retrieve — requires an explicit restore step)	Archive: ~$0.0012/GB/month (milliseconds to access — no rehydration wait, but retrieval fees apply)
CDN integration	CloudFront — very mature, global edge network	Cloud CDN — deeply integrated with GCP
AI/ML integration	Native with SageMaker, EMR, Athena	Native with Vertex AI, BigQuery — strongly preferred when using GCP AI services
Ecosystem	De facto standard — most tools default to S3-compatible APIs. Widest third-party support.	Excellent within GCP; S3-compatible API available but incomplete
Choose when	On AWS, or when using tools that assume S3	On GCP, especially when using BigQuery or Vertex AI for analytics/ML

MinIO is worth knowing about: it is an open-source S3-compatible object store that runs on your own hardware or any cloud. Used when regulatory requirements prevent using cloud-managed storage, or for local development environments that need to replicate S3 behavior without network calls.

Choosing the Right Storage: A Decision Framework#

Most production systems use multiple storage technologies together — a pattern called polyglot persistence. The decision of where each piece of data lives should be driven by the data's shape, how it is accessed, and what consistency guarantees it requires.

Storage Selection Decision Tree

Walk through this tree for each type of data in your system. Most applications end up using a combination of all three storage categories — the question is which data belongs where.

Rendering diagram...

The Most Common Architectural Pattern: SQL + Object Storage#

For applications with file uploads, the relational database and object storage always work together. The database stores metadata; object storage stores the bytes.

Antipattern: File bytes in database

Rendering diagram...

Correct pattern: Object key in database

Rendering diagram...

Prompting AI Agents for Storage#

AI agents make consistent, predictable storage decisions unless you constrain them upfront. Here is what they do by default — and what to specify instead.

What AI Does By Default	The Problem	What to Specify in Your Prompt
Recommends PostgreSQL for every data type, including file storage	Generates BLOB columns for file uploads — bloats the database and prevents CDN delivery	"Store file uploads in S3. Save only the S3 object key in PostgreSQL. Generate presigned URLs for uploads and downloads."
Never recommends NoSQL unless you ask	Misses cases where a session store, leaderboard, or high-throughput write workload clearly calls for DynamoDB or Cassandra — and misses cases where MongoDB's flexible document model is a better fit than forcing a relational schema	"Use DynamoDB for the session store. Partition key: session_id. TTL attribute for auto-expiry." Or: "Use MongoDB for the product catalog — schema varies by product category. Shard on category_id. Index on price and rating."
Opens a new database connection per request	Exhausts PostgreSQL's connection limit under load — PostgreSQL defaults to max_connections = 100, meaning the 101st simultaneous request is rejected entirely	"Use a connection pool (PgBouncer or the ORM's built-in pool). Never open a raw connection per request."
Creates foreign keys but omits indexes on them	Queries that JOIN on foreign keys perform full table scans — catastrophic on large tables	"Add a B-tree index on every foreign key column and every column used in WHERE or ORDER BY clauses."
Wraps single writes in transactions but not multi-step writes	Operations that update multiple tables (create order + decrement inventory) leave data inconsistent if one write fails	"Wrap every operation that modifies more than one table in an explicit transaction (BEGIN … COMMIT)."
Uses JSONB columns as a flexible escape hatch	JSONB queries on fields inside the document don't use standard B-tree indexes — requires GIN indexes or expression indexes for performance	"Use a proper relational column for any field you filter or sort by. Reserve JSONB for truly schemaless supplementary data."

Summary#

Storage Type	Use For	Key Constraint to Set for AI
PostgreSQL (SQL)	Structured relational data, transactions, anything where two writes must be atomic, complex ad-hoc queries	Specify indexes on all foreign keys and WHERE columns; require transactions for multi-table writes; require connection pooling
MongoDB (Document NoSQL)	Flexible / polymorphic schemas, hierarchical data, rapid iteration, rich queries without fixed access patterns, AI apps needing vector + operational data together	Decide what to embed vs. reference before prompting; specify shard key for collections above 10 GB; add indexes on all query fields; avoid unbounded arrays in documents
DynamoDB / Cassandra (Key-Value / Wide-Column NoSQL)	High-throughput key-value lookups, session stores, leaderboards, time-series data, write-heavy workloads at extreme scale	Specify all access patterns upfront; define partition key and sort key explicitly; specify consistency level; avoid monotonically increasing shard/partition keys
S3 / GCS (Object Storage)	Binary files: images, videos, backups, AI model weights, training datasets, static assets	Store only the object key in the database; use presigned URLs for uploads and downloads; serve via CDN
Polyglot persistence	Real-world systems that need multiple storage types — relational metadata + document store + file bytes + high-throughput lookups	Specify which data lives in which system before prompting; define who owns the reference between systems
AI agent bias	Defaults to PostgreSQL for everything; never recommends NoSQL or MongoDB; stores files in the database	Set storage architecture explicitly in your spec before generating any data layer code

The storage decision is one of the hardest to reverse. Migrating a production database from SQL to NoSQL, or extracting file bytes out of a BLOB column into S3, requires careful data migration, service changes, and often downtime — all after the fact. Set the storage architecture before you prompt, and the AI agent will generate implementation code that fits the system you actually need. Skip this step, and you may spend weeks undoing what the AI built in minutes.

Sources:

PreviousAPI Design

NextCaching & CDNs

DynamoDB vs MongoDB vs Cassandra for Geo-Distributed Apps — Yugabyte

Storage: Choosing Where Your Data Lives

Relational Database Architecture

NoSQL: Horizontal Scaling via Consistent Hashing

MongoDB: Document Model with Auto-Sharding

Object Storage: Direct Upload with Presigned URLs

Storage Selection Decision Tree

Arch Advisor