Storage: Choosing Where Your Data Lives

Every application stores data. The question is not whether to store it, but where, how, and why — because every storage technology makes a different set of trade-offs between consistency, scalability, query flexibility, and cost. Choosing the wrong one for the wrong workload doesn't break your app today, but it creates a ceiling on what your app can do tomorrow.

There are three fundamental categories of storage that appear in virtually every production system:

  • Relational databases (SQL): Structured data with relationships, transactions, and arbitrary queries — your source of truth.
  • NoSQL databases: A broad family — document databases (MongoDB) offer flexible schemas and rich queries; key-value and wide-column stores (DynamoDB, Cassandra) optimize for extreme write throughput on known access patterns.
  • Object storage: Cheap, durable, effectively unlimited storage for binary files — images, videos, backups, and AI model artifacts.

AI agents will generate database code readily, but they make specific, consistent choices that may not match your needs. They usually default to PostgreSQL for everything, store file bytes in the database instead of object storage, and rarely recommend NoSQL unless you explicitly ask. Understanding these three categories — and when each applies — is what allows you to set the right constraints before prompting, so the AI builds the data layer you actually need.

Relational Databases (SQL)#

A relational database organizes data into tables of rows and columns, with explicit relationships between tables enforced through foreign keys. The defining feature is ACID compliance — a set of guarantees that make relational databases the default choice for any data where correctness matters more than raw speed.

Relational Database Architecture

A relational database stores data in structured tables connected by foreign keys. The engine enforces constraints, runs complex JOIN queries across tables, and wraps multi-step writes in transactions that either fully commit or fully roll back.

Rendering diagram...

ACID: The Four Guarantees That Make SQL Trustworthy#

ACID is not a single feature — it is four separate guarantees that relational databases provide together. Understanding each one explains why SQL is the default for data where correctness matters.

PropertyWhat It GuaranteesExample: Bank Transfer
AtomicityAll operations in a transaction succeed, or none of them do. No partial writes.Debit from account A and credit to account B either both happen or neither does. A server crash mid-transfer leaves both accounts untouched.
ConsistencyEvery write moves the database from one valid state to another, honoring all constraints (foreign keys, NOT NULL, CHECK).If account A has $100 and you try to debit $200, the CHECK constraint rejects the write. The database stays valid.
IsolationConcurrent transactions don't interfere with each other. Each sees a consistent snapshot of the database.Two simultaneous transfers from the same account can't both read the same balance, spend it twice, and corrupt the total. Each sees its own snapshot.
DurabilityOnce a transaction commits, it survives crashes. Data is written to disk before commit is confirmed.After 'Transfer successful' appears, a power outage won't undo it. The committed change is in the write-ahead log on disk.

PostgreSQL implements isolation through MVCC (Multi-Version Concurrency Control): instead of locking rows, the database retains multiple historical versions of each row. When a transaction starts, it receives a consistent snapshot of the database as it existed at that moment, and reads from that snapshot for its entire lifetime — even while other transactions are writing new versions of the same rows. Readers never block writers, and writers never block readers. This is why PostgreSQL handles high-concurrency workloads efficiently without the lock contention you would see in a naive locking approach.

PostgreSQL vs MySQL#

Both are mature, production-proven relational databases. In practice, most new projects should start with PostgreSQL — the industry has largely converged on this recommendation. But the choice matters if you are extending an existing system.

PostgreSQLMySQL
ACID complianceFull ACID by default via MVCCFull ACID with InnoDB engine (default since 5.5)
Concurrency modelMVCC — readers never block writersGap locking under REPEATABLE READ — more contention under heavy concurrent writes
Index typesB-tree, GIN, GiST, BRIN, partial, expression indexesB-tree, full-text; no partial or expression indexes
JSON supportNative JSONB — binary, indexable, queryable with operatorsJSON — text-based; less efficient to query
Advanced featuresArrays, ranges, UUID, PostGIS, full-text search, pgvector (AI embeddings)Simpler type system; fewer built-in extensions
EcosystemSpotify, Discord, Instagram, Uber, RedditWordPress, many PHP/LAMP stack apps
AI agent defaultYes — almost always recommendedOnly when the existing system uses MySQL

The practical rule: Start with PostgreSQL. Its richer feature set, better MVCC implementation, and support for pgvector (which lets you run vector similarity search in the same database as your relational data, avoiding a separate vector DB) make it the correct default in 2026. Switch to MySQL only when you are extending an existing MySQL system or must integrate with a MySQL-specific ecosystem.

Indexes: How Queries Stay Fast#

An index is a separate data structure that the database maintains alongside your table to make specific queries faster. Without an index, finding a user by email means reading every row in the users table — a full table scan that takes O(n) time. With a B-tree index on email, the database jumps directly to the matching row in O(log n) time.

Without index: full table scan
Rendering diagram...
With B-tree index: direct lookup
Rendering diagram...

Indexes come at a cost: every INSERT, UPDATE, or DELETE must also update the index. Read-heavy tables benefit greatly from indexes; write-heavy tables pay a higher overhead for each index maintained.

The most common AI-generated index mistake: AI agents create foreign key columns (user_id, order_id) but frequently omit the index on them. When you run SELECT * FROM orders WHERE user_id = 42, PostgreSQL performs a full scan of the orders table unless there is an index on user_id. Always add indexes on every foreign key column and every column used in WHERE or ORDER BY clauses.

NoSQL Databases#

NoSQL databases — "Not Only SQL" — is a broad family of databases that trade the relational model's query flexibility for horizontal scalability and predictable low latency. The umbrella covers meaningfully different database types: key-value and wide-column stores (DynamoDB, Cassandra) that optimize for extreme write throughput on known access patterns, and document databases (MongoDB) that offer flexible schemas and richer queries while still scaling horizontally. What they share is the absence of fixed schemas, SQL joins, and the single-primary write bottleneck of relational databases.

The key-value and wide-column stores (DynamoDB, Cassandra) split data into partitions — logical chunks of records that live on the same physical node. A partition key determines which node holds each record. Add more nodes, get more partitions, and write capacity grows proportionally.

The critical insight that beginners often miss: NoSQL does not eliminate the need for data modeling. For key-value stores it shifts the burden from schema design to access pattern design — you must define every query before designing the model. Document databases like MongoDB relax this constraint with secondary indexes and aggregation pipelines, but schema design decisions (what to embed vs. reference) still have significant performance consequences.

NoSQL: Horizontal Scaling via Consistent Hashing

NoSQL databases distribute data across many nodes using a partition key. Each write goes to exactly the node responsible for that key's partition. Adding nodes increases capacity proportionally — without a single write bottleneck.

Rendering diagram...

DynamoDB vs Cassandra#

The two dominant NoSQL databases in production today take different approaches to the same problem. DynamoDB is fully managed by AWS — you provision capacity, and AWS handles everything else. Cassandra is open-source and self-managed (or available through managed services like DataStax Astra) — you own the cluster, but you also own the control.

DynamoDBCassandra
DeploymentFully managed (AWS only) — zero operational overheadSelf-managed cluster or managed service — significant ops burden
Data modelKey-value + document (items up to 400 KB)Wide-column: partition key + multiple clustering columns
Partition keySingle attribute onlyCan be composite (multiple columns combined)
Sort/Clustering keySingle attribute, one sort orderMultiple clustering columns, explicit sort directions per column
ConsistencyEventually consistent (default) or strongly consistent (request-level option)Tunable per operation: ONE, QUORUM, ALL, LOCAL_QUORUM
ScalingAutomatic — AWS repartitions transparentlyManual — add nodes, data rebalances over time
Vendor lock-inHigh — AWS-only, proprietary APINone — open source, runs anywhere
Write throughputExcellent — auto-scaledExcellent — LSM-tree (Log-Structured Merge-tree) storage buffers writes in memory and flushes to disk sequentially, making individual writes extremely fast
Best forAWS-native teams, fast delivery, serverless architecturesMulti-cloud, extreme write scale, time-series data, when control matters
Real-world usersNetflix (A/B testing), Lyft, Airbnb, gaming companiesInstagram (feeds, fraud detection), Netflix (viewing history), IoT platforms

The partition key is the most important design decision in any NoSQL data model. It determines which physical node stores the record, and every query must include the partition key. If your query needs to filter by a field that isn't the partition key, you need a Global Secondary Index (DynamoDB) or a separate table with a different partition key (Cassandra). Getting the partition key wrong causes the hot partition problem: if the chosen field has very few unique values (for example, a status field with only "active" or "inactive"), the vast majority of writes land on the same one or two nodes while the rest sit idle — those nodes become a bottleneck, and the horizontal scaling advantage disappears entirely.

A common pattern in DynamoDB for e-commerce:

  • Partition key: USER#<userId>
  • Sort key: ORDER#<timestamp>

This co-locates all orders for a user on one partition and returns them in chronological order with a single query — no cross-partition scan needed.

MongoDB: Document Database#

MongoDB occupies a distinct position in the NoSQL landscape. While DynamoDB and Cassandra optimize for predictable key-based access at extreme write scale, MongoDB optimizes for developer flexibility and query richness — it gives you a document model with no fixed schema, secondary indexes on any field, and a rich aggregation pipeline for analytics, all without the rigid access-pattern-first discipline that DynamoDB requires. It sits between a relational database and a pure key-value store.

MongoDB stores data as BSON documents (Binary JSON) — field-value pairs with support for nested objects, arrays, and typed fields like Decimal128, ObjectId, and Date. Documents live in collections (analogous to tables) with no enforced schema: two documents in the same collection can have completely different fields. The document model maps directly to the objects in application code, eliminating the impedance mismatch between object-oriented code and relational tables.

MongoDB: Document Model with Auto-Sharding

MongoDB stores hierarchical BSON documents in collections. Related data is embedded directly in the document (avoiding joins). For horizontal scale, a mongos router distributes writes across shards using a shard key — each shard is itself a three-node replica set for high availability.

Rendering diagram...

DynamoDB vs Cassandra vs MongoDB#

The three dominant NoSQL databases cover different regions of the trade-off space. Choosing between them comes down to the shape of your data, the predictability of your access patterns, and the operational model your team can support.

DynamoDBCassandraMongoDB
Data modelKey-value + limited document (400 KB max)Wide-column (partition key + clustering columns)Flexible document — BSON, 16 MB max, nested objects and arrays
SchemaSchemaless per itemSchema-constrained at the column-family levelFully schemaless per document — different documents in one collection can have different fields
Query modelPrimary key + GSI only; no joinsCQL (SQL-like) but partition key required; no joinsRich MQL — secondary indexes on any field, aggregation pipeline, $lookup, geospatial, full-text
ACID transactionsSingle-item atomic; limited multi-item (up to 25 items)No multi-row transactions; eventual consistency by defaultSingle-document always; multi-document ACID (v4.0+ replica sets, v4.2+ sharded)
Write throughputExtremely high — auto-scaled, millions/secExtremely high — LSM-tree, optimized for sequential disk writesHigh — competitive, but not the primary selling point
Horizontal scalingFully managed — AWS auto-repartitionsMasterless ring — add nodes, data rebalancesAuto-sharding via mongos — each shard is a replica set
DeploymentFully managed, AWS-onlySelf-managed or managed service (DataStax Astra)Self-managed or MongoDB Atlas (AWS, GCP, Azure)
Vendor lock-inHigh — AWS-only, proprietary APINone — Apache open sourceLow — open source; Atlas is managed but not cloud-locked
Vector / AI searchNo native supportNo native supportAtlas Vector Search — embeddings stored alongside documents, hybrid semantic + metadata queries
Best forAWS-native, serverless, key lookups at any scale, zero-ops teamsMassive write scale, time-series, multi-datacenter active-activeFlexible schemas, rapid iteration, hierarchical data, rich queries without fixed access patterns
Real-world usersNetflix (A/B testing), Lyft, Airbnb, gamingInstagram, Netflix (viewing history), IoT platformseBay, Walmart, Tinder, The New York Times, Expedia, Uber

The practical rule: If your team is on AWS and all access patterns are known and key-based, DynamoDB's zero-ops model and extreme scale make it the right choice. If you need massive write scale across multiple data centers without cloud lock-in, Cassandra. If your schema is evolving, your data is hierarchical, or you need rich queries beyond key-based lookups — without moving to a full relational database — MongoDB is the strongest fit.

With SQL, you design the schema first and write queries later. With NoSQL, you invert this: design your queries first, then build the schema around them.

SQL: Schema first, queries follow
Rendering diagram...
NoSQL: Access patterns first, schema follows
Rendering diagram...

This is why AI agents generate poor NoSQL schemas without guidance: they apply relational thinking (normalize the schema first, write queries later) to a system that requires the opposite approach. Before prompting an AI agent to generate DynamoDB or Cassandra code, write out every query your application needs to run — including the fields you filter by and the sort order — and include that list explicitly in your prompt.

Object Storage#

Object storage manages data as self-contained blobs — images, videos, audio files, database backups, log archives, AI model weights, training datasets — in a flat namespace indexed by a unique key. It is the cheapest, most durable, and most scalable way to store files, and it is the correct choice for any data that does not need to be queried by its contents.

Object Storage: Direct Upload with Presigned URLs

The canonical pattern for file uploads: the client gets a time-limited presigned URL from your backend, then uploads directly to object storage. Your application server never touches the file bytes — reducing load, cost, and latency. The object is then served to end users via a CDN.

Rendering diagram...

Object Storage vs Block Storage vs File Storage#

Beginners often confuse the three storage types. They solve fundamentally different problems.

Object Storage (S3, GCS)Block Storage (EBS, GCE Persistent Disk)File Storage (EFS, NFS)
AbstractionKey-value blobs (files accessed via HTTP API)Raw disk volumes mounted to a VMHierarchical filesystem mounted over network
Access methodHTTP GET/PUT requestsAttached as a disk — read/write like a local driveMounted as a network drive — standard file operations
Best forImages, videos, backups, model artifacts, archivesDatabase volumes, OS disks, anything needing byte-level writesShared filesystems across multiple servers
MutabilityImmutable — replace the whole objectFully mutable — byte-level reads and writesFully mutable
ScaleEffectively unlimitedLimited by volume size (up to a few tens of TB)Limited
Cost (relative)Cheapest — $0.02–0.023/GB/monthMost expensive — $0.08–0.10/GB/monthMid-range
CDN integrationNative — built-in CloudFront/CDN origin supportNot applicableNot applicable

The database BLOB antipattern: A very common AI-generated mistake is storing file bytes as BLOB (Binary Large OBject) columns in a relational database. This works for small prototypes but causes real production problems: the database swells in size (affecting backup times, restore times, and the amount of RAM needed for the buffer cache), files cannot be served directly via CDN, and every file download passes through your application server and database — making both the bottleneck. The correct pattern is always: store the file bytes in object storage, and store only the object key — a short string like uploads/users/42/avatar.jpg — in the database column. The database row stays small, and the file can be served directly from S3 through a CDN without touching your application server at all.

S3 vs Google Cloud Storage (GCS)#

Both offer the same 11-nines durability guarantee and very similar feature sets. The choice is primarily driven by your cloud provider.

Amazon S3Google Cloud Storage (GCS)
Standard storage price~$0.023/GB/month (first 50 TB, then lower tiers)~$0.020/GB/month — flat rate, slightly cheaper
Archive storageGlacier Deep Archive: ~$0.001/GB/month (12–48 hours to retrieve — requires an explicit restore step)Archive: ~$0.0012/GB/month (milliseconds to access — no rehydration wait, but retrieval fees apply)
CDN integrationCloudFront — very mature, global edge networkCloud CDN — deeply integrated with GCP
AI/ML integrationNative with SageMaker, EMR, AthenaNative with Vertex AI, BigQuery — strongly preferred when using GCP AI services
EcosystemDe facto standard — most tools default to S3-compatible APIs. Widest third-party support.Excellent within GCP; S3-compatible API available but incomplete
Choose whenOn AWS, or when using tools that assume S3On GCP, especially when using BigQuery or Vertex AI for analytics/ML

MinIO is worth knowing about: it is an open-source S3-compatible object store that runs on your own hardware or any cloud. Used when regulatory requirements prevent using cloud-managed storage, or for local development environments that need to replicate S3 behavior without network calls.

Choosing the Right Storage: A Decision Framework#

Most production systems use multiple storage technologies together — a pattern called polyglot persistence. The decision of where each piece of data lives should be driven by the data's shape, how it is accessed, and what consistency guarantees it requires.

Storage Selection Decision Tree

Walk through this tree for each type of data in your system. Most applications end up using a combination of all three storage categories — the question is which data belongs where.

Rendering diagram...

The Most Common Architectural Pattern: SQL + Object Storage#

For applications with file uploads, the relational database and object storage always work together. The database stores metadata; object storage stores the bytes.

Antipattern: File bytes in database
Rendering diagram...
Correct pattern: Object key in database
Rendering diagram...

Prompting AI Agents for Storage#

AI agents make consistent, predictable storage decisions unless you constrain them upfront. Here is what they do by default — and what to specify instead.

What AI Does By DefaultThe ProblemWhat to Specify in Your Prompt
Recommends PostgreSQL for every data type, including file storageGenerates BLOB columns for file uploads — bloats the database and prevents CDN delivery"Store file uploads in S3. Save only the S3 object key in PostgreSQL. Generate presigned URLs for uploads and downloads."
Never recommends NoSQL unless you askMisses cases where a session store, leaderboard, or high-throughput write workload clearly calls for DynamoDB or Cassandra — and misses cases where MongoDB's flexible document model is a better fit than forcing a relational schema"Use DynamoDB for the session store. Partition key: session_id. TTL attribute for auto-expiry." Or: "Use MongoDB for the product catalog — schema varies by product category. Shard on category_id. Index on price and rating."
Opens a new database connection per requestExhausts PostgreSQL's connection limit under load — PostgreSQL defaults to max_connections = 100, meaning the 101st simultaneous request is rejected entirely"Use a connection pool (PgBouncer or the ORM's built-in pool). Never open a raw connection per request."
Creates foreign keys but omits indexes on themQueries that JOIN on foreign keys perform full table scans — catastrophic on large tables"Add a B-tree index on every foreign key column and every column used in WHERE or ORDER BY clauses."
Wraps single writes in transactions but not multi-step writesOperations that update multiple tables (create order + decrement inventory) leave data inconsistent if one write fails"Wrap every operation that modifies more than one table in an explicit transaction (BEGIN … COMMIT)."
Uses JSONB columns as a flexible escape hatchJSONB queries on fields inside the document don't use standard B-tree indexes — requires GIN indexes or expression indexes for performance"Use a proper relational column for any field you filter or sort by. Reserve JSONB for truly schemaless supplementary data."

Summary#

Storage TypeUse ForKey Constraint to Set for AI
PostgreSQL (SQL)Structured relational data, transactions, anything where two writes must be atomic, complex ad-hoc queriesSpecify indexes on all foreign keys and WHERE columns; require transactions for multi-table writes; require connection pooling
MongoDB (Document NoSQL)Flexible / polymorphic schemas, hierarchical data, rapid iteration, rich queries without fixed access patterns, AI apps needing vector + operational data togetherDecide what to embed vs. reference before prompting; specify shard key for collections above 10 GB; add indexes on all query fields; avoid unbounded arrays in documents
DynamoDB / Cassandra (Key-Value / Wide-Column NoSQL)High-throughput key-value lookups, session stores, leaderboards, time-series data, write-heavy workloads at extreme scaleSpecify all access patterns upfront; define partition key and sort key explicitly; specify consistency level; avoid monotonically increasing shard/partition keys
S3 / GCS (Object Storage)Binary files: images, videos, backups, AI model weights, training datasets, static assetsStore only the object key in the database; use presigned URLs for uploads and downloads; serve via CDN
Polyglot persistenceReal-world systems that need multiple storage types — relational metadata + document store + file bytes + high-throughput lookupsSpecify which data lives in which system before prompting; define who owns the reference between systems
AI agent biasDefaults to PostgreSQL for everything; never recommends NoSQL or MongoDB; stores files in the databaseSet storage architecture explicitly in your spec before generating any data layer code

The storage decision is one of the hardest to reverse. Migrating a production database from SQL to NoSQL, or extracting file bytes out of a BLOB column into S3, requires careful data migration, service changes, and often downtime — all after the fact. Set the storage architecture before you prompt, and the AI agent will generate implementation code that fits the system you actually need. Skip this step, and you may spend weeks undoing what the AI built in minutes.

Sources: