Storage: Choosing Where Your Data Lives
Every application stores data. The question is not whether to store it, but where, how, and why — because every storage technology makes a different set of trade-offs between consistency, scalability, query flexibility, and cost. Choosing the wrong one for the wrong workload doesn't break your app today, but it creates a ceiling on what your app can do tomorrow.
There are three fundamental categories of storage that appear in virtually every production system:
- Relational databases (SQL): Structured data with relationships, transactions, and arbitrary queries — your source of truth.
- NoSQL databases: A broad family — document databases (MongoDB) offer flexible schemas and rich queries; key-value and wide-column stores (DynamoDB, Cassandra) optimize for extreme write throughput on known access patterns.
- Object storage: Cheap, durable, effectively unlimited storage for binary files — images, videos, backups, and AI model artifacts.
AI agents will generate database code readily, but they make specific, consistent choices that may not match your needs. They usually default to PostgreSQL for everything, store file bytes in the database instead of object storage, and rarely recommend NoSQL unless you explicitly ask. Understanding these three categories — and when each applies — is what allows you to set the right constraints before prompting, so the AI builds the data layer you actually need.
Relational Databases (SQL)#
A relational database organizes data into tables of rows and columns, with explicit relationships between tables enforced through foreign keys. The defining feature is ACID compliance — a set of guarantees that make relational databases the default choice for any data where correctness matters more than raw speed.
Relational Database Architecture
A relational database stores data in structured tables connected by foreign keys. The engine enforces constraints, runs complex JOIN queries across tables, and wraps multi-step writes in transactions that either fully commit or fully roll back.
ACID: The Four Guarantees That Make SQL Trustworthy#
ACID is not a single feature — it is four separate guarantees that relational databases provide together. Understanding each one explains why SQL is the default for data where correctness matters.
| Property | What It Guarantees | Example: Bank Transfer |
|---|---|---|
| Atomicity | All operations in a transaction succeed, or none of them do. No partial writes. | Debit from account A and credit to account B either both happen or neither does. A server crash mid-transfer leaves both accounts untouched. |
| Consistency | Every write moves the database from one valid state to another, honoring all constraints (foreign keys, NOT NULL, CHECK). | If account A has $100 and you try to debit $200, the CHECK constraint rejects the write. The database stays valid. |
| Isolation | Concurrent transactions don't interfere with each other. Each sees a consistent snapshot of the database. | Two simultaneous transfers from the same account can't both read the same balance, spend it twice, and corrupt the total. Each sees its own snapshot. |
| Durability | Once a transaction commits, it survives crashes. Data is written to disk before commit is confirmed. | After 'Transfer successful' appears, a power outage won't undo it. The committed change is in the write-ahead log on disk. |
PostgreSQL implements isolation through MVCC (Multi-Version Concurrency Control): instead of locking rows, the database retains multiple historical versions of each row. When a transaction starts, it receives a consistent snapshot of the database as it existed at that moment, and reads from that snapshot for its entire lifetime — even while other transactions are writing new versions of the same rows. Readers never block writers, and writers never block readers. This is why PostgreSQL handles high-concurrency workloads efficiently without the lock contention you would see in a naive locking approach.
PostgreSQL vs MySQL#
Both are mature, production-proven relational databases. In practice, most new projects should start with PostgreSQL — the industry has largely converged on this recommendation. But the choice matters if you are extending an existing system.
| PostgreSQL | MySQL | |
|---|---|---|
| ACID compliance | Full ACID by default via MVCC | Full ACID with InnoDB engine (default since 5.5) |
| Concurrency model | MVCC — readers never block writers | Gap locking under REPEATABLE READ — more contention under heavy concurrent writes |
| Index types | B-tree, GIN, GiST, BRIN, partial, expression indexes | B-tree, full-text; no partial or expression indexes |
| JSON support | Native JSONB — binary, indexable, queryable with operators | JSON — text-based; less efficient to query |
| Advanced features | Arrays, ranges, UUID, PostGIS, full-text search, pgvector (AI embeddings) | Simpler type system; fewer built-in extensions |
| Ecosystem | Spotify, Discord, Instagram, Uber, Reddit | WordPress, many PHP/LAMP stack apps |
| AI agent default | Yes — almost always recommended | Only when the existing system uses MySQL |
The practical rule: Start with PostgreSQL. Its richer feature set, better MVCC implementation, and support for pgvector (which lets you run vector similarity search in the same database as your relational data, avoiding a separate vector DB) make it the correct default in 2026. Switch to MySQL only when you are extending an existing MySQL system or must integrate with a MySQL-specific ecosystem.
Indexes: How Queries Stay Fast#
An index is a separate data structure that the database maintains alongside your table to make specific queries faster. Without an index, finding a user by email means reading every row in the users table — a full table scan that takes O(n) time. With a B-tree index on email, the database jumps directly to the matching row in O(log n) time.
Indexes come at a cost: every INSERT, UPDATE, or DELETE must also update the index. Read-heavy tables benefit greatly from indexes; write-heavy tables pay a higher overhead for each index maintained.
The most common AI-generated index mistake: AI agents create foreign key columns (user_id, order_id) but frequently omit the index on them. When you run SELECT * FROM orders WHERE user_id = 42, PostgreSQL performs a full scan of the orders table unless there is an index on user_id. Always add indexes on every foreign key column and every column used in WHERE or ORDER BY clauses.
NoSQL Databases#
NoSQL databases — "Not Only SQL" — is a broad family of databases that trade the relational model's query flexibility for horizontal scalability and predictable low latency. The umbrella covers meaningfully different database types: key-value and wide-column stores (DynamoDB, Cassandra) that optimize for extreme write throughput on known access patterns, and document databases (MongoDB) that offer flexible schemas and richer queries while still scaling horizontally. What they share is the absence of fixed schemas, SQL joins, and the single-primary write bottleneck of relational databases.
The key-value and wide-column stores (DynamoDB, Cassandra) split data into partitions — logical chunks of records that live on the same physical node. A partition key determines which node holds each record. Add more nodes, get more partitions, and write capacity grows proportionally.
The critical insight that beginners often miss: NoSQL does not eliminate the need for data modeling. For key-value stores it shifts the burden from schema design to access pattern design — you must define every query before designing the model. Document databases like MongoDB relax this constraint with secondary indexes and aggregation pipelines, but schema design decisions (what to embed vs. reference) still have significant performance consequences.
NoSQL: Horizontal Scaling via Consistent Hashing
NoSQL databases distribute data across many nodes using a partition key. Each write goes to exactly the node responsible for that key's partition. Adding nodes increases capacity proportionally — without a single write bottleneck.
DynamoDB vs Cassandra#
The two dominant NoSQL databases in production today take different approaches to the same problem. DynamoDB is fully managed by AWS — you provision capacity, and AWS handles everything else. Cassandra is open-source and self-managed (or available through managed services like DataStax Astra) — you own the cluster, but you also own the control.
| DynamoDB | Cassandra | |
|---|---|---|
| Deployment | Fully managed (AWS only) — zero operational overhead | Self-managed cluster or managed service — significant ops burden |
| Data model | Key-value + document (items up to 400 KB) | Wide-column: partition key + multiple clustering columns |
| Partition key | Single attribute only | Can be composite (multiple columns combined) |
| Sort/Clustering key | Single attribute, one sort order | Multiple clustering columns, explicit sort directions per column |
| Consistency | Eventually consistent (default) or strongly consistent (request-level option) | Tunable per operation: ONE, QUORUM, ALL, LOCAL_QUORUM |
| Scaling | Automatic — AWS repartitions transparently | Manual — add nodes, data rebalances over time |
| Vendor lock-in | High — AWS-only, proprietary API | None — open source, runs anywhere |
| Write throughput | Excellent — auto-scaled | Excellent — LSM-tree (Log-Structured Merge-tree) storage buffers writes in memory and flushes to disk sequentially, making individual writes extremely fast |
| Best for | AWS-native teams, fast delivery, serverless architectures | Multi-cloud, extreme write scale, time-series data, when control matters |
| Real-world users | Netflix (A/B testing), Lyft, Airbnb, gaming companies | Instagram (feeds, fraud detection), Netflix (viewing history), IoT platforms |
The partition key is the most important design decision in any NoSQL data model. It determines which physical node stores the record, and every query must include the partition key. If your query needs to filter by a field that isn't the partition key, you need a Global Secondary Index (DynamoDB) or a separate table with a different partition key (Cassandra). Getting the partition key wrong causes the hot partition problem: if the chosen field has very few unique values (for example, a status field with only "active" or "inactive"), the vast majority of writes land on the same one or two nodes while the rest sit idle — those nodes become a bottleneck, and the horizontal scaling advantage disappears entirely.
A common pattern in DynamoDB for e-commerce:
- Partition key:
USER#<userId> - Sort key:
ORDER#<timestamp>
This co-locates all orders for a user on one partition and returns them in chronological order with a single query — no cross-partition scan needed.
MongoDB: Document Database#
MongoDB occupies a distinct position in the NoSQL landscape. While DynamoDB and Cassandra optimize for predictable key-based access at extreme write scale, MongoDB optimizes for developer flexibility and query richness — it gives you a document model with no fixed schema, secondary indexes on any field, and a rich aggregation pipeline for analytics, all without the rigid access-pattern-first discipline that DynamoDB requires. It sits between a relational database and a pure key-value store.
MongoDB stores data as BSON documents (Binary JSON) — field-value pairs with support for nested objects, arrays, and typed fields like Decimal128, ObjectId, and Date. Documents live in collections (analogous to tables) with no enforced schema: two documents in the same collection can have completely different fields. The document model maps directly to the objects in application code, eliminating the impedance mismatch between object-oriented code and relational tables.
MongoDB: Document Model with Auto-Sharding
MongoDB stores hierarchical BSON documents in collections. Related data is embedded directly in the document (avoiding joins). For horizontal scale, a mongos router distributes writes across shards using a shard key — each shard is itself a three-node replica set for high availability.
DynamoDB vs Cassandra vs MongoDB#
The three dominant NoSQL databases cover different regions of the trade-off space. Choosing between them comes down to the shape of your data, the predictability of your access patterns, and the operational model your team can support.
| DynamoDB | Cassandra | MongoDB | |
|---|---|---|---|
| Data model | Key-value + limited document (400 KB max) | Wide-column (partition key + clustering columns) | Flexible document — BSON, 16 MB max, nested objects and arrays |
| Schema | Schemaless per item | Schema-constrained at the column-family level | Fully schemaless per document — different documents in one collection can have different fields |
| Query model | Primary key + GSI only; no joins | CQL (SQL-like) but partition key required; no joins | Rich MQL — secondary indexes on any field, aggregation pipeline, $lookup, geospatial, full-text |
| ACID transactions | Single-item atomic; limited multi-item (up to 25 items) | No multi-row transactions; eventual consistency by default | Single-document always; multi-document ACID (v4.0+ replica sets, v4.2+ sharded) |
| Write throughput | Extremely high — auto-scaled, millions/sec | Extremely high — LSM-tree, optimized for sequential disk writes | High — competitive, but not the primary selling point |
| Horizontal scaling | Fully managed — AWS auto-repartitions | Masterless ring — add nodes, data rebalances | Auto-sharding via mongos — each shard is a replica set |
| Deployment | Fully managed, AWS-only | Self-managed or managed service (DataStax Astra) | Self-managed or MongoDB Atlas (AWS, GCP, Azure) |
| Vendor lock-in | High — AWS-only, proprietary API | None — Apache open source | Low — open source; Atlas is managed but not cloud-locked |
| Vector / AI search | No native support | No native support | Atlas Vector Search — embeddings stored alongside documents, hybrid semantic + metadata queries |
| Best for | AWS-native, serverless, key lookups at any scale, zero-ops teams | Massive write scale, time-series, multi-datacenter active-active | Flexible schemas, rapid iteration, hierarchical data, rich queries without fixed access patterns |
| Real-world users | Netflix (A/B testing), Lyft, Airbnb, gaming | Instagram, Netflix (viewing history), IoT platforms | eBay, Walmart, Tinder, The New York Times, Expedia, Uber |
The practical rule: If your team is on AWS and all access patterns are known and key-based, DynamoDB's zero-ops model and extreme scale make it the right choice. If you need massive write scale across multiple data centers without cloud lock-in, Cassandra. If your schema is evolving, your data is hierarchical, or you need rich queries beyond key-based lookups — without moving to a full relational database — MongoDB is the strongest fit.
With SQL, you design the schema first and write queries later. With NoSQL, you invert this: design your queries first, then build the schema around them.
This is why AI agents generate poor NoSQL schemas without guidance: they apply relational thinking (normalize the schema first, write queries later) to a system that requires the opposite approach. Before prompting an AI agent to generate DynamoDB or Cassandra code, write out every query your application needs to run — including the fields you filter by and the sort order — and include that list explicitly in your prompt.
Object Storage#
Object storage manages data as self-contained blobs — images, videos, audio files, database backups, log archives, AI model weights, training datasets — in a flat namespace indexed by a unique key. It is the cheapest, most durable, and most scalable way to store files, and it is the correct choice for any data that does not need to be queried by its contents.
Object Storage: Direct Upload with Presigned URLs
The canonical pattern for file uploads: the client gets a time-limited presigned URL from your backend, then uploads directly to object storage. Your application server never touches the file bytes — reducing load, cost, and latency. The object is then served to end users via a CDN.
Object Storage vs Block Storage vs File Storage#
Beginners often confuse the three storage types. They solve fundamentally different problems.
| Object Storage (S3, GCS) | Block Storage (EBS, GCE Persistent Disk) | File Storage (EFS, NFS) | |
|---|---|---|---|
| Abstraction | Key-value blobs (files accessed via HTTP API) | Raw disk volumes mounted to a VM | Hierarchical filesystem mounted over network |
| Access method | HTTP GET/PUT requests | Attached as a disk — read/write like a local drive | Mounted as a network drive — standard file operations |
| Best for | Images, videos, backups, model artifacts, archives | Database volumes, OS disks, anything needing byte-level writes | Shared filesystems across multiple servers |
| Mutability | Immutable — replace the whole object | Fully mutable — byte-level reads and writes | Fully mutable |
| Scale | Effectively unlimited | Limited by volume size (up to a few tens of TB) | Limited |
| Cost (relative) | Cheapest — $0.02–0.023/GB/month | Most expensive — $0.08–0.10/GB/month | Mid-range |
| CDN integration | Native — built-in CloudFront/CDN origin support | Not applicable | Not applicable |
The database BLOB antipattern: A very common AI-generated mistake is storing file bytes as BLOB (Binary Large OBject) columns in a relational database. This works for small prototypes but causes real production problems: the database swells in size (affecting backup times, restore times, and the amount of RAM needed for the buffer cache), files cannot be served directly via CDN, and every file download passes through your application server and database — making both the bottleneck. The correct pattern is always: store the file bytes in object storage, and store only the object key — a short string like uploads/users/42/avatar.jpg — in the database column. The database row stays small, and the file can be served directly from S3 through a CDN without touching your application server at all.
S3 vs Google Cloud Storage (GCS)#
Both offer the same 11-nines durability guarantee and very similar feature sets. The choice is primarily driven by your cloud provider.
| Amazon S3 | Google Cloud Storage (GCS) | |
|---|---|---|
| Standard storage price | ~$0.023/GB/month (first 50 TB, then lower tiers) | ~$0.020/GB/month — flat rate, slightly cheaper |
| Archive storage | Glacier Deep Archive: ~$0.001/GB/month (12–48 hours to retrieve — requires an explicit restore step) | Archive: ~$0.0012/GB/month (milliseconds to access — no rehydration wait, but retrieval fees apply) |
| CDN integration | CloudFront — very mature, global edge network | Cloud CDN — deeply integrated with GCP |
| AI/ML integration | Native with SageMaker, EMR, Athena | Native with Vertex AI, BigQuery — strongly preferred when using GCP AI services |
| Ecosystem | De facto standard — most tools default to S3-compatible APIs. Widest third-party support. | Excellent within GCP; S3-compatible API available but incomplete |
| Choose when | On AWS, or when using tools that assume S3 | On GCP, especially when using BigQuery or Vertex AI for analytics/ML |
MinIO is worth knowing about: it is an open-source S3-compatible object store that runs on your own hardware or any cloud. Used when regulatory requirements prevent using cloud-managed storage, or for local development environments that need to replicate S3 behavior without network calls.
Choosing the Right Storage: A Decision Framework#
Most production systems use multiple storage technologies together — a pattern called polyglot persistence. The decision of where each piece of data lives should be driven by the data's shape, how it is accessed, and what consistency guarantees it requires.
Storage Selection Decision Tree
Walk through this tree for each type of data in your system. Most applications end up using a combination of all three storage categories — the question is which data belongs where.
The Most Common Architectural Pattern: SQL + Object Storage#
For applications with file uploads, the relational database and object storage always work together. The database stores metadata; object storage stores the bytes.
Prompting AI Agents for Storage#
AI agents make consistent, predictable storage decisions unless you constrain them upfront. Here is what they do by default — and what to specify instead.
| What AI Does By Default | The Problem | What to Specify in Your Prompt |
|---|---|---|
| Recommends PostgreSQL for every data type, including file storage | Generates BLOB columns for file uploads — bloats the database and prevents CDN delivery | "Store file uploads in S3. Save only the S3 object key in PostgreSQL. Generate presigned URLs for uploads and downloads." |
| Never recommends NoSQL unless you ask | Misses cases where a session store, leaderboard, or high-throughput write workload clearly calls for DynamoDB or Cassandra — and misses cases where MongoDB's flexible document model is a better fit than forcing a relational schema | "Use DynamoDB for the session store. Partition key: session_id. TTL attribute for auto-expiry." Or: "Use MongoDB for the product catalog — schema varies by product category. Shard on category_id. Index on price and rating." |
| Opens a new database connection per request | Exhausts PostgreSQL's connection limit under load — PostgreSQL defaults to max_connections = 100, meaning the 101st simultaneous request is rejected entirely | "Use a connection pool (PgBouncer or the ORM's built-in pool). Never open a raw connection per request." |
| Creates foreign keys but omits indexes on them | Queries that JOIN on foreign keys perform full table scans — catastrophic on large tables | "Add a B-tree index on every foreign key column and every column used in WHERE or ORDER BY clauses." |
| Wraps single writes in transactions but not multi-step writes | Operations that update multiple tables (create order + decrement inventory) leave data inconsistent if one write fails | "Wrap every operation that modifies more than one table in an explicit transaction (BEGIN … COMMIT)." |
| Uses JSONB columns as a flexible escape hatch | JSONB queries on fields inside the document don't use standard B-tree indexes — requires GIN indexes or expression indexes for performance | "Use a proper relational column for any field you filter or sort by. Reserve JSONB for truly schemaless supplementary data." |
Summary#
| Storage Type | Use For | Key Constraint to Set for AI |
|---|---|---|
| PostgreSQL (SQL) | Structured relational data, transactions, anything where two writes must be atomic, complex ad-hoc queries | Specify indexes on all foreign keys and WHERE columns; require transactions for multi-table writes; require connection pooling |
| MongoDB (Document NoSQL) | Flexible / polymorphic schemas, hierarchical data, rapid iteration, rich queries without fixed access patterns, AI apps needing vector + operational data together | Decide what to embed vs. reference before prompting; specify shard key for collections above 10 GB; add indexes on all query fields; avoid unbounded arrays in documents |
| DynamoDB / Cassandra (Key-Value / Wide-Column NoSQL) | High-throughput key-value lookups, session stores, leaderboards, time-series data, write-heavy workloads at extreme scale | Specify all access patterns upfront; define partition key and sort key explicitly; specify consistency level; avoid monotonically increasing shard/partition keys |
| S3 / GCS (Object Storage) | Binary files: images, videos, backups, AI model weights, training datasets, static assets | Store only the object key in the database; use presigned URLs for uploads and downloads; serve via CDN |
| Polyglot persistence | Real-world systems that need multiple storage types — relational metadata + document store + file bytes + high-throughput lookups | Specify which data lives in which system before prompting; define who owns the reference between systems |
| AI agent bias | Defaults to PostgreSQL for everything; never recommends NoSQL or MongoDB; stores files in the database | Set storage architecture explicitly in your spec before generating any data layer code |
The storage decision is one of the hardest to reverse. Migrating a production database from SQL to NoSQL, or extracting file bytes out of a BLOB column into S3, requires careful data migration, service changes, and often downtime — all after the fact. Set the storage architecture before you prompt, and the AI agent will generate implementation code that fits the system you actually need. Skip this step, and you may spend weeks undoing what the AI built in minutes.
Sources:
- PostgreSQL vs MySQL 2025: Complete Database Comparison — Liquibase
- PostgreSQL vs MySQL: Which Database Should You Choose in 2026? — Bytebase
- Cassandra vs DynamoDB 2026: Complete Guide — Knowi
- DynamoDB vs Cassandra — Bytebase
- The What, Why, and When of Single-Table Design with DynamoDB — Alex DeBrie
- AWS S3 vs Google Cloud Storage Comparison — AutoMQ
- Sharing objects with presigned URLs — AWS Documentation
- SQL vs NoSQL: Choosing the Right Database — ByteByteGo
- Top 5 NoSQL Database Myths — DataStax
- It's 2026, Just Use Postgres — Tiger Data
- Best Database Solutions for AI Agents (2025) — Fast.io
- Understanding Postgres GIN Indexes — pganalyze
- Apache Cassandra Case Studies — cassandra.apache.org
- MongoDB Sharding Documentation — mongodb.com
- MongoDB Schema Design Anti-Patterns — mongodb.com
- MongoDB ACID Transactions — mongodb.com
- MongoDB Atlas Vector Search — mongodb.com
- NoSQL Labyrinth: DynamoDB, Cassandra, MongoDB, Redis — Medium