Schema Design Fundamentals

Before you write a single line of code — or prompt an AI agent to write it — you need to decide what data your system stores and how that data is structured. This is schema design: the practice of defining entities, their attributes, and the relationships between them before writing any code.

Getting the schema right from the start pays off at every stage. A well-designed schema is easy for AI agents to extend correctly, easy for developers to query efficiently, and resistant to the silent data corruption bugs that compound over time. A poorly designed schema does the opposite — and fixing it later means writing risky migrations, updating every query, and hoping nothing breaks in production.

This tutorial covers the foundational vocabulary and decisions every developer needs: what goes into a schema, how relationships are modeled, the constraints that enforce data correctness, and the specific mistakes AI agents consistently make when generating schemas.

From Spreadsheet to Schema#

A helpful mental model: think of a table as a spreadsheet. Each column is a named field with a specific type (name, email, timestamp). Each row is one record. The critical difference is that a relational database enforces the structure — the data types, the constraints, the connections between tables. That enforcement is what makes data trustworthy at scale. If an application tries to insert a row that violates any rule — a missing required field, a duplicate unique value, or a reference to a non-existent parent row — the database rejects it outright.

A schema is the full definition of all your tables, columns, types, constraints, and relationships. It is the contract between your application and your database.

Spreadsheet Concept	Database Equivalent	Key Difference
Sheet	Table	Tables enforce types and constraints on every row — invalid data is rejected
Column header	Column with a data type	Columns have strict types (`INT`, `TEXT`, `TIMESTAMP`), not freeform values
Row	Record	Rows must satisfy all column constraints to be inserted
Any value in a cell	Typed, constrained cell	The database rejects values that violate the column's type or constraints
Sheet reference (=Sheet2!A1)	Foreign key	Foreign keys are enforced — orphaned references pointing to non-existent rows are prevented

Entities, Attributes, and Tables#

The starting point for any schema is identifying your entities — the meaningful "things" your system tracks. Entities become tables. Their properties become columns.

An entity is a distinct object or concept: User, Order, Product, Post
An attribute is a property of that entity: a User has a name, email, and created_at
A relationship connects entities: a User places Orders

This analysis — identifying entities, attributes, and relationships — produces an Entity-Relationship Diagram (ERD), the visual blueprint of your schema. An ERD shows every table, its columns, and how the tables connect to each other. It is the most valuable artifact you can give an AI agent before asking it to generate any schema code, because it gives the agent a concrete specification to implement rather than a vague domain to interpret.

ERD: A Blog Platform

A blog platform has users who write posts, posts that have comments, and posts that are tagged. This ERD captures all entities, their key attributes, and how they relate. Each relationship line reads as a sentence: 'A User writes zero or many Posts. Each Post belongs to exactly one User.'

Rendering diagram...

A note on table naming: Use plural nouns for table names (users, not user; orders, not order) — this is the dominant SQL convention and the one followed throughout this tutorial. Your ORM entity classes and objects can still use singular names (User, Order); the table name in the database is what should be plural. Whatever convention your team chooses, document it in your CLAUDE.md or project rules file so every AI session follows the same convention automatically.

Primary Keys, Foreign Keys, and Constraints#

Constraints are the database's enforcement mechanism. They are the rules that make data trustworthy — rules that the database guarantees at every write, so your application code does not have to duplicate them. Even if a bug in your application tries to insert invalid data, a constraint will stop it at the database level.

Primary Keys#

Every table should have a primary key (PK) — a column whose value uniquely identifies each row. Primary keys cannot be NULL and cannot contain duplicates. Every row is guaranteed to be reachable by its primary key.

Use a surrogate key, not a natural key.

A natural key is something that exists in the real world: an email address, a phone number, or a username. The temptation is to use it as the primary key because it is already unique. The problem: it changes. A user updates their email, and now every table that stored that email as a foreign key must also be updated. Cascading updates across multiple tables are error-prone, slow, and can corrupt historical records — for example, an order that was placed under the old email may point to nothing after the update.

A surrogate key is synthetic — an auto-incremented integer (SERIAL / BIGSERIAL in Postgres) or a UUID (gen_random_uuid() in Postgres 13+). It carries no business meaning, never changes, and never needs to be updated. UUIDs are especially useful in distributed systems where records may be created across multiple services before being written to a central database, since UUIDs can be generated independently on any client without coordination.

-- Good: surrogate key
CREATE TABLE users (
    id    SERIAL PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,  -- unique, but not the PK; can be indexed separately
    username TEXT NOT NULL
);

-- Risky: natural key as PK
CREATE TABLE users (
    email    TEXT PRIMARY KEY,  -- what happens when a user changes their email?
    username TEXT NOT NULL
);

Foreign Keys#

A foreign key (FK) is a column in a child table that references the primary key of a parent table. It enforces referential integrity: the database will reject any attempt to insert a child row pointing to a non-existent parent.

CREATE TABLE posts (
    id        SERIAL PRIMARY KEY,
    author_id INTEGER NOT NULL,
    title     TEXT NOT NULL,
    FOREIGN KEY (author_id) REFERENCES users(id) ON DELETE CASCADE
);

The ON DELETE directive tells the database what to do when the referenced parent row is deleted:

Directive	Behavior When Parent Is Deleted	When to Use
`ON DELETE CASCADE`	Automatically delete all child rows	When children have no meaning without the parent (posts belong to a user — delete the user, delete their posts)
`ON DELETE RESTRICT`	Prevent deletion if any child rows exist	When orphaning would be a data bug (cannot delete a product that order line items reference)
`ON DELETE SET NULL`	Set the FK column to NULL on the child rows	When children can exist without a parent (delete a comment author but keep the comment)
`ON DELETE SET DEFAULT`	Reset the FK to a default value	When a fallback parent exists — rare, useful for a 'system' or 'unknown' user record

Always add an index on your foreign key columns. Without an index, every JOIN on that column forces the database to scan the entire table — checking every row to find matching records. At scale with millions of rows, this becomes catastrophically slow. This is one of the most consistent mistakes in AI-generated schemas: the FK constraint is declared, but the index is missing.

-- Always add this index after declaring a foreign key:
CREATE INDEX idx_posts_author_id ON posts(author_id);

Column Constraints#

Constraint	What It Enforces	Example
`NOT NULL`	Column must always have a value — NULLs are rejected on insert/update	`email TEXT NOT NULL`
`UNIQUE`	No two rows can have the same value in this column	`email TEXT UNIQUE` — prevents duplicate accounts
`CHECK`	Column value must satisfy a logical condition	`CHECK (price >= 0)` — prevents negative prices
`DEFAULT`	Provides an automatic value when none is supplied	`created_at TIMESTAMP DEFAULT NOW()`

AI agents routinely omit constraints. An AI generates syntactically valid SQL without NOT NULL or UNIQUE because the model's goal is to complete the prompt — not to enforce your data integrity rules. The missing constraints are silent failures: the schema runs, tests pass, and then one day a bug inserts a NULL into an email column, and your authentication query matches every user with a NULL email instead of nobody. Review every AI-generated schema for missing constraints before running any migration.

The Three Relationship Types#

Every relationship between two entities falls into one of three categories. Getting the category right is critical — it determines where the foreign key lives, whether you need a junction table, and what constraints are needed to prevent duplicate or orphaned data.

Relationship Types: 1:1, 1:N, and M:N

The three relationship types and their database implementations. The key question: how many instances of B can relate to one instance of A, and vice versa? The answer determines foreign key placement and whether a junction table is required.

Rendering diagram...

One-to-One (1:1) — Selective Splits#

One parent row relates to exactly one child row, and each child belongs to exactly one parent. The foreign key can live in either table, but it must have a UNIQUE constraint — without it, the database allows multiple child rows to reference the same parent, silently degrading the relationship into 1:N.

Use 1:1 relationships sparingly. Common reasons to split a table this way include:

Security isolation: storing sensitive fields (payment credentials, government IDs) in a separate table with stricter access controls
Performance: keeping large optional fields (profile bios, avatar metadata, settings blobs) out of the main table so that queries that don't need them stay fast

CREATE TABLE users (
    id       SERIAL PRIMARY KEY,
    email    TEXT NOT NULL UNIQUE,
    username TEXT NOT NULL
);

-- 1:1: each user has at most one extended profile
CREATE TABLE user_profiles (
    id         SERIAL PRIMARY KEY,
    user_id    INTEGER NOT NULL UNIQUE,   -- UNIQUE enforces the 1:1 limit
    bio        TEXT,
    avatar_url TEXT,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);

If you always query the profile together with the user, ask yourself whether the JOIN is worth the split. A 1:1 relationship only pays off when there is a concrete access pattern or security reason to keep the data separate.

One-to-Many (1:N) — The Workhorse#

One parent row relates to many child rows; each child belongs to exactly one parent. The foreign key always lives in the child (many) table.

-- One customer places many orders
CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE orders (
    id            SERIAL PRIMARY KEY,
    customer_id   INTEGER NOT NULL,       -- FK lives in the "many" side
    total_amount  NUMERIC(10, 2) NOT NULL,
    created_at    TIMESTAMP DEFAULT NOW(),
    FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT
);

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Many-to-Many (M:N) — The Junction Table#

When both sides of a relationship can have many connections, a single foreign key is not sufficient — there is no single table to put it in without duplicating data. You need a junction table (also called a join table or bridge table) — a dedicated table that holds a foreign key to each side and whose combined keys form a composite primary key.

Without Junction Table (broken)

Rendering diagram...

With Junction Table (correct)

Rendering diagram...

CREATE TABLE post_tags (
    post_id  INTEGER NOT NULL,
    tag_id   INTEGER NOT NULL,
    PRIMARY KEY (post_id, tag_id),          -- composite PK prevents duplicate tag assignments
    FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
    FOREIGN KEY (tag_id)  REFERENCES tags(id)  ON DELETE CASCADE
);

The junction table's composite primary key (post_id, tag_id) serves two purposes: it uniquely identifies each row, and it enforces the rule that a post cannot be tagged with the same tag twice. Junction tables can also carry additional columns that describe the relationship itself — an enrollment date, a user's role within a group, or an assigned-at timestamp — making them richer than a simple link between two tables.

Cardinality Notation: Reading ERDs#

When you read or generate ERDs, you will encounter Crow's Foot notation — the standard visual language for expressing how many rows can participate in a relationship. Each end of a relationship line has two symbols: one showing the minimum (can zero rows participate?) and one showing the maximum (is there a cap of one, or can there be many?). Learning to read it lets you verify AI-generated ERDs in seconds.

E-Commerce ERD with Crow's Foot Notation

Rendering diagram...

Always read each relationship line from both directions to fully understand the constraint:

CUSTOMER ||--o{ ORDER → "A customer places zero or many orders. Each order belongs to exactly one customer."
ORDER ||--|{ LINE_ITEM → "An order contains one or many line items. Each line item belongs to exactly one order."

Symbol (line end)	Meaning	Example reading
`\|\|` (two bars)	Exactly one — mandatory	Each order belongs to exactly one customer
`o\|` (circle + bar)	Zero or one — optional	A user may have zero or one billing address
`o{` (circle + crow's foot)	Zero or many — optional	A customer may place zero or many orders
`\|{` (bar + crow's foot)	One or many — mandatory	An order must contain at least one line item

The inner symbol (the one closest to the entity box) represents the minimum cardinality — a circle means zero (the relationship is optional), a vertical bar means one (mandatory, at least one must exist). The outer symbol represents the maximum cardinality — a single vertical bar means at most one, a crow's foot (the three-pronged mark) means many.

JSONB vs. Normalized Columns#

Modern databases like PostgreSQL let you store JSON objects directly in a column using the JSONB type. This is genuinely useful in specific situations, but it is frequently misused as a shortcut to avoid thinking through the data model — and the performance consequences at scale can be severe.

Scenario	Normalized Column	JSONB
You filter or sort by this field (`WHERE status = 'active'`)	Yes — index works, query planner has statistics	No — JSONB filtering is significantly slower without a GIN index, and even with one the planner lacks column statistics
The field's structure varies per row (event payloads, user-defined forms)	No — you would need too many nullable columns	Yes — the schema varies legitimately row to row
The field appears in almost every row	Yes — predictable access, compact storage	No — JSONB storage is roughly 2× the size of equivalent normalized columns
You retrieve the field as a whole object without filtering inside it	Either works	Yes — JSONB is convenient for whole-object reads
You need to JOIN on this field or use it in aggregations	Yes — required for reasonable performance	No — JSONB fields cannot participate in FK constraints or standard index scans

The performance trap: PostgreSQL's query planner cannot collect statistics on fields inside a JSONB column the way it can for regular columns. When you filter by a JSONB field (e.g., WHERE data->>'status' = 'active'), the planner falls back to a hardcoded 0.1% selectivity estimate — meaning it assumes the filter matches almost nobody, regardless of the actual data distribution. This causes the planner to choose catastrophically bad query plans. Heap Engineering documented a real case where a query ran in 584 seconds instead of 300ms — a 2,000× slowdown — because the planner had no column statistics to work with on a JSONB field.

The rule of thumb: if you will ever filter, sort, or join on a field, it belongs in a dedicated normalized column. JSONB is appropriate only for genuinely dynamic, schema-varying data that you retrieve as a whole object — event payloads, user-defined form responses, or configuration blobs where the shape of the data legitimately varies from row to row.

Naming Conventions: Consistency the AI Will Not Provide#

AI agents apply no consistent naming convention unless you explicitly specify one. Left unconstrained, an agent will mix userId, user_id, and UserID across the same schema — a patchwork that breaks ORM auto-mapping, confuses query builders, and causes silent bugs in case-sensitive databases.

Thing to Name	Convention	Examples
Tables	lowercase plural with underscores	`users`, `orders`, `post_tags`
Columns	lowercase with underscores	`first_name`, `created_at`, `is_active`
Primary key	`id` in every table	`id SERIAL PRIMARY KEY`
Foreign keys	`<referenced_table_singular>_id`	`customer_id`, `post_id`, `author_id`
Boolean columns	Prefix with `is_`, `has_`, or `can_`	`is_active`, `has_verified_email`, `can_publish`
Timestamp columns	Suffix with `_at` for events, `_date` for dates	`created_at`, `published_at`, `birth_date`
Junction tables	Both table names joined with underscore	`post_tags`, `student_courses`, `user_roles`
Indexes	`idx_<table>_<column>`	`idx_orders_customer_id`, `idx_posts_author_id`

Put your naming conventions in a CLAUDE.md or project rules file. Every AI session that generates schema code will then follow them automatically, rather than inventing its own convention from session to session.

How AI Agents Get Schema Design Wrong#

Schema design is one of the places where AI agents are most likely to produce something that looks correct and runs without errors but is structurally wrong. These failure patterns are consistent and predictable — which means you can check for them systematically on every AI-generated schema before it touches production.

AI Schema Design Failure Modes

AI agents generate syntactically valid SQL but routinely miss semantic correctness — the structural rules that make data trustworthy. Research confirms that LLMs do not infer cardinality constraints properly, omit foreign key declarations, and produce inconsistent naming unless given explicit instructions.

Rendering diagram...

Your Schema as a Spec#

The most important practice when working with AI agents on data modeling is to treat the schema as your specification — the most important architectural document you produce before any code is generated. Code changes are cheap; schema migrations on a production database with live data are expensive and risky. A missed constraint or a wrong relationship type is easy to fix in development and painful to fix after launch.

Schema-First Development Workflow

Rendering diagram...

The review step in this workflow is not optional. An AI agent that generates a schema without constraints produces a database that accepts invalid data silently — and those silent failures surface as data integrity bugs weeks or months later, in production, when they are hardest to diagnose and most expensive to fix.

Summary#

Concept	The Key Decision	Common AI Mistake
Entities and attributes	Identify the main nouns in your system; each becomes a table with typed columns	Adding extra tables not requested, or missing junction tables for M:N relationships
Primary keys	Always use surrogate keys (`SERIAL` or UUID) — never natural keys like email or username	Using email or username as PK; breaks when users change them, cascades to every child table
Foreign keys	FK lives in the child table; declare `ON DELETE` behavior and add an index on the FK column	Omitting FK declarations entirely, or declaring FKs without the required index
1:1 relationships	FK on either side with a `UNIQUE` constraint to enforce the one-to-one limit	Treating 1:1 as 1:N — no `UNIQUE` constraint, allowing multiple children
1:N relationships	FK on the many (child) side	Placing the FK on the wrong side, or modeling 1:N as M:N unnecessarily
M:N relationships	Junction table with FKs to both sides and a composite primary key	Generating a 1:N structure and losing the many-to-many semantics
Constraints	`NOT NULL` on required fields, `UNIQUE` on unique fields, `CHECK` for domain rules	Omitting all constraints — passes syntax checks but allows corrupt data
JSONB vs. columns	Normalize fields you filter on; use JSONB only for genuinely dynamic, whole-object data	Using JSONB 'for flexibility' on filterable fields, causing 2,000× query slowdowns at scale
Naming conventions	lowercase_with_underscores everywhere; FK named as `<table>_id`	Mixing `camelCase`, `snake_case`, and `PascalCase` across the same schema

Sources:

PreviousLoad Balancing & Traffic

NextIndexes

Schema Design Fundamentals

ERD: A Blog Platform

Relationship Types: 1:1, 1:N, and M:N

AI Schema Design Failure Modes

Arch Advisor