Schema Design Fundamentals
Before you write a single line of code — or prompt an AI agent to write it — you need to decide what data your system stores and how that data is structured. This is schema design: the practice of defining entities, their attributes, and the relationships between them before writing any code.
Getting the schema right from the start pays off at every stage. A well-designed schema is easy for AI agents to extend correctly, easy for developers to query efficiently, and resistant to the silent data corruption bugs that compound over time. A poorly designed schema does the opposite — and fixing it later means writing risky migrations, updating every query, and hoping nothing breaks in production.
This tutorial covers the foundational vocabulary and decisions every developer needs: what goes into a schema, how relationships are modeled, the constraints that enforce data correctness, and the specific mistakes AI agents consistently make when generating schemas.
From Spreadsheet to Schema#
A helpful mental model: think of a table as a spreadsheet. Each column is a named field with a specific type (name, email, timestamp). Each row is one record. The critical difference is that a relational database enforces the structure — the data types, the constraints, the connections between tables. That enforcement is what makes data trustworthy at scale. If an application tries to insert a row that violates any rule — a missing required field, a duplicate unique value, or a reference to a non-existent parent row — the database rejects it outright.
A schema is the full definition of all your tables, columns, types, constraints, and relationships. It is the contract between your application and your database.
| Spreadsheet Concept | Database Equivalent | Key Difference |
|---|---|---|
| Sheet | Table | Tables enforce types and constraints on every row — invalid data is rejected |
| Column header | Column with a data type | Columns have strict types (INT, TEXT, TIMESTAMP), not freeform values |
| Row | Record | Rows must satisfy all column constraints to be inserted |
| Any value in a cell | Typed, constrained cell | The database rejects values that violate the column's type or constraints |
| Sheet reference (=Sheet2!A1) | Foreign key | Foreign keys are enforced — orphaned references pointing to non-existent rows are prevented |
Entities, Attributes, and Tables#
The starting point for any schema is identifying your entities — the meaningful "things" your system tracks. Entities become tables. Their properties become columns.
- An entity is a distinct object or concept:
User,Order,Product,Post - An attribute is a property of that entity: a
Userhas aname,email, andcreated_at - A relationship connects entities: a
UserplacesOrders
This analysis — identifying entities, attributes, and relationships — produces an Entity-Relationship Diagram (ERD), the visual blueprint of your schema. An ERD shows every table, its columns, and how the tables connect to each other. It is the most valuable artifact you can give an AI agent before asking it to generate any schema code, because it gives the agent a concrete specification to implement rather than a vague domain to interpret.
ERD: A Blog Platform
A blog platform has users who write posts, posts that have comments, and posts that are tagged. This ERD captures all entities, their key attributes, and how they relate. Each relationship line reads as a sentence: 'A User writes zero or many Posts. Each Post belongs to exactly one User.'
A note on table naming: Use plural nouns for table names (users, not user; orders, not order) — this is the dominant SQL convention and the one followed throughout this tutorial. Your ORM entity classes and objects can still use singular names (User, Order); the table name in the database is what should be plural. Whatever convention your team chooses, document it in your CLAUDE.md or project rules file so every AI session follows the same convention automatically.
Primary Keys, Foreign Keys, and Constraints#
Constraints are the database's enforcement mechanism. They are the rules that make data trustworthy — rules that the database guarantees at every write, so your application code does not have to duplicate them. Even if a bug in your application tries to insert invalid data, a constraint will stop it at the database level.
Primary Keys#
Every table should have a primary key (PK) — a column whose value uniquely identifies each row. Primary keys cannot be NULL and cannot contain duplicates. Every row is guaranteed to be reachable by its primary key.
Use a surrogate key, not a natural key.
A natural key is something that exists in the real world: an email address, a phone number, or a username. The temptation is to use it as the primary key because it is already unique. The problem: it changes. A user updates their email, and now every table that stored that email as a foreign key must also be updated. Cascading updates across multiple tables are error-prone, slow, and can corrupt historical records — for example, an order that was placed under the old email may point to nothing after the update.
A surrogate key is synthetic — an auto-incremented integer (SERIAL / BIGSERIAL in Postgres) or a UUID (gen_random_uuid() in Postgres 13+). It carries no business meaning, never changes, and never needs to be updated. UUIDs are especially useful in distributed systems where records may be created across multiple services before being written to a central database, since UUIDs can be generated independently on any client without coordination.
-- Good: surrogate key
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE, -- unique, but not the PK; can be indexed separately
username TEXT NOT NULL
);
-- Risky: natural key as PK
CREATE TABLE users (
email TEXT PRIMARY KEY, -- what happens when a user changes their email?
username TEXT NOT NULL
);
Foreign Keys#
A foreign key (FK) is a column in a child table that references the primary key of a parent table. It enforces referential integrity: the database will reject any attempt to insert a child row pointing to a non-existent parent.
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
author_id INTEGER NOT NULL,
title TEXT NOT NULL,
FOREIGN KEY (author_id) REFERENCES users(id) ON DELETE CASCADE
);
The ON DELETE directive tells the database what to do when the referenced parent row is deleted:
| Directive | Behavior When Parent Is Deleted | When to Use |
|---|---|---|
ON DELETE CASCADE | Automatically delete all child rows | When children have no meaning without the parent (posts belong to a user — delete the user, delete their posts) |
ON DELETE RESTRICT | Prevent deletion if any child rows exist | When orphaning would be a data bug (cannot delete a product that order line items reference) |
ON DELETE SET NULL | Set the FK column to NULL on the child rows | When children can exist without a parent (delete a comment author but keep the comment) |
ON DELETE SET DEFAULT | Reset the FK to a default value | When a fallback parent exists — rare, useful for a 'system' or 'unknown' user record |
Always add an index on your foreign key columns. Without an index, every JOIN on that column forces the database to scan the entire table — checking every row to find matching records. At scale with millions of rows, this becomes catastrophically slow. This is one of the most consistent mistakes in AI-generated schemas: the FK constraint is declared, but the index is missing.
-- Always add this index after declaring a foreign key:
CREATE INDEX idx_posts_author_id ON posts(author_id);
Column Constraints#
| Constraint | What It Enforces | Example |
|---|---|---|
NOT NULL | Column must always have a value — NULLs are rejected on insert/update | email TEXT NOT NULL |
UNIQUE | No two rows can have the same value in this column | email TEXT UNIQUE — prevents duplicate accounts |
CHECK | Column value must satisfy a logical condition | CHECK (price >= 0) — prevents negative prices |
DEFAULT | Provides an automatic value when none is supplied | created_at TIMESTAMP DEFAULT NOW() |
AI agents routinely omit constraints. An AI generates syntactically valid SQL without NOT NULL or UNIQUE because the model's goal is to complete the prompt — not to enforce your data integrity rules. The missing constraints are silent failures: the schema runs, tests pass, and then one day a bug inserts a NULL into an email column, and your authentication query matches every user with a NULL email instead of nobody. Review every AI-generated schema for missing constraints before running any migration.
The Three Relationship Types#
Every relationship between two entities falls into one of three categories. Getting the category right is critical — it determines where the foreign key lives, whether you need a junction table, and what constraints are needed to prevent duplicate or orphaned data.
Relationship Types: 1:1, 1:N, and M:N
The three relationship types and their database implementations. The key question: how many instances of B can relate to one instance of A, and vice versa? The answer determines foreign key placement and whether a junction table is required.
One-to-One (1:1) — Selective Splits#
One parent row relates to exactly one child row, and each child belongs to exactly one parent. The foreign key can live in either table, but it must have a UNIQUE constraint — without it, the database allows multiple child rows to reference the same parent, silently degrading the relationship into 1:N.
Use 1:1 relationships sparingly. Common reasons to split a table this way include:
- Security isolation: storing sensitive fields (payment credentials, government IDs) in a separate table with stricter access controls
- Performance: keeping large optional fields (profile bios, avatar metadata, settings blobs) out of the main table so that queries that don't need them stay fast
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
username TEXT NOT NULL
);
-- 1:1: each user has at most one extended profile
CREATE TABLE user_profiles (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL UNIQUE, -- UNIQUE enforces the 1:1 limit
bio TEXT,
avatar_url TEXT,
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
If you always query the profile together with the user, ask yourself whether the JOIN is worth the split. A 1:1 relationship only pays off when there is a concrete access pattern or security reason to keep the data separate.
One-to-Many (1:N) — The Workhorse#
One parent row relates to many child rows; each child belongs to exactly one parent. The foreign key always lives in the child (many) table.
-- One customer places many orders
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL, -- FK lives in the "many" side
total_amount NUMERIC(10, 2) NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT
);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Many-to-Many (M:N) — The Junction Table#
When both sides of a relationship can have many connections, a single foreign key is not sufficient — there is no single table to put it in without duplicating data. You need a junction table (also called a join table or bridge table) — a dedicated table that holds a foreign key to each side and whose combined keys form a composite primary key.
CREATE TABLE post_tags (
post_id INTEGER NOT NULL,
tag_id INTEGER NOT NULL,
PRIMARY KEY (post_id, tag_id), -- composite PK prevents duplicate tag assignments
FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
);
The junction table's composite primary key (post_id, tag_id) serves two purposes: it uniquely identifies each row, and it enforces the rule that a post cannot be tagged with the same tag twice. Junction tables can also carry additional columns that describe the relationship itself — an enrollment date, a user's role within a group, or an assigned-at timestamp — making them richer than a simple link between two tables.
Cardinality Notation: Reading ERDs#
When you read or generate ERDs, you will encounter Crow's Foot notation — the standard visual language for expressing how many rows can participate in a relationship. Each end of a relationship line has two symbols: one showing the minimum (can zero rows participate?) and one showing the maximum (is there a cap of one, or can there be many?). Learning to read it lets you verify AI-generated ERDs in seconds.
Always read each relationship line from both directions to fully understand the constraint:
CUSTOMER ||--o{ ORDER→ "A customer places zero or many orders. Each order belongs to exactly one customer."ORDER ||--|{ LINE_ITEM→ "An order contains one or many line items. Each line item belongs to exactly one order."
| Symbol (line end) | Meaning | Example reading |
|---|---|---|
|| (two bars) | Exactly one — mandatory | Each order belongs to exactly one customer |
o| (circle + bar) | Zero or one — optional | A user may have zero or one billing address |
o{ (circle + crow's foot) | Zero or many — optional | A customer may place zero or many orders |
|{ (bar + crow's foot) | One or many — mandatory | An order must contain at least one line item |
The inner symbol (the one closest to the entity box) represents the minimum cardinality — a circle means zero (the relationship is optional), a vertical bar means one (mandatory, at least one must exist). The outer symbol represents the maximum cardinality — a single vertical bar means at most one, a crow's foot (the three-pronged mark) means many.
JSONB vs. Normalized Columns#
Modern databases like PostgreSQL let you store JSON objects directly in a column using the JSONB type. This is genuinely useful in specific situations, but it is frequently misused as a shortcut to avoid thinking through the data model — and the performance consequences at scale can be severe.
| Scenario | Normalized Column | JSONB |
|---|---|---|
You filter or sort by this field (WHERE status = 'active') | Yes — index works, query planner has statistics | No — JSONB filtering is significantly slower without a GIN index, and even with one the planner lacks column statistics |
| The field's structure varies per row (event payloads, user-defined forms) | No — you would need too many nullable columns | Yes — the schema varies legitimately row to row |
| The field appears in almost every row | Yes — predictable access, compact storage | No — JSONB storage is roughly 2× the size of equivalent normalized columns |
| You retrieve the field as a whole object without filtering inside it | Either works | Yes — JSONB is convenient for whole-object reads |
| You need to JOIN on this field or use it in aggregations | Yes — required for reasonable performance | No — JSONB fields cannot participate in FK constraints or standard index scans |
The performance trap: PostgreSQL's query planner cannot collect statistics on fields inside a JSONB column the way it can for regular columns. When you filter by a JSONB field (e.g., WHERE data->>'status' = 'active'), the planner falls back to a hardcoded 0.1% selectivity estimate — meaning it assumes the filter matches almost nobody, regardless of the actual data distribution. This causes the planner to choose catastrophically bad query plans. Heap Engineering documented a real case where a query ran in 584 seconds instead of 300ms — a 2,000× slowdown — because the planner had no column statistics to work with on a JSONB field.
The rule of thumb: if you will ever filter, sort, or join on a field, it belongs in a dedicated normalized column. JSONB is appropriate only for genuinely dynamic, schema-varying data that you retrieve as a whole object — event payloads, user-defined form responses, or configuration blobs where the shape of the data legitimately varies from row to row.
Naming Conventions: Consistency the AI Will Not Provide#
AI agents apply no consistent naming convention unless you explicitly specify one. Left unconstrained, an agent will mix userId, user_id, and UserID across the same schema — a patchwork that breaks ORM auto-mapping, confuses query builders, and causes silent bugs in case-sensitive databases.
| Thing to Name | Convention | Examples |
|---|---|---|
| Tables | lowercase plural with underscores | users, orders, post_tags |
| Columns | lowercase with underscores | first_name, created_at, is_active |
| Primary key | id in every table | id SERIAL PRIMARY KEY |
| Foreign keys | <referenced_table_singular>_id | customer_id, post_id, author_id |
| Boolean columns | Prefix with is_, has_, or can_ | is_active, has_verified_email, can_publish |
| Timestamp columns | Suffix with _at for events, _date for dates | created_at, published_at, birth_date |
| Junction tables | Both table names joined with underscore | post_tags, student_courses, user_roles |
| Indexes | idx_<table>_<column> | idx_orders_customer_id, idx_posts_author_id |
Put your naming conventions in a CLAUDE.md or project rules file. Every AI session that generates schema code will then follow them automatically, rather than inventing its own convention from session to session.
How AI Agents Get Schema Design Wrong#
Schema design is one of the places where AI agents are most likely to produce something that looks correct and runs without errors but is structurally wrong. These failure patterns are consistent and predictable — which means you can check for them systematically on every AI-generated schema before it touches production.
AI Schema Design Failure Modes
AI agents generate syntactically valid SQL but routinely miss semantic correctness — the structural rules that make data trustworthy. Research confirms that LLMs do not infer cardinality constraints properly, omit foreign key declarations, and produce inconsistent naming unless given explicit instructions.
Your Schema as a Spec#
The most important practice when working with AI agents on data modeling is to treat the schema as your specification — the most important architectural document you produce before any code is generated. Code changes are cheap; schema migrations on a production database with live data are expensive and risky. A missed constraint or a wrong relationship type is easy to fix in development and painful to fix after launch.
The review step in this workflow is not optional. An AI agent that generates a schema without constraints produces a database that accepts invalid data silently — and those silent failures surface as data integrity bugs weeks or months later, in production, when they are hardest to diagnose and most expensive to fix.
Summary#
| Concept | The Key Decision | Common AI Mistake |
|---|---|---|
| Entities and attributes | Identify the main nouns in your system; each becomes a table with typed columns | Adding extra tables not requested, or missing junction tables for M:N relationships |
| Primary keys | Always use surrogate keys (SERIAL or UUID) — never natural keys like email or username | Using email or username as PK; breaks when users change them, cascades to every child table |
| Foreign keys | FK lives in the child table; declare ON DELETE behavior and add an index on the FK column | Omitting FK declarations entirely, or declaring FKs without the required index |
| 1:1 relationships | FK on either side with a UNIQUE constraint to enforce the one-to-one limit | Treating 1:1 as 1:N — no UNIQUE constraint, allowing multiple children |
| 1:N relationships | FK on the many (child) side | Placing the FK on the wrong side, or modeling 1:N as M:N unnecessarily |
| M:N relationships | Junction table with FKs to both sides and a composite primary key | Generating a 1:N structure and losing the many-to-many semantics |
| Constraints | NOT NULL on required fields, UNIQUE on unique fields, CHECK for domain rules | Omitting all constraints — passes syntax checks but allows corrupt data |
| JSONB vs. columns | Normalize fields you filter on; use JSONB only for genuinely dynamic, whole-object data | Using JSONB 'for flexibility' on filterable fields, causing 2,000× query slowdowns at scale |
| Naming conventions | lowercase_with_underscores everywhere; FK named as <table>_id | Mixing camelCase, snake_case, and PascalCase across the same schema |
Sources:
- Introduction of ER Model - GeeksforGeeks
- Crow's Foot Notation - FreeCodeCamp
- Primary and Foreign Key Constraints - Microsoft Learn
- Relationships in SQL - GeeksforGeeks
- When To Avoid JSONB In A PostgreSQL Schema - Heap
- Seven Essential Database Schema Best Practices - Fivetran
- 10 Rules for a Better SQL Schema - Sisense
- Experimenting With Generative AI For Database Design - Medium
- Why Your AI Coding Assistant Keeps Doing It Wrong - Pete Hodgson