Indexes
You've designed your schema and written your queries. The application works perfectly with test data. Then it goes live, traffic grows, and queries that returned results in milliseconds now take several seconds. Before you reach for more hardware, look at a simpler and far more powerful lever first: a database index.
Indexes are one of the highest-leverage tools in system design — a single well-placed index can turn a 5-second query into a 5-millisecond one. They are also one of the most common things AI coding agents overlook. AI agents produce schemas and queries that are logically correct but routinely omit the indexes that make those schemas perform at scale. Understanding indexes is not optional: it is how you catch what the AI missed before it reaches production.
The Book Analogy#
An index in a database works exactly like the index at the back of a non-fiction book. Without an index, finding every reference to "load balancing" in a 600-page book means reading every page. With an index, you look up "load balancing" alphabetically, get a list of page numbers, and jump directly to each one.
A database works the same way. Without an index, a query like WHERE user_id = 42 on a million-row table requires the database to read every row and check each one against the condition — a sequential scan. With an index on user_id, the database looks up the value in the index and jumps directly to the matching rows.
The reason tables need indexes comes down to how they store data. Unlike a spreadsheet where you might keep rows sorted, a database table has no inherent row order. New rows are written into the next available space on disk — they don't slot into a sorted position. The result is that the table's physical layout is an unsorted heap with no organizational structure that supports searching. An index is the structure you add to make searching efficient. It is maintained separately from the table and stays in sorted order, giving the database a fast path to specific rows without reading the entire table.
How a B-Tree Index Works#
The most common index type in relational databases (PostgreSQL, MySQL, SQLite) is the B-Tree. When you run CREATE INDEX without any additional options, a B-Tree index is created by default.
B-Tree Index Structure
A B-Tree has three layers: the root node at the top for navigation, branch nodes in the middle, and leaf nodes at the bottom that store the actual index entries and pointers to rows in the main table. Leaf nodes are linked in sorted order, enabling efficient range scans without re-traversing the tree.
The diagram above is a simplified illustration with 3 levels and only 3 entries per node. In practice, each B-Tree node holds hundreds of entries, which is why the tree stays remarkably shallow — typically 4–5 levels — even for tables with millions of rows.
How a search works in practice: To find user_id = 42, the database starts at the root, reads the branch node entries to determine which subtree contains 42, and traverses down to the appropriate leaf node. The entire path is 4–5 node reads for a million-row table. The root and upper branch nodes are typically held in the database's buffer cache — its in-memory pool of recently accessed disk pages — so in practice the search often requires only 1–2 physical disk reads. Compare this to reading every row in a million-row table.
How Indexes Speed Up Reads#
The performance difference between an indexed and an unindexed query on a large table can be dramatic — often 100× to 1,000× faster for selective lookups.
PostgreSQL's query planner chooses between these two strategies automatically based on table statistics. You can see its decision with EXPLAIN:
-- Without index
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
Seq Scan on orders (cost=0.00..18340.00 rows=5 width=64)
Filter: (user_id = 42)
-- After adding an index
CREATE INDEX orders_user_id_idx ON orders (user_id);
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
Index Scan using orders_user_id_idx on orders (cost=0.43..20.51 rows=5 width=64)
Index Cond: (user_id = 42)
The estimated cost drops from 18340 to 20.51 — roughly a 900× improvement. The cost values in EXPLAIN output are not milliseconds; they are an arbitrary unit representing the planner's estimated work, roughly proportional to the number of disk page reads. What matters is the relative difference: the index scan costs a tiny fraction of the sequential scan.
Indexes also speed up several other operations you might not expect:
- JOIN operations: The database uses the index to find matching rows in the joined table, rather than scanning the entire joined table for each row in the driving table.
- ORDER BY clauses: A B-Tree is already sorted, so the database can read leaf nodes in order without a separate sort step.
- WHERE on UPDATE and DELETE: The database must locate rows before modifying them. An unindexed filter on a large table means a full scan before any row is touched.
The Cost: Indexes Slow Down Writes#
Indexes are not free. Every time you insert, update, or delete a row, every index on that table must also be updated. This is write amplification: one logical write to your data becomes multiple physical writes across all your indexes.
In PostgreSQL, the main table storage is called the heap — an unsorted file where rows are stored in no particular order. When you write a row, the database writes to the heap first, then must update every index that covers any column in that row.
| Write Operation | What Happens to Each Index | Hidden Cost |
|---|---|---|
| INSERT | A new entry is added to every index on the table. If the target leaf node is already full, it splits into two nodes — a change that may cascade up to branch nodes. | One INSERT = 1 heap write + N index writes, where N is the number of indexes on the table. |
| UPDATE (indexed column) | The old index entry is marked dead and a new entry is inserted at the correct sorted position. If the updated value moves far in the sort order, it may trigger additional node splits. | Effectively a delete plus an insert per affected index — more expensive than a plain INSERT. |
| UPDATE (non-indexed column) | PostgreSQL has an optimization called HOT (Heap Only Tuple): if the updated column is not part of any index, and the new row version fits on the same heap page, the database can skip all index updates entirely. | Adding more indexes to a table covers more columns, which blocks HOT updates more often — making even routine updates more expensive than they need to be. |
| DELETE | Index entries for the deleted row are marked as dead but remain in place. They are reclaimed later by VACUUM, PostgreSQL's background cleanup process. | Dead entries accumulate as index bloat until VACUUM runs. On write-heavy tables, bloat can grow faster than VACUUM can clean it up. |
| Bulk INSERT (data migration) | Every inserted row triggers index maintenance on all indexes simultaneously. | Standard practice for large data loads: drop all indexes first, load the data, then rebuild the indexes. This can be 10–100× faster than loading with indexes in place. |
The practical rule: A table with 8 indexes incurs roughly 8× more index write I/O compared to a table with 1 index. This matters most for write-heavy workloads — event logs, audit trails, message queues, telemetry ingestion. On these tables, over-indexing can significantly degrade write throughput under load, sometimes more than the read benefit justifies.
Index Types#
Most of the time, a standard B-Tree index is all you need. But PostgreSQL offers specialized index types suited to different access patterns — and knowing when to reach for them is a mark of a careful engineer.
| Index Type | How It Works | Best For | Limitations |
|---|---|---|---|
| B-Tree (default) | Balanced tree with sorted leaf nodes; created by CREATE INDEX with no USING clause. | Equality (=), ranges (<, >, BETWEEN, IN), sorting (ORDER BY), prefix patterns (LIKE 'prefix%'), IS NULL. | Not efficient for unanchored patterns like LIKE '%suffix'; not suitable for full-text search. |
| Hash | Stores a 32-bit hash of the column value; created with CREATE INDEX ... USING HASH. | Equality-only (=) lookups where range queries and ordering are never needed. | Cannot support <, >, BETWEEN, ORDER BY, or unique constraints. Since B-Tree handles equality just as efficiently and is more general, Hash indexes are rarely preferred in practice. |
| Partial | Indexes only the rows that satisfy a WHERE predicate; created with CREATE INDEX ... WHERE condition. | Tables where queries almost always target a specific subset of rows: unprocessed jobs, active users, pending payments. | Only used when the query's WHERE clause logically implies the index predicate. Parameterized queries may not qualify at plan time, since the planner evaluates the predicate before knowing the parameter value. |
Covering (INCLUDE) | Adds extra columns to the leaf nodes without making them part of the search key; created with CREATE INDEX ... INCLUDE (col2, col3). | Read-heavy queries that only need the columns stored in the index — enables index-only scans that retrieve data entirely from the index, without touching the main table. | Extra columns add index storage overhead. Included columns cannot be used for filtering or ordering — they are carried along for retrieval only. |
| Full-Text (GIN) | An inverted index that maps each word or token to the list of rows containing it; created with CREATE INDEX ... USING GIN (tsvector_column). | Full-text search using the @@ operator with to_tsvector and to_tsquery. | Higher write overhead than B-Tree. Not useful for exact-match or range queries on structured data. |
A note on partial indexes: These are underused but extremely effective. Consider a jobs table with 10 million rows, where 9.9 million are marked completed = TRUE. Nearly all queries target the 100,000 unfinished jobs. A partial index — CREATE INDEX ON jobs (created_at) WHERE completed = FALSE — contains entries for only those 100,000 rows. It is 99% smaller than a full index and far faster to scan. AI agents almost never suggest partial indexes; you have to recognize the pattern yourself from the table's access characteristics.
Composite Indexes: Column Order Matters#
A composite index covers more than one column: CREATE INDEX ON orders (user_id, status, created_at). These are among the most powerful optimization tools, but they only work if you understand one critical rule: the leftmost prefix rule.
Composite Index: The Leftmost Prefix Rule
A composite index on (user_id, status, created_at) sorts entries by user_id first, then by status within each user_id group, then by created_at within each status group. A query can use this index efficiently only by constraining the leading columns first — skipping the first column leaves the index unusable for that query.
Column ordering principle: Place the most selective equality-filter column first, followed by other equality filters, then range-filter columns such as dates or prices. Consider the query WHERE user_id = 42 AND created_at > '2024-01-01'. The index (user_id, created_at) is far more effective than (created_at, user_id) — because the equality constraint on user_id eliminates all other users' rows before the range scan on created_at even begins. Reversing the order forces the database to scan every row where created_at > '2024-01-01' across all users, then filter by user_id afterward — a much larger scan.
Selectivity: Why Column Choice Matters as Much as Column Order#
Even with perfect column ordering, an index on the wrong column provides no benefit. Selectivity describes how precisely an index condition narrows the matching rows. A highly selective condition matches very few rows: WHERE email = 'alice@example.com' returns exactly one row from a million-row table. A low-selectivity condition matches most of the table: WHERE is_active = TRUE might match 95% of rows.
Cardinality — the number of distinct values in a column — is the key indicator of selectivity potential:
| Column Example | Cardinality | Selectivity | Good Index Candidate? |
|---|---|---|---|
users.email | ~1M distinct values in 1M rows | Very high — each lookup returns exactly 1 row | Yes — the primary use case for indexes |
orders.user_id | ~10K distinct users across 1M orders | High — each value matches ~100 rows on average | Yes — a commonly queried foreign key column |
orders.created_at | Many distinct timestamps | High for precise timestamps; moderate for date ranges | Yes — especially useful for time-range queries like the last 30 days |
users.country | ~200 distinct values in 1M rows | Medium — each value matches ~5,000 rows | Maybe — only if individual-country queries are frequent and latency-sensitive |
orders.status | 5 distinct values (pending, shipped, etc.) | Low — each value matches ~20% of rows | Rarely useful as a standalone index. Consider a partial index if one specific status value is queried far more than others. |
users.is_active | 2 distinct values (true/false) | Very low — true matches ~95% of rows | No — an index on a boolean is almost never chosen by the planner |
The query planner estimates selectivity using statistics it collects automatically when you run ANALYZE (or when the autovacuum daemon runs in the background). When the planner estimates that a condition matches more than roughly 5–10% of the table, it typically prefers a sequential scan over an index scan. The reason is physical: a sequential scan reads heap pages in order, one after another, which is efficient for spinning disks and easy for the OS to prefetch. An index scan on a large fraction of rows jumps between many scattered heap pages, which generates far more random I/O than the sequential scan it was meant to replace. The exact threshold varies by hardware and PostgreSQL configuration, but the underlying principle is constant.
The low-cardinality trap: Adding an index on is_active adds write overhead to every INSERT and UPDATE — but the planner ignores it for WHERE is_active = TRUE because returning 95% of the table through random index lookups is slower than a sequential scan. The index wastes space and slows writes without helping any read. This is the most common over-indexing mistake, and it is one AI agents make regularly by indexing columns that appear in WHERE clauses without considering how many rows those conditions actually match.
When NOT to Index#
More indexes is not always better. Each index is a maintenance burden: more write amplification, more storage consumed, and more work for VACUUM. Before adding an index, verify that the query planner will actually use it.
| Situation | Why the Index Does Not Help | What to Do Instead |
|---|---|---|
| Low-cardinality column (boolean, status with 2–5 values) | The planner estimates the condition matches too many rows and prefers a sequential scan — the index exists but is never chosen | Use a partial index if only a small, specific subset is queried (e.g., WHERE completed = FALSE, where only 1% of rows are incomplete) |
| Very small table (under ~1,000 rows) | The entire table fits on one or two disk pages — a sequential scan reads it in a single I/O, which is faster than navigating the index structure | Skip the index; the planner will choose the sequential scan regardless of whether one exists |
| Write-heavy table with rarely-queried columns | The index adds overhead to every INSERT, UPDATE, and DELETE but is never used for reads | Find unused indexes by querying pg_stat_user_indexes WHERE idx_scan = 0 and drop them |
| Column already covered by a composite index's leading prefix | A separate single-column index on user_id is redundant if an index on (user_id, status) already exists and the planner can use it for user_id-only queries | Run EXPLAIN to verify before adding a new index; redundant indexes waste write I/O for no read benefit |
| Primary key and unique constraint columns | PostgreSQL automatically creates a unique B-Tree index for every PRIMARY KEY and UNIQUE constraint when the table is created | Do not manually create duplicate indexes on these columns — the index already exists |
What AI Agents Get Wrong About Indexes#
AI coding agents produce syntactically correct schemas and queries — but they consistently miss the indexes that make those schemas perform in production. The failure patterns are predictable.
The Missing Foreign Key Index
PostgreSQL automatically creates an index on the PRIMARY KEY column. It does NOT create one on foreign key columns. AI agents routinely generate foreign key declarations without the corresponding indexes — every JOIN or filter through that foreign key then becomes a full sequential scan on the child table, regardless of how large it grows.
Other common AI index omissions:
-
No composite index for multi-column filters. A table always queried as
WHERE tenant_id = ? AND status = 'active'needs a composite index on(tenant_id, status). AI agents generate the schema without reasoning about access patterns — identifying and specifying the right composite indexes is your responsibility. -
No partial index for small, frequently-queried subsets. A
jobstable almost always queried withWHERE completed = FALSEbenefits from a partial index —CREATE INDEX ON jobs (created_at) WHERE completed = FALSE. AI agents generate full indexes or no indexes at all; partial indexes require you to recognize the pattern from the table's usage, not just its columns. -
No covering index for hot read paths. A query that always fetches
(id, name, email)from a users table benefits from a covering index —CREATE INDEX ON users (id) INCLUDE (name, email)— enabling an index-only scan that never touches the heap. AI agents rarely identify this optimization because it requires understanding the specific columns a query projects, not just the columns it filters on.
Finding Missing Indexes in Production#
When queries are slow, the first question to ask is: "Are my queries actually using indexes?" PostgreSQL gives you the tools to answer it directly.
EXPLAIN ANALYZE shows the actual execution plan and per-node timing for any query. Add ANALYZE to get real execution times rather than estimates:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
Seq Scan on orders (cost=0.00..18340.00 rows=5 width=64)
(actual time=0.05..234.12 rows=5 loops=1)
Filter: (user_id = 42)
Rows Removed by Filter: 999995
Key signals to look for in the output:
Seq Scanwith aFilter:clause on a large table — the filter is not using any index; this is the clearest signal that an index is missingRows Removed by Filter: 999995— the engine examined nearly a million rows to return 5; a missing index on a highly selective column- High
actual timeon a simple lookup — almost always a missing index or stale statistics (runANALYZE tablenameto refresh the planner's statistics) Index Scan— an index is being used; check the cost to confirm it's substantially lower than a sequential scan would beIndex Only Scan— the best possible outcome: the query was answered entirely from the index without touching the main table at all
After adding the index, run EXPLAIN again to confirm the plan changed:
CREATE INDEX orders_user_id_idx ON orders (user_id);
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
Index Scan using orders_user_id_idx on orders (cost=0.43..20.51 rows=5 width=64)
(actual time=0.04..0.09 rows=5 loops=1)
Index Cond: (user_id = 42)
Finding unused indexes — indexes that silently add write overhead without helping any read:
-- Indexes that have never been used since the last statistics reset
SELECT schemaname, relname AS table, indexrelname AS index, idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;
An index with times_used = 0 after weeks of representative production traffic is a strong candidate for removal — it consumes storage and slows every write without benefiting any query. Allow at least several days of production workload before drawing conclusions, since these statistics reset on server restart.
Finding slow queries using pg_stat_statements:
pg_stat_statements tracks cumulative execution statistics for every distinct query pattern, making it easy to identify which queries are costing the most time across all their invocations.
-- pg_stat_statements must be listed in shared_preload_libraries in postgresql.conf
-- and the server restarted before this extension can be created.
-- Ask your DBA or check your cloud provider's documentation to enable it.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total execution time across all calls
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Prioritize by total execution time rather than per-call time. A query that takes 50ms but runs 100,000 times per day accumulates 5,000 seconds of database work — far more impactful than a 5-second query that runs twice a week. This is where you find the indexes that make the biggest real-world difference.
Summary#
| Concept | Key Takeaway | AI Coding Implication |
|---|---|---|
| What an index is | A separate, sorted data structure that lets the database jump to matching rows instead of scanning every row — like a book index | AI agents generate schemas without analyzing which columns will be queried; you must specify index requirements explicitly |
| B-Tree structure | Root → branch → leaf nodes; leaf nodes are linked in sorted order for range scans; 4–5 levels handles millions of rows with consistent O(log n) performance | The default index type — CREATE INDEX always creates a B-Tree unless you specify USING |
| Read performance | Turns O(n) sequential scans into O(log n) tree traversals — often a 100–1,000× improvement for selective queries | Use EXPLAIN ANALYZE to verify that AI-generated queries actually use indexes before deploying to production |
| Write cost | Every index adds overhead to INSERT, UPDATE, and DELETE — write amplification scales with the number of indexes on the table | Over-indexing write-heavy tables is a real risk; AI agents don't account for write patterns when generating schemas |
| Foreign key indexes | PostgreSQL does NOT automatically index foreign key columns — you must add them manually after declaring the constraint | The most common AI index omission — check every foreign key in AI-generated schemas and add an index unless you have a specific reason not to |
| Composite indexes | Column order determines which queries can use the index: equality filters first, range filters last; skipping the leading column makes the index unusable | AI agents generate schemas without modeling access patterns — define your query patterns first, then specify index column order based on those patterns |
| Selectivity | Index high-cardinality columns (IDs, emails, timestamps); avoid indexing low-cardinality columns (booleans, status fields with a handful of values) | Indexing is_active or a 5-value status column is usually counterproductive — the planner ignores the index and the extra write overhead remains |
| Partial indexes | Index only the rows that queries actually target — produces a smaller, faster, more targeted index than a full-table index | Valuable for queues, task tables, and any scenario where queries consistently target a well-defined minority of rows; AI agents don't suggest them |
| Finding missing indexes | EXPLAIN ANALYZE reveals sequential scans and high row-removal counts; pg_stat_user_indexes reveals unused indexes; pg_stat_statements reveals the costliest queries | Run EXPLAIN on every critical query after AI generates the schema — this is a non-negotiable review step before deploying to production |
The engineers who get the most out of AI coding tools are not the ones who accept generated schemas without review. They are the ones who understand access patterns well enough to recognize which indexes the AI missed — and add them before the system reaches production.
Sources: