System Design

Welcome to the System Design tutorial. This guide walks you through how to architect software systems that scale, stay reliable, and handle real-world load — with a modern lens on AI-powered applications and the engineering decisions that come with them.

Why Learn System Design in the AI Era?

AI coding agents like Claude Code can scaffold an entire service, write migrations, and wire up infrastructure — all from a single prompt. But they operate within the architecture you define. Deciding whether to shard your database or add a read replica, choosing between a message queue and a synchronous call, estimating whether your system survives 10x traffic on the current design — that judgment still lives with you. The better you understand system design, the better you can direct AI agents to build the right thing, not just a working thing.

System design is the skill that turns code into production software. Without it, technically correct code still fails — it's just slow, expensive, or unreliable instead of broken. Understanding system design enables you to:

Direct AI agents effectively — Give AI the architectural context it needs to generate code that fits correctly into the larger system, not just code that compiles
Evaluate AI-generated architecture — Spot when a suggested design doesn't account for scale, cost, or failure modes
Make trade-off decisions — Choose between SQL and NoSQL, REST and WebSockets, monolith and microservices — based on actual requirements, not hype
Estimate before you build — Know roughly what something will cost and how it will perform before writing a single line of code
Design for the AI layer — Understand how context windows, token costs, caching, and model routing are first-class architectural concerns in 2026

Learning Path

Topics build on each other. Start with Foundations to build the core mental models, move through the Building Blocks every production system uses, then go deeper into Reliability, Distribution, and the AI Infrastructure Layer. The curriculum closes with Trade-offs & Production Reality and then Case Studies that tie everything together.

Topics

1. Foundations

The Mental Models

Move from "how code runs" to "how systems behave." Covers non-functional requirements, the CAP theorem, back-of-the-envelope estimation, and basic networking — the vocabulary you need before drawing any diagrams.

Topics covered:

Scalability vs. Performance vs. Reliability — three properties that are often confused but mean very different things
Latency vs. Throughput — and Time to First Token (TTFT) as the AI-specific latency metric
The CAP Theorem and why a distributed system must choose between Consistency and Availability during a network partition
Back-of-the-envelope math — estimating RPS, storage, and AI token costs before writing any code
HTTP/3, WebSockets, DNS, and Global Server Load Balancing
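
As a rough illustration of the back-of-the-envelope math mentioned above, the sketch below estimates average and peak RPS, yearly storage growth, and daily AI token spend from a handful of inputs. Every name and number here (the `Workload` fields, the traffic figures, the per-token price) is a hypothetical assumption chosen for illustration, not data from this tutorial or from any provider's price list.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical inputs for a rough capacity estimate."""
    daily_active_users: int
    requests_per_user_per_day: float
    peak_to_average_ratio: float      # how much busier the peak is than the average
    bytes_stored_per_request: int
    ai_call_fraction: float           # share of requests that invoke a model
    tokens_per_ai_call: int           # input + output tokens combined
    usd_per_million_tokens: float     # assumed blended price, not a real quote


def estimate(w: Workload) -> dict:
    """Return order-of-magnitude figures; precision is not the goal."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * w.peak_to_average_ratio
    storage_gb_per_year = requests_per_day * w.bytes_stored_per_request * 365 / 1e9
    tokens_per_day = requests_per_day * w.ai_call_fraction * w.tokens_per_ai_call
    ai_usd_per_day = tokens_per_day / 1e6 * w.usd_per_million_tokens
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "storage_gb_per_year": storage_gb_per_year,
        "ai_usd_per_day": ai_usd_per_day,
    }


if __name__ == "__main__":
    example = Workload(
        daily_active_users=1_000_000,
        requests_per_user_per_day=10,
        peak_to_average_ratio=3,
        bytes_stored_per_request=1_000,
        ai_call_fraction=0.1,
        tokens_per_ai_call=2_000,
        usd_per_million_tokens=5.0,
    )
    for name, value in estimate(example).items():
        print(f"{name}: {value:,.1f}")
```

With these assumed inputs the estimate comes out to roughly 116 RPS on average, about 350 RPS at peak, around 3.65 TB of new storage per year, and on the order of $10,000 per day in token spend. The point of the exercise is the order of magnitude: it tells you whether a single database node is plausible and whether the AI layer dominates the bill before any code exists.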

2. Core Building Blocks

Application Layer

Monolith vs. microservices — when to stay together and when to split.
