System Design

Welcome to the System Design tutorial. This guide walks you through how to architect software systems that scale, stay reliable, and handle real-world load — with a modern lens on AI-powered applications and the engineering decisions that come with them.

Why Learn System Design in the AI Era?

AI coding agents like Claude Code can scaffold an entire service, write migrations, and wire up infrastructure — all from a single prompt. But they operate within the architecture you define. Deciding whether to shard your database or add a read replica, choosing between a message queue and a synchronous call, estimating whether your system survives 10x traffic on the current design — that judgment still lives with you. The better you understand system design, the better you can direct AI agents to build the right thing, not just a working thing.

System design is the skill that turns code into production software. Without it, technically correct code still fails — it's just slow, expensive, or unreliable instead of broken. Understanding system design enables you to:

  • Direct AI agents effectively: give AI the architectural context it needs to generate code that fits into the larger system, not just code that compiles
  • Evaluate AI-generated architecture: spot when a suggested design doesn't account for scale, cost, or failure modes
  • Make trade-off decisions: choose between SQL and NoSQL, REST and WebSockets, or monolith and microservices based on actual requirements, not hype
  • Estimate before you build: know roughly what something will cost and how it will perform before writing a single line of code
  • Design for the AI layer: understand why context windows, token costs, caching, and model routing are first-class architectural concerns in 2026

Learning Path

Topics build on each other. Start with Foundations to build the core mental models, move through the Building Blocks every production system uses, then go deeper into Reliability, Distribution, and the AI Infrastructure Layer. The curriculum closes with Trade-offs & Production Reality and then Case Studies that tie everything together.

Topics

1. Foundations

The Mental Models

Move from "how code runs" to "how systems behave." Covers non-functional requirements, the CAP theorem, back-of-the-envelope estimation, and basic networking — the vocabulary you need before drawing any diagrams.

Topics covered:

  • Scalability vs. Performance vs. Reliability: three properties that are often confused but mean very different things
  • Latency vs. Throughput: plus Time to First Token (TTFT), the latency metric specific to AI applications
  • The CAP Theorem: why a distributed system must choose between Consistency and Availability during a network partition
  • Back-of-the-envelope math: estimating requests per second (RPS), storage, and AI token costs before writing any code
  • HTTP/3, WebSockets, DNS, and Global Server Load Balancing
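
To make the back-of-the-envelope item concrete, here is a minimal Python sketch of the kind of estimate it refers to. Every input (user count, requests per user, peak factor, payload size, tokens per request, and the token price) is a hypothetical placeholder chosen for illustration, not a figure from this tutorial or from any provider's price list.

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_active_users, requests_per_user_per_day, peak_factor,
             bytes_per_request, tokens_per_request, usd_per_million_tokens):
    """Return rough daily load, storage, and token-cost figures.

    All parameters are assumptions supplied by the caller; the point is
    the order of magnitude, not precision.
    """
    requests_per_day = daily_active_users * requests_per_user_per_day

    # Average requests per second, then a simple multiplier for peak hours.
    avg_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor

    # Storage written per day, in gigabytes (1 GB taken as 10**9 bytes).
    storage_gb_per_day = requests_per_day * bytes_per_request / 1e9

    # Token spend per day, priced per million tokens.
    tokens_per_day = requests_per_day * tokens_per_request
    token_cost_usd_per_day = tokens_per_day / 1e6 * usd_per_million_tokens

    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "storage_gb_per_day": storage_gb_per_day,
        "token_cost_usd_per_day": token_cost_usd_per_day,
    }


if __name__ == "__main__":
    # Hypothetical workload: 1M daily users, 10 requests each, 3x peak,
    # 2 KB stored per request, 1,500 tokens per request at $2 per 1M tokens.
    result = estimate(
        daily_active_users=1_000_000,
        requests_per_user_per_day=10,
        peak_factor=3,
        bytes_per_request=2_000,
        tokens_per_request=1_500,
        usd_per_million_tokens=2.0,
    )
    for name, value in result.items():
        print(f"{name}: {value:,.1f}")
```

With these assumed inputs the sketch gives roughly 116 requests per second on average, about 20 GB of new storage per day, and about $30,000 per day in token spend. Under these numbers the token bill, not the request rate, is the figure that would shape the design, which is the kind of conclusion an estimate like this is meant to surface early.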

2. Core Building Blocks

3. Data Modeling

4. Reliability & Scalability

5. Distributed Communication

6. Containers & Deployment

7. The AI Infrastructure Layer

8. Security & Observability

9. Trade-offs & Production Reality

10. System Design Case Studies