Notification System

A notification system delivers timely alerts to users across multiple channels: a push notification on your phone's lock screen, an email confirming a purchase, or an SMS with a one-time login code. Behind these seemingly simple messages lies one of the most reliability-demanding problems in distributed systems.

You interact with notification systems every day: the "your order has shipped" email from Amazon, the "someone liked your post" badge on Instagram, the security code text from your bank. What makes these systems deceptively hard is a combination of three factors: scale (billions of notifications per day across a global user base), channel diversity (each delivery channel, whether push, email, or SMS, has its own protocol, rate limits, and failure modes), and reliability requirements (important notifications must eventually arrive even when devices are offline, providers are down, or the sending service crashes mid-delivery).

This case study answers a question every developer eventually faces: why do some notifications arrive late, or not at all? The answer reveals the core architectural decisions behind every production notification system.

Every case study in this section follows the same framework:

Clarify constraints (Steps 1–2) — What does the system do, and how much traffic must it handle?
High-level design (Step 3) — What are the major components and how do they connect?
Deep dives (Steps 4–7) — How do the trickiest parts actually work?
Trade-offs (Step 8) — What did we give up, and when would we choose differently?

Step 1: Clarify Requirements

Before designing anything, pin down exactly what the system needs to do and how well it needs to do it.

Functional Requirements

These describe what the system does.

Feature	Description	Priority
Send push notifications	Deliver alerts to iOS devices via Apple Push Notification service (APNs) and Android devices via Firebase Cloud Messaging (FCM)	Core
Send emails	Deliver transactional and marketing emails via providers like SendGrid	Core
Send SMS	Deliver text messages via providers like Twilio for OTP codes and alerts	Core
In-app notifications	Show real-time alerts inside the app via WebSockets when users are active	Core
User preferences	Let users opt in or out per notification type and per channel; enforce Do-Not-Disturb hours	Core
Delivery tracking	Record the status of every notification: queued, sent, delivered, or failed	Core
Notification templates	Store reusable message templates with variable substitution (`Hi {{first_name}}`)	Core
Rate limiting	Cap how many notifications a user receives per day or week, per channel	Core
Scheduled notifications	Send a notification at a specific future time, such as a daily digest at 9am local time	Optional

Non-Functional Requirements

These describe how well the system works.

Property	Requirement	Why It Matters
Reliability	At-least-once delivery: every notification must be attempted even if a server crashes mid-send	A missed security alert or password reset is a trust-breaking failure — silent drops are worse than delays
Latency	Transactional notifications (OTP, security alerts) delivered within seconds; promotional within minutes	Users abandon password reset flows if the SMS code takes more than 30 seconds to arrive
Scalability	Handle bursts from under 100 to over 100,000 notifications/second during marketing campaigns	A promotional blast to all users must not crash the system or delay critical alerts
Availability	99.9%+ uptime on the notification pipeline; failures must be queued, not dropped	A downed notification pipeline means every downstream service loses its alerting capability simultaneously
Idempotency	Sending the same notification twice must not deliver it twice to the user	At-least-once delivery makes duplicates possible; idempotency prevents them from reaching the user
Observability	Full audit trail per notification: when it was sent, which provider was used, and what result was returned	Debugging 'I never received my OTP' requires a complete per-notification event log

The central tension in this system: reliability and latency pull in opposite directions. The most reliable delivery mechanism — store in a queue, retry until success — introduces latency. The lowest-latency mechanism — a direct synchronous API call — has no retry and fails silently if the provider is down. The architecture must resolve this tension differently for different notification types: transactional notifications (OTPs, security alerts) need a fast, high-priority path while promotional campaigns can tolerate minutes of delay in exchange for guaranteed eventual delivery.

Step 2: Back-of-the-Envelope Estimation

Metric	Calculation	Result
Daily active users (DAU)	Assumed	50 million
Notifications per user per day	5 on average across push, email, and in-app	—
Total notifications per day	50M × 5	250 million/day
Average throughput	250M ÷ 86,400 seconds/day	~2,900 notifications/second
Peak throughput (3× average)	2,900 × 3 — a marketing campaign blast triggers a sudden spike	~8,700 notifications/second
Storage per notification log row	~500 bytes (IDs, status codes, timestamps, payload hash)	—
Log storage per day	250M × 500 bytes	~125 GB/day
Log storage per 30 days	125 GB × 30	~3.75 TB/month
Push workers needed at peak	1 worker handles ~100/sec (APNs HTTP/2 with ~5 concurrent in-flight requests per connection at 50ms round-trip each)	~90 push workers at peak; auto-scale on queue depth
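The arithmetic in the table can be reproduced with a short script. This is only a sketch of the estimate: every input is taken from the table above, and the constant names are illustrative.

```python
import math

# Inputs taken from the estimation table above.
DAU = 50_000_000
NOTIFICATIONS_PER_USER_PER_DAY = 5
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 3
BYTES_PER_LOG_ROW = 500
IN_FLIGHT_PER_WORKER = 5   # concurrent provider requests per push worker
ROUND_TRIP_MS = 50         # provider round-trip time per request

total_per_day = DAU * NOTIFICATIONS_PER_USER_PER_DAY
avg_per_second = total_per_day / SECONDS_PER_DAY
peak_per_second = avg_per_second * PEAK_MULTIPLIER
log_gb_per_day = total_per_day * BYTES_PER_LOG_ROW / 1e9
log_tb_per_30_days = log_gb_per_day * 30 / 1000
per_worker_per_second = IN_FLIGHT_PER_WORKER * 1000 / ROUND_TRIP_MS
push_workers_at_peak = math.ceil(peak_per_second / per_worker_per_second)

print(f"total/day:        {total_per_day:,}")
print(f"average/second:   {avg_per_second:,.0f}")
print(f"peak/second:      {peak_per_second:,.0f}")
print(f"log GB/day:       {log_gb_per_day:.0f}")
print(f"log TB/30 days:   {log_tb_per_30_days:.2f}")
print(f"push workers:     {push_workers_at_peak}")
```

Without intermediate rounding the worker count comes out at 87, which the table rounds up to roughly 90.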

The key insight from the math: 8,700 notifications/second is not uniformly distributed. Marketing campaigns generate sudden bursts — traffic can spike from near zero to tens of thousands per second in minutes. This load profile is hostile to any architecture that calls provider APIs synchronously in the request path. The message queue is not an optimization; it is the foundational requirement that makes the system possible.

Step 3: High-Level Design

The fundamental design principle is to decouple generation from delivery: the service that triggers a notification should not wait for it to be sent. An order service, an auth service, or a scheduler fires a notification request and immediately moves on. The slow, retry-heavy work of actually delivering it happens asynchronously, downstream, completely invisible to the caller.

What each component does:

Notification Service — The single entry point for all notification requests. It validates the request, fetches user contact info (device tokens, email address, phone number), queries the User Preference Service to check for opt-outs and DND windows, then publishes an event to the appropriate Kafka topic. It returns 202 Accepted immediately — delivery is asynchronous.
User Preference Service — Answers the question: "Should this notification be sent to this user on this channel?" It stores per-user, per-channel opt-in/out flags, Do-Not-Disturb windows, and per-category rate limit counters backed by Redis.
Message Broker (Kafka) — The backbone of reliability. Kafka is a distributed, append-only log: events are written to disk and retained for a configured retention period, whether or not they have been read. Each consumer group tracks its own progress by committing offsets. Workers are organized into consumer groups, so if a worker crashes before committing an offset, its partitions are reassigned to another worker in the same group, which resumes from the last committed offset and reprocesses the uncommitted messages. A separate Kafka topic per channel means an email provider outage cannot block SMS delivery.
Channel Workers — Independently scaled consumer groups, one per channel. Each worker reads an event, renders the message template with user-specific variables, calls the third-party provider API, and writes the result to the event log.
Notification Event Log (Cassandra) — An append-only audit table that records every delivery attempt with its status, provider, and timestamps. Cassandra is a distributed database optimized for high write throughput: a cluster can sustain hundreds of thousands of writes per second by spreading partitions across many nodes. This event log powers debugging ("why didn't User A receive their OTP?"), analytics, and delivery confirmation.
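The preference check performed before enqueueing can be sketched as follows. The function names and the three outcomes are illustrative, and letting critical notifications bypass Do-Not-Disturb is an assumption of this sketch rather than something the design above specifies.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_dnd_window(now_utc: datetime, dnd_start: time, dnd_end: time,
                  user_tz: tzinfo) -> bool:
    """Return True if the user's local time falls inside their DND window."""
    local = now_utc.astimezone(user_tz).time()
    if dnd_start <= dnd_end:
        # Window contained in a single day, e.g. 13:00 to 15:00.
        return dnd_start <= local < dnd_end
    # Window that crosses midnight, e.g. 22:00 to 07:00.
    return local >= dnd_start or local < dnd_end


def decide(is_enabled: bool, in_dnd: bool, is_critical: bool = False) -> str:
    """Return 'suppress', 'delay', or 'send' for one user and channel."""
    if not is_enabled:
        return "suppress"          # user opted out of this type on this channel
    if in_dnd and not is_critical:
        return "delay"             # hold until the next allowed window
    return "send"                  # assumption: critical alerts bypass DND


# Example: 08:00 UTC is 03:00 for a user at UTC-5, inside a 22:00-07:00 window.
user_tz = timezone(timedelta(hours=-5))  # a zoneinfo.ZoneInfo would also work
night = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
print(decide(True, in_dnd_window(night, time(22, 0), time(7, 0), user_tz)))
```

Handling the midnight-crossing case explicitly matters because most quiet-hours windows start in the evening and end the next morning.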

API Design

Endpoint	Method	Request Body	Response
`POST /api/v1/notifications`	POST	`{ request_id, template_id, channels[], recipient: { user_id }, variables, priority }`	`202 Accepted` — `{ notification_id, status: 'queued' }`
`GET /api/v1/notifications/{id}`	GET	—	`{ notification_id, status, channel, provider, sent_at, delivered_at }`
`PUT /api/v1/users/{id}/preferences`	PUT	`{ preferences[], dnd_start, dnd_end, timezone }`	`200 OK`
`POST /api/v1/users/{id}/devices`	POST	`{ token, platform: 'ios' | 'android' | 'web', app_version }`	`201 Created`

Why 202 Accepted and not 200 OK? The 202 status code means "the request was received and will be processed, but processing is not yet complete." This is the correct HTTP response for asynchronous operations. Returning 200 OK would falsely imply the notification was delivered at the moment the API responded.

Why request_id in the body? This is the caller's idempotency key. If the upstream service retries the request due to a network timeout, the same request_id ensures the notification is not enqueued and delivered twice. We explore idempotency in detail in Step 6.
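A minimal sketch of how the POST handler might combine the 202 response with request_id deduplication. The class, its in-memory dictionary and list (stand-ins for a shared store and the message broker), and the set of required fields are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("request_id", "template_id", "channels", "recipient")


@dataclass
class NotificationApi:
    # request_id -> notification_id; a stand-in for a shared store such as Redis.
    seen: dict = field(default_factory=dict)
    # A stand-in for the message broker.
    queue: list = field(default_factory=list)

    def post_notification(self, body: dict) -> tuple:
        """Return (http_status, response_body) for POST /api/v1/notifications."""
        missing = [name for name in REQUIRED_FIELDS if name not in body]
        if missing:
            return 400, {"error": "missing fields: " + ", ".join(missing)}

        request_id = body["request_id"]
        if request_id in self.seen:
            # A retry of an already accepted request: return the original ID
            # and enqueue nothing.
            return 202, {"notification_id": self.seen[request_id],
                         "status": "queued"}

        notification_id = str(uuid.uuid4())
        self.seen[request_id] = notification_id
        self.queue.append({"notification_id": notification_id, **body})
        return 202, {"notification_id": notification_id, "status": "queued"}
```

Returning the original notification_id on a retry lets the caller treat a timeout followed by a retry exactly like a single successful call.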

Database Schema

The system uses two databases with distinct roles: PostgreSQL for structured configuration and preference data, and Cassandra for the high-volume, append-only event log.

-- Device tokens for push notifications
CREATE TABLE device_tokens (
  id          BIGSERIAL PRIMARY KEY,
  user_id     BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  token       VARCHAR(512) NOT NULL,
  platform    VARCHAR(10) NOT NULL,   -- 'ios', 'android', 'web'
  is_active   BOOLEAN DEFAULT true,
  last_seen   TIMESTAMPTZ,
  UNIQUE(user_id, token)
);

-- Per-user, per-channel notification preferences
CREATE TABLE notification_preferences (
  user_id           BIGINT REFERENCES users(id) ON DELETE CASCADE,
  notification_type VARCHAR(100) NOT NULL, -- 'order_update', 'promo', 'security_alert'
  channel           VARCHAR(20) NOT NULL,  -- 'push', 'email', 'sms', 'in_app'
  is_enabled        BOOLEAN DEFAULT true,
  PRIMARY KEY (user_id, notification_type, channel)
);

-- Reusable message templates with variable substitution
CREATE TABLE notification_templates (
  id            BIGSERIAL PRIMARY KEY,
  name          VARCHAR(100) UNIQUE NOT NULL,
  channel       VARCHAR(20) NOT NULL,
  subject       VARCHAR(255),             -- email subject line only
  body_template TEXT NOT NULL,            -- "Hi {{first_name}}, your order {{order_id}} shipped!"
  type          VARCHAR(50) NOT NULL,     -- 'transactional', 'promotional', 'security'
  version       INT DEFAULT 1,
  is_active     BOOLEAN DEFAULT true
);

The Cassandra event log stores one row per delivery attempt:

notification_events (Cassandra):
  notification_id   UUID       -- partition key
  user_id           BIGINT     -- matches users.id in PostgreSQL; secondary index for per-user queries
  channel           TEXT       -- 'push', 'email', 'sms', 'in_app'
  status            TEXT       -- 'queued', 'pending', 'sent', 'delivered', 'failed', 'suppressed'
  provider          TEXT       -- 'apns', 'fcm', 'sendgrid', 'twilio'
  provider_msg_id   TEXT       -- the provider's own message ID (for deduplication)
  idempotency_key   TEXT
  created_at        TIMESTAMP  -- clustering column, so each delivery attempt is stored as its own row
  error_code        TEXT       -- '410', '429', '500', etc.

Why Cassandra for the event log? The event log is append-only and write-heavy — it grows continuously at 125 GB/day. Cassandra is designed for exactly this pattern: high write throughput with horizontal scaling by partition key (the field that determines which node stores a given row). PostgreSQL can store this data, but at this write volume it would require significant resources that would compete with the latency-sensitive preference and device queries that need fast reads. Cassandra lets us scale writes independently of the transactional PostgreSQL workload.

Step 4: Deep Dive — The Message Queue Pipeline

The message broker is what separates a system that tries to deliver notifications from one that guarantees delivery attempts. This section explains how the pipeline works and why every design decision matters.

The Notification Delivery Pipeline

Each notification travels through a queue before reaching a worker, which renders the template and calls the provider. The queue is the durability guarantee: messages survive worker crashes, provider outages, and deployment restarts. Each channel gets its own Kafka topic so a failure in one channel cannot block another.
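One way to picture the per-channel isolation is a routing function that expands a request into one event per channel and picks a topic for each. The topic naming scheme below is hypothetical; the design only requires that each channel and priority tier has its own topic.

```python
CHANNELS = ("push", "email", "sms", "in_app")
PRIORITIES = ("critical", "promotional")


def topic_for(channel: str, priority: str) -> str:
    """Build a topic name; the 'notifications.<channel>.<priority>' scheme is illustrative."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"notifications.{channel}.{priority}"


def route(request: dict) -> list:
    """Expand one API request into one (topic, event) pair per requested channel."""
    pairs = []
    for channel in request["channels"]:
        event = {k: v for k, v in request.items() if k != "channels"}
        event["channel"] = channel
        pairs.append((topic_for(channel, request["priority"]), event))
    return pairs


for topic, event in route({"notification_id": "n1",
                           "channels": ["push", "email"],
                           "priority": "critical"}):
    print(topic, event["channel"])
```

Because each pair lands on a different topic, a stalled email consumer group leaves the push event untouched.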

Template Rendering

Templates decouple message content from sending logic. A worker receiving { template_id: 'order_shipped_push', variables: { first_name: 'Alice', order_id: 'ORD-123' } } follows these steps:

Fetch the template from Redis cache — Redis is an in-memory key-value store that returns data in under a millisecond, far faster than a database query. On a cache miss, fall back to the database and populate the cache for subsequent requests.
Fetch the user's locale and timezone
Select the matching locale variant if available (en-US, fr-FR, ja-JP)
Substitute variables: "Hi {{first_name}}, your order {{order_id}} shipped!" → "Hi Alice, your order ORD-123 shipped!"
Format for channel: trim to ≤160 characters for SMS, build full HTML for email, validate ≤4 KB for push
Send to the provider
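The substitution and channel-formatting steps above can be sketched in a few lines. This is a simplified illustration: it skips the cache and locale lookups, the function names are hypothetical, and the push check measures only the rendered text rather than the full provider payload.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
SMS_MAX_CHARS = 160
PUSH_MAX_BYTES = 4 * 1024


def render(body_template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder, failing fast if a variable is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(body_template)
                      if name not in variables})
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), body_template)


def format_for_channel(text: str, channel: str) -> str:
    """Apply the per-channel size rules from the steps above."""
    if channel == "sms":
        return text[:SMS_MAX_CHARS]
    if channel == "push" and len(text.encode("utf-8")) > PUSH_MAX_BYTES:
        raise ValueError("push text exceeds 4 KB")
    return text


message = render("Hi {{first_name}}, your order {{order_id}} shipped!",
                 {"first_name": "Alice", "order_id": "ORD-123"})
print(format_for_channel(message, "push"))
```

Raising on a missing variable, instead of leaving the raw placeholder in place, is what keeps a message like "Hi {{first_name}}" from ever reaching a user.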

Storing templates in the database — rather than hard-coding them in application code — lets content teams update message copy, add locale variants, and run A/B tests without a code deployment or engineering review. Templates also carry a version field so you can roll back to a known-good version immediately if a change causes unexpected rendering behavior in production.

In-App Notifications and WebSocket Delivery

Push, email, and SMS all deliver notifications to users who are outside the app. In-app notifications work differently — they are delivered in real time to users who are actively using the app right now, and they use WebSockets instead of a third-party provider.

Why WebSocket and not HTTP polling? With regular HTTP, the client must ask ("is there anything new?") and the server responds — the client has to keep asking on a timer (polling). WebSocket flips this: it establishes a persistent, bidirectional connection, so the server can push data to the client at any moment without waiting to be asked. For real-time notifications that should appear within a second of the triggering event, WebSocket is the right tool.

How in-app delivery works:

When a user opens the app, the client establishes a WebSocket connection to a WebSocket server instance.
The WebSocket server records the active connection in Redis: SET ws:user:{user_id} {server_id}. This is necessary because you run multiple WebSocket server instances in production — a notification event needs to reach the specific instance that holds a given user's connection.
When the In-App Worker receives a notification event from Kafka, it looks up ws:user:{user_id} in Redis. If the key exists, the user is online — the worker forwards the event to the correct WebSocket server, which delivers it in real time.
If the key does not exist, the user is offline — the notification is stored in the event log with status pending. When the user opens the app, the client fetches recent pending notifications via a standard REST API call, which is the fan-out on read pattern discussed in Step 7.

This two-path design (real-time WebSocket for active users, REST fetch for returning users) means in-app notifications work correctly regardless of whether the user is currently in the app.
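The two paths can be illustrated with an in-memory model. The class below is a sketch only: a dictionary stands in for the Redis connection registry, lists stand in for the WebSocket servers and the event log, and all method names are hypothetical.

```python
class InAppRouter:
    def __init__(self):
        self.connections = {}  # stand-in for Redis: "ws:user:{user_id}" -> server_id
        self.forwarded = []    # (server_id, event) pairs handed to WebSocket servers
        self.pending = []      # events stored for a later REST fetch

    def on_connect(self, user_id, server_id):
        self.connections[f"ws:user:{user_id}"] = server_id

    def on_disconnect(self, user_id):
        self.connections.pop(f"ws:user:{user_id}", None)

    def deliver(self, event: dict) -> str:
        """Forward to the user's WebSocket server if online, else store as pending."""
        server_id = self.connections.get(f"ws:user:{event['user_id']}")
        if server_id is not None:
            self.forwarded.append((server_id, event))
            return "forwarded"
        self.pending.append({**event, "status": "pending"})
        return "pending"

    def fetch_pending(self, user_id) -> list:
        """What the REST endpoint would return when the user reopens the app."""
        mine = [e for e in self.pending if e["user_id"] == user_id]
        self.pending = [e for e in self.pending if e["user_id"] != user_id]
        return mine
```

In a real deployment the registry entry would likely need an expiry or heartbeat so that a crashed WebSocket server does not leave users marked as online; that detail is outside this sketch.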

Step 5: Deep Dive — Push Notifications and Why They Fail

This section answers the question at the heart of this case study: why do some notifications arrive late, or not at all?

Push Notification Delivery: APNs and FCM

Push notifications travel through a two-hop path: your server calls APNs (Apple) or FCM (Google), which then delivers to the device. Your server never has a direct connection to the device. This indirection enables offline delivery — but it also introduces failure modes that are invisible until you know where to look.

Why Notifications Fail: The Complete Failure Map

Understanding failure modes is what separates a system that works in testing from one that holds up in production.

Failure Mode	Root Cause	Provider Signal	Fix
Stale device token	App was uninstalled or the device was reset	APNs: `410 Gone`. FCM: `NotRegistered`	Delete the token from `device_tokens` immediately — do not retry with the same token
Provider rate limiting	Sending requests faster than the provider's per-project quota allows	HTTP `429 Too Many Requests`	Exponential backoff with jitter; use separate API keys for critical vs. promotional traffic to protect critical quota
Provider outage	APNs, SendGrid, or Twilio is experiencing an incident	HTTP `5xx` errors on provider calls	Circuit breaker: stop sending after N consecutive failures, route to a backup provider (e.g., switch from SendGrid to Mailgun), probe periodically for recovery
Oversized payload	Push notification payload exceeds the 4 KB limit	APNs `413` with `PayloadTooLarge`	Validate payload size before enqueueing; send a 'content-available' push that tells the app to fetch the full content from your API
Worker crash mid-send	The worker process dies after calling the provider but before ACKing Kafka	Kafka re-delivers the message to another worker	Idempotency key prevents the duplicate delivery from reaching the user (covered in Step 6)
Device offline	User's device has no network connection	Provider returns 200 (message queued)	APNs and FCM handle this automatically; set an expiry time so time-sensitive notifications are discarded if not delivered within a window
DND / quiet hours	Notification arrives at 3am in the user's local timezone	Notification delivered but user missed it	Check user timezone and DND preferences before enqueueing; delay to the next allowed window rather than sending and being ignored
Template variable missing	A required variable like `{{order_id}}` is absent from the payload	Message renders as raw placeholder text	Validate that all required template variables are present at the API layer — fail fast with a 400 error rather than enqueuing a broken message
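A worker has to turn each provider response into one of the fixes in the table. The mapping below is a sketch of that decision: the Action names are hypothetical, and sending other non-retryable client errors straight to the Dead Letter Queue is an assumption of this example.

```python
from enum import Enum


class Action(Enum):
    DONE = "done"
    DELETE_TOKEN = "delete_token"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    RETRY_AND_COUNT_FAILURE = "retry_and_count_failure"  # feeds the circuit breaker
    DEAD_LETTER = "dead_letter"


def classify(status_code: int, reason: str = "") -> Action:
    """Map a provider response to the next step for the worker."""
    if 200 <= status_code < 300:
        return Action.DONE
    if status_code == 410 or reason == "NotRegistered":
        # Stale device token: retrying with the same token can never succeed.
        return Action.DELETE_TOKEN
    if status_code == 429:
        return Action.RETRY_WITH_BACKOFF
    if status_code >= 500:
        # Provider-side failure: retry, and count it toward the circuit breaker.
        return Action.RETRY_AND_COUNT_FAILURE
    # Any other client error is not fixed by retrying the same request.
    return Action.DEAD_LETTER


for code, reason in [(200, ""), (410, ""), (429, ""), (503, ""), (400, "BadRequest")]:
    print(code, classify(code, reason).value)
```

Separating "retry" from "never retry" at this point is what keeps stale tokens and malformed payloads from consuming the retry budget.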

Step 6: Deep Dive — Reliability: At-Least-Once Delivery and Idempotency

The message broker's guarantee — at-least-once delivery — is both the system's greatest reliability feature and its most dangerous source of bugs if unhandled.

At-least-once delivery means a message counts as unprocessed until a consumer explicitly acknowledges it (in Kafka, by committing its offset). If a worker crashes between sending a notification and acknowledging the Kafka message, the broker treats the message as unprocessed and re-delivers it to another available worker. This is correct behavior for reliability, but it means the same notification could be processed more than once, producing a duplicate delivery to the user. This is the core tension: the very mechanism that prevents silent drops also introduces the risk of duplicates.

Idempotency: Preventing Duplicate Notifications

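The idempotency key pattern uses Redis as a short-lived deduplication checkpoint. Before calling any provider, the worker performs an atomic check-and-set: using Redis's SET command with the NX (set if not exists) option and an expiry, it either claims the notification ID as its own or discovers that another worker already handled it. If the notification has already been sent, the worker skips the provider call and acknowledges the Kafka message silently. If the worker crashes after the provider call but before acknowledging, Redis still holds the 'sent' marker, so the re-delivered event is safely suppressed on the next attempt. One gap remains: a worker that crashes after claiming the key but before calling the provider leaves a marker for a notification that was never sent. A common mitigation is to write the claim as an 'in-progress' marker with a short expiry and upgrade it to 'sent' only after the provider call succeeds, so that an abandoned claim expires and the retry goes through.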
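The check-and-set can be sketched without a running Redis server. FakeRedis below is an in-memory stand-in that mimics only the `nx` and `ex` arguments of redis-py's `Redis.set`; the key format, the TTL, and the `process` function are assumptions made for illustration.

```python
import time


class FakeRedis:
    """In-memory stand-in for the subset of SET used here (nx and ex only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] is not None and current[1] <= now:
            current = None  # the stored entry has expired
        if nx and current is not None:
            return None     # key already held; redis-py also returns None here
        expires_at = (now + ex) if ex is not None else None
        self._data[key] = (value, expires_at)
        return True


DEDUP_TTL_SECONDS = 24 * 3600  # assumed retention for the deduplication marker


def process(event: dict, store, send) -> str:
    """Send the notification unless another delivery already claimed its ID."""
    key = f"notif:sent:{event['notification_id']}"
    if not store.set(key, "1", nx=True, ex=DEDUP_TTL_SECONDS):
        return "duplicate_skipped"  # acknowledge the message without sending
    send(event)
    return "sent"


sent = []
store = FakeRedis()
event = {"notification_id": "n-123", "channel": "push"}
print(process(event, store, sent.append))  # first delivery
print(process(event, store, sent.append))  # broker re-delivery of the same event
```

This version claims the key before sending, so it favors suppressing duplicates over guaranteeing a send after a crash between the claim and the provider call.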

The Dead Letter Queue (DLQ)

After a configurable number of retries (typically 3–5), a notification that keeps failing is moved to a Dead Letter Queue (DLQ) — a separate Kafka topic or queue where permanently failed messages are parked for human inspection.

The critical rule: never silently discard a failed notification. A dropped notification has no audit trail and cannot be replayed. A DLQ message carries the full original payload, can be inspected by an on-call engineer to diagnose the root cause, and can be manually replayed once the underlying issue is resolved. The difference between routing to a DLQ and silently deleting the message is the difference between a diagnosable incident and an unexplained gap in the delivery logs.
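Retry scheduling and DLQ routing fit in one small function. The sketch below assumes the "full jitter" variant of exponential backoff, a 1-second base delay, and a 60-second cap; the queue lists and field names are hypothetical.

```python
import random

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 60.0


def backoff_delay(retry_number: int, rng=random) -> float:
    """Full jitter: a uniform delay between 0 and the capped exponential ceiling."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (retry_number - 1))
    return rng.uniform(0, ceiling)


def handle_failure(message: dict, retry_queue: list, dead_letter_queue: list) -> str:
    """Schedule another retry, or park the message in the DLQ once retries run out."""
    failures = message.get("failures", 0) + 1
    updated = {**message, "failures": failures}
    if failures > MAX_RETRIES:
        # Keep the full payload so an engineer can inspect and replay it.
        dead_letter_queue.append(updated)
        return "dead_lettered"
    updated["retry_after_seconds"] = backoff_delay(failures)
    retry_queue.append(updated)
    return "retry_scheduled"


retry_queue, dead_letter_queue = [], []
message = {"notification_id": "n-1", "channel": "email"}
for _ in range(MAX_RETRIES + 1):
    outcome = handle_failure(message, retry_queue, dead_letter_queue)
    if outcome == "retry_scheduled":
        message = retry_queue[-1]
print(outcome, len(retry_queue), len(dead_letter_queue))
```

The random component spreads retries out, so workers that failed at the same moment do not all hit a recovering provider at the same instant.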

Step 7: Deep Dive — Fan-Out for Mass Notifications

The fan-out problem is: how do you efficiently deliver one notification event to millions of recipients? When an e-commerce platform sends a flash sale alert to all opted-in users, it may need to dispatch 10 million push notifications in under 5 minutes. The naive approach — query all users in a loop, send one by one — would not come close. With an average provider round-trip of 50 ms, a single thread sending 10 million notifications sequentially would take roughly 139 hours. Fan-out is about designing around this bottleneck.
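The same numbers also show how much concurrency the 5-minute target implies. This is a rough calculation using only figures from this case study; it ignores provider rate limits and retries.

```python
import math

RECIPIENTS = 10_000_000
ROUND_TRIP_MS = 50
DEADLINE_SECONDS = 5 * 60
IN_FLIGHT_PER_WORKER = 5  # per-worker concurrency from the Step 2 estimate

# One thread, one request at a time.
sequential_hours = RECIPIENTS * ROUND_TRIP_MS / 1000 / 3600

# Throughput and concurrency needed to finish within the deadline.
required_per_second = RECIPIENTS / DEADLINE_SECONDS
required_in_flight = math.ceil(RECIPIENTS * ROUND_TRIP_MS / (DEADLINE_SECONDS * 1000))
required_workers = math.ceil(required_in_flight / IN_FLIGHT_PER_WORKER)

print(f"sequential: {sequential_hours:.0f} hours")
print(f"required throughput: {required_per_second:,.0f}/second")
print(f"concurrent requests in flight: {required_in_flight:,}")
print(f"workers at {IN_FLIGHT_PER_WORKER} in-flight each: {required_workers}")
```

Under these assumptions the blast needs roughly 33,000 sends per second, which means on the order of 1,700 requests in flight at any moment.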

Fan-Out Strategies for Mass Notifications

Fan-out amplifies one notification event into millions of individual deliveries. The right strategy depends on two questions: do you need per-user delivery tracking, and what is the ratio of writes to reads? Production systems typically use all three patterns simultaneously for different channels and notification types: pub/sub for push blasts, fan-out on write for email and SMS, and fan-out on read for in-app notifications.
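As one illustration of fan-out on write, a campaign can be split into batch messages that workers expand into per-user sends in parallel. Batching, the batch size, and the message shape below are assumptions of this sketch; the case study does not prescribe them.

```python
from typing import Iterable, Iterator


def batches(user_ids: Iterable[int], batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size user IDs."""
    batch = []
    for user_id in user_ids:
        batch.append(user_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def fan_out_on_write(campaign_id: str, user_ids: Iterable[int],
                     batch_size: int = 1000) -> list:
    """Turn one campaign event into batch messages ready to be enqueued."""
    return [
        {"campaign_id": campaign_id, "batch_index": index, "user_ids": batch}
        for index, batch in enumerate(batches(user_ids, batch_size))
    ]


messages = fan_out_on_write("flash_sale", range(2500), batch_size=1000)
print([len(m["user_ids"]) for m in messages])
```

Each batch message is an independent unit of work, so a failed batch can be retried or dead-lettered without re-sending the rest of the campaign.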

Step 8: Trade-offs and Production Reality

Every design decision in this system involves a trade-off. Here is a consolidated view of the key choices and when you would decide differently.

Decision	Choice Made	What We Gave Up	Choose Differently When...
Delivery guarantee	At-least-once delivery + application-level idempotency keys	Exactly-once semantics, which eliminate duplicates at the broker layer but add transactional overhead and do not cover side effects outside the broker, such as the provider call	Never for notifications: at-least-once plus idempotency achieves the same user-visible result with far less complexity
Queue topology	One Kafka topic per channel; two priority tiers (critical vs. promotional)	Simpler single-queue setup with a channel type field	Start with a single queue if your team is new to Kafka; add channel isolation only when a channel failure has actually caused cross-channel impact in production
Fan-out strategy	Hybrid: pub/sub for push blasts, fan-out on write for email and SMS, fan-out on read for in-app	Operational simplicity of a single uniform strategy across all channels	Use fan-out on write for all channels if per-user delivery tracking is a regulatory requirement (e.g., financial notifications that must be audited)
Event log database	Cassandra for the append-only, high-write event log	SQL ACID guarantees and simpler ad-hoc querying for debugging	Use PostgreSQL for the event log if daily notification volume is under ~10 million/day; migrate when write throughput or storage growth becomes a bottleneck
Template storage	Database-backed templates with runtime updates	Compile-time type checking and validation of template variables (possible when templates live in code)	Keep templates in code if they are simple, change infrequently, and you want build-time guarantees that all variable references are valid
Provider redundancy	Circuit breaker with a single active provider per channel, failover to backup	Cost and maintenance of keeping a secondary provider account active and configured	Use an active-active multi-provider configuration for each channel if your SLA requires notifications to survive a primary provider outage without any manual intervention
Retry behavior	Exponential backoff with jitter; max 5 retries; move to DLQ on exhaustion	Faster initial re-delivery attempts (a flat 1-second retry interval would respond quicker after transient failures)	Reduce max retries to 3 for SMS where providers charge per attempt; increase the retry window for email where eventual delivery after an hour is better than no delivery at all

What this design intentionally excludes

Production notification systems at major companies include additional layers that are out of scope for this case study:

Engagement analytics — open rates, click-through rates, and unsubscribe rates via provider webhooks, requiring a separate analytics ingestion pipeline
A/B testing on templates — routing a percentage of users to template variant B and measuring engagement to determine the winning message copy
Notification digesting — grouping multiple notifications (e.g., "5 new messages") into a single summary to reduce notification fatigue and OS-level throttling
Compliance tooling — unsubscribe links in marketing emails (CAN-SPAM requires a working opt-out mechanism, and GDPR requires that withdrawing consent be as easy as giving it), consent audit trails, and data deletion cascades that remove notification history
Multi-region delivery — routing push requests to the geographically closest APNs or FCM endpoint to reduce round-trip latency for global users

Each of these adds real complexity, but none fundamentally changes the queue-and-worker architecture described here.

Summary

Component	Design Decision	Key Reasoning
Notification Service	Stateless API gateway: validates, enriches, enqueues, returns 202 Accepted immediately	Callers must never wait for delivery; 202 signals accepted-for-processing without implying the notification has been delivered
Message Broker	Kafka with one topic per channel and two priority tiers (critical and promotional)	Channel isolation prevents cross-channel failures; priority tiers protect OTP and security alerts from being delayed behind promotional campaign bursts
Channel Workers	Independent, auto-scaled consumer groups, one per channel	Worker failures are contained to one channel; each channel's worker count scales independently based on its queue depth
At-least-once + Idempotency	Kafka at-least-once delivery with a Redis idempotency key per notification	Guarantees that delivery attempts survive worker crashes without sending duplicate messages to users
Retry strategy	Exponential backoff with jitter; max 5 retries; Dead Letter Queue on exhaustion	Prevents thundering herd when a provider recovers from an outage; DLQ preserves all failed messages for diagnosis and manual replay
Fan-out	Hybrid: pub/sub for push blasts, fan-out on write (async) for email and SMS, fan-out on read for in-app	Pure pub/sub is cheapest but provides no per-user tracking; fan-out on write gives a complete delivery audit trail at the cost of write amplification
Event Log	Cassandra append-only table, one row per delivery attempt	Cassandra sustains high append throughput at 125 GB/day without impacting the latency of the transactional PostgreSQL queries for preferences and device tokens
Stale token pruning	Delete device tokens immediately on APNs 410 Gone or FCM NotRegistered	Stale tokens accumulate silently — a token that was valid yesterday may become invalid tonight when a user upgrades their phone. Pruning on first failure is the only reliable mechanism
Templates	Stored in database with variable substitution, locale variants, and versioning	Runtime updates without code deployments; content teams manage copy independently; version rollback is possible when a template change causes unexpected behavior

The notification system is the queue-and-worker pattern in one of its most practical forms. Every design decision here (the message broker topology, the idempotency key, the fan-out strategy, the Dead Letter Queue) reappears in payment processing pipelines, event-driven microservices, and asynchronous job systems. The specific providers (APNs, FCM, SendGrid, Twilio) are interchangeable; the architectural pattern is universal.

The central insight: the queue is the reliability guarantee. Without it, a provider outage means lost notifications. With it, notifications accumulate during the outage and drain when the provider recovers — invisibly to the triggering service, and mostly invisibly to the user. This is the answer to why some notifications arrive late: a temporary queue backlog, not a permanent failure. And the answer to why some never arrive at all: a stale device token, a silent rate limit, or an unhandled error that was discarded without a retry.
