Notification System
A notification system delivers timely alerts to users across multiple channels: a push notification on your phone's lock screen, an email confirming a purchase, or an SMS with a one-time login code. Behind these seemingly simple messages lies one of the most reliability-demanding problems in distributed systems.
You interact with notification systems every day: the "your order has shipped" email from Amazon, the "someone liked your post" badge on Instagram, the security code text from your bank. What makes these systems deceptively hard is a combination of three factors: scale (billions of notifications per day across a global user base), channel diversity (push, email, and SMS each have their own protocol, rate limits, and failure modes), and reliability requirements (important notifications must eventually arrive even when devices are offline, providers are down, or the sending service crashes mid-delivery).
This case study answers a question every developer eventually faces: why do some notifications arrive late, or not at all? The answer reveals the core architectural decisions behind every production notification system.
This case study follows the same framework as the others in this section:
- Clarify constraints (Steps 1–2) — What does the system do, and how much traffic must it handle?
- High-level design (Step 3) — What are the major components and how do they connect?
- Deep dives (Steps 4–7) — How do the trickiest parts actually work?
- Trade-offs (Step 8) — What did we give up, and when would we choose differently?
Step 1: Clarify Requirements
Before designing anything, pin down exactly what the system needs to do and how well it needs to do it.
Functional Requirements
These describe what the system does.
| Feature | Description | Priority |
|---|---|---|
| Send push notifications | Deliver alerts to iOS devices via Apple Push Notification service (APNs) and Android devices via Firebase Cloud Messaging (FCM) | Core |
| Send emails | Deliver transactional and marketing emails via providers like SendGrid | Core |
| Send SMS | Deliver text messages via providers like Twilio for OTP codes and alerts | Core |
| In-app notifications | Show real-time alerts inside the app via WebSockets when users are active | Core |
| User preferences | Let users opt in or out per notification type and per channel; enforce Do-Not-Disturb hours | Core |
| Delivery tracking | Record the status of every notification: queued, sent, delivered, or failed | Core |
| Notification templates | Store reusable message templates with variable substitution (Hi {{first_name}}) | Core |
| Rate limiting | Cap how many notifications a user receives per day or week, per channel | Core |
| Scheduled notifications | Send a notification at a specific future time, such as a daily digest at 9am local time | Optional |
Non-Functional Requirements
These describe how well the system works.
| Property | Requirement | Why It Matters |
|---|---|---|
| Reliability | At-least-once delivery: every notification must be attempted even if a server crashes mid-send | A missed security alert or password reset is a trust-breaking failure — silent drops are worse than delays |
| Latency | Transactional notifications (OTP, security alerts) delivered within seconds; promotional within minutes | Users abandon password reset flows if the SMS code takes more than 30 seconds to arrive |
| Scalability | Handle bursts from under 100 to over 100,000 notifications/second during marketing campaigns | A promotional blast to all users must not crash the system or delay critical alerts |
| Availability | 99.9%+ uptime on the notification pipeline; failures must be queued, not dropped | A downed notification pipeline means every downstream service loses its alerting capability simultaneously |
| Idempotency | Sending the same notification twice must not deliver it twice to the user | At-least-once delivery makes duplicates possible; idempotency prevents them from reaching the user |
| Observability | Full audit trail per notification: when it was sent, which provider was used, and what result was returned | Debugging 'I never received my OTP' requires a complete per-notification event log |
The central tension in this system: reliability and latency pull in opposite directions. The most reliable delivery mechanism — store in a queue, retry until success — introduces latency. The lowest-latency mechanism — a direct synchronous API call — has no retry and fails silently if the provider is down. The architecture must resolve this tension differently for different notification types: transactional notifications (OTPs, security alerts) need a fast, high-priority path while promotional campaigns can tolerate minutes of delay in exchange for guaranteed eventual delivery.
Step 2: Back-of-the-Envelope Estimation
| Metric | Calculation | Result |
|---|---|---|
| Daily active users (DAU) | Assumed | 50 million |
| Notifications per user per day | 5 on average across push, email, and in-app | — |
| Total notifications per day | 50M × 5 | 250 million/day |
| Average throughput | 250M ÷ 86,400 seconds/day | ~2,900 notifications/second |
| Peak throughput (3× average) | 2,900 × 3 — a marketing campaign blast triggers a sudden spike | ~8,700 notifications/second |
| Storage per notification log row | ~500 bytes (IDs, status codes, timestamps, payload hash) | — |
| Log storage per day | 250M × 500 bytes | ~125 GB/day |
| Log storage per 30 days | 125 GB × 30 | ~3.75 TB/month |
| Push workers needed at peak | 1 worker handles ~100/sec (APNs HTTP/2 with ~5 concurrent in-flight requests per connection at 50ms round-trip each) | ~90 push workers at peak; auto-scale on queue depth |
The key insight from the math: 8,700 notifications/second is not uniformly distributed. Marketing campaigns generate sudden bursts — traffic can spike from near zero to tens of thousands per second in minutes. This load profile is hostile to any architecture that calls provider APIs synchronously in the request path. The message queue is not an optimization; it is the foundational requirement that makes the system possible.
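The arithmetic in the table can be reproduced in a few lines. This is only a sketch of the estimate above; the function name `estimate` and the dictionary keys are made up for illustration, and every input is one of the assumed figures from the table.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate(dau=50_000_000, per_user_per_day=5, peak_multiplier=3,
             bytes_per_log_row=500, per_worker_rate=100):
    """Recompute the back-of-the-envelope figures from the assumed inputs."""
    total_per_day = dau * per_user_per_day
    avg_per_sec = total_per_day / SECONDS_PER_DAY
    peak_per_sec = avg_per_sec * peak_multiplier
    log_gb_per_day = total_per_day * bytes_per_log_row / 1e9
    return {
        "total_per_day": total_per_day,
        "avg_per_sec": avg_per_sec,
        "peak_per_sec": peak_per_sec,
        "log_gb_per_day": log_gb_per_day,
        "log_tb_per_30_days": log_gb_per_day * 30 / 1e3,
        # Workers needed if every peak notification were a push (upper bound).
        "push_workers_at_peak": math.ceil(peak_per_sec / per_worker_rate),
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

The exact worker count comes out slightly under the table's rounded "~90", which is expected: the table rounds the peak rate up to 8,700 before dividing.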
Step 3: High-Level Design
The fundamental design principle is to decouple generation from delivery: the service that triggers a notification should not wait for it to be sent. An order service, an auth service, or a scheduler fires a notification request and immediately moves on. The slow, retry-heavy work of actually delivering it happens asynchronously, downstream, completely invisible to the caller.
What each component does:
- Notification Service — The single entry point for all notification requests. It validates the request, fetches user contact info (device tokens, email address, phone number), queries the User Preference Service to check for opt-outs and DND windows, then publishes an event to the appropriate Kafka topic. It returns 202 Accepted immediately; delivery is asynchronous.
- User Preference Service — Answers the question: "Should this notification be sent to this user on this channel?" It stores per-user, per-channel opt-in/out flags, Do-Not-Disturb windows, and per-category rate limit counters backed by Redis.
- Message Broker (Kafka) — The backbone of reliability. Kafka is a distributed, append-only log: events are written to disk and retained for a configured retention period, and each consumer group tracks its progress by committing offsets. Workers are organized into consumer groups, so if a worker crashes mid-processing, its partitions are reassigned to another worker in the same group, which resumes from the last committed offset and reprocesses anything that was not yet committed. A separate Kafka topic per channel means an email provider outage cannot block SMS delivery.
- Channel Workers — Independently scaled consumer groups, one per channel. Each worker reads an event, renders the message template with user-specific variables, calls the third-party provider API, and writes the result to the event log.
- Notification Event Log (Cassandra) — An append-only audit table that records every delivery attempt with its status, provider, and timestamps. Cassandra is a distributed database optimized for high write throughput — it can sustain hundreds of thousands of writes per second across many nodes without degrading read performance. This event log powers debugging ("why didn't User A receive their OTP?"), analytics, and delivery confirmation.
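The Do-Not-Disturb check performed by the User Preference Service is easy to get wrong when the quiet window crosses midnight. The following is a minimal sketch, assuming the window is stored as two local times plus the user's timezone; the function name `in_dnd_window` is hypothetical and not part of any described API.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_dnd_window(now_utc: datetime, user_tz: tzinfo,
                  dnd_start: time, dnd_end: time) -> bool:
    """Return True if `now_utc` falls inside the user's local quiet hours."""
    local = now_utc.astimezone(user_tz).time()
    if dnd_start == dnd_end:
        return False  # treat an empty window as "no DND configured"
    if dnd_start < dnd_end:
        # Window within a single day, e.g. 13:00-15:00.
        return dnd_start <= local < dnd_end
    # Window that wraps past midnight, e.g. 22:00-07:00.
    return local >= dnd_start or local < dnd_end


if __name__ == "__main__":
    eastern = timezone(timedelta(hours=-5))  # fixed offset, for the demo only
    three_am_local = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
    print(in_dnd_window(three_am_local, eastern, time(22, 0), time(7, 0)))
```

In practice the `user_tz` argument would more likely be a `zoneinfo.ZoneInfo` built from the stored timezone name, so that daylight-saving changes are handled; the fixed offset here just keeps the demo independent of the system's timezone database.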
API Design
| Endpoint | Method | Request Body | Response |
|---|---|---|---|
| POST /api/v1/notifications | POST | { request_id, template_id, channels[], recipient: { user_id }, variables, priority } | 202 Accepted: { notification_id, status: 'queued' } |
| GET /api/v1/notifications/{id} | GET | — | { notification_id, status, channel, provider, sent_at, delivered_at } |
| PUT /api/v1/users/{id}/preferences | PUT | { preferences[], dnd_start, dnd_end, timezone } | 200 OK |
| POST /api/v1/users/{id}/devices | POST | { token, platform: 'ios' \| 'android' \| 'web', app_version } | 201 Created |
Why 202 Accepted and not 200 OK? The 202 status code means "the request was received and will be processed, but processing is not yet complete." This is the correct HTTP response for asynchronous operations. Returning 200 OK would falsely imply the notification was delivered at the moment the API responded.
Why request_id in the body? This is the caller's idempotency key. If the upstream service retries the request due to a network timeout, the same request_id ensures the notification is not enqueued and delivered twice. We explore idempotency in detail in Step 6.
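To make the 202 and request_id behavior concrete, here is a minimal in-memory sketch of the enqueue endpoint. The class `NotificationApi`, its `seen` map, and the list standing in for Kafka are all assumptions made for illustration; a real service would keep the request_id mapping in a shared store and publish to the broker.

```python
import uuid
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("request_id", "template_id", "channels", "recipient")


@dataclass
class NotificationApi:
    seen: dict = field(default_factory=dict)   # request_id -> notification_id
    queue: list = field(default_factory=list)  # stand-in for the Kafka topics

    def post_notification(self, body: dict) -> tuple:
        """Handle POST /api/v1/notifications; returns (status_code, response)."""
        for name in REQUIRED_FIELDS:
            if name not in body:
                return 400, {"error": f"missing field: {name}"}

        request_id = body["request_id"]
        if request_id in self.seen:
            # Caller retry: answer with the original ID, enqueue nothing new.
            return 202, {"notification_id": self.seen[request_id],
                         "status": "queued"}

        notification_id = str(uuid.uuid4())
        self.seen[request_id] = notification_id
        for channel in body["channels"]:
            self.queue.append({"notification_id": notification_id,
                               "channel": channel,
                               "template_id": body["template_id"],
                               "user_id": body["recipient"]["user_id"],
                               "variables": body.get("variables", {})})
        return 202, {"notification_id": notification_id, "status": "queued"}
```

A retried request with the same request_id gets the same notification_id back and produces no additional queue entries, which is the property the caller relies on.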
Database Schema
The system uses two databases with distinct roles: PostgreSQL for structured configuration and preference data, and Cassandra for the high-volume, append-only event log.
-- Device tokens for push notifications
CREATE TABLE device_tokens (
    id         BIGSERIAL PRIMARY KEY,
    user_id    BIGINT REFERENCES users(id) ON DELETE CASCADE,
    token      VARCHAR(512) NOT NULL,
    platform   VARCHAR(10) NOT NULL,   -- 'ios', 'android', 'web'
    is_active  BOOLEAN DEFAULT true,
    last_seen  TIMESTAMPTZ,
    UNIQUE (user_id, token)
);
-- Per-user, per-channel notification preferences
CREATE TABLE notification_preferences (
    user_id            BIGINT REFERENCES users(id) ON DELETE CASCADE,
    notification_type  VARCHAR(100) NOT NULL,  -- 'order_update', 'promo', 'security_alert'
    channel            VARCHAR(20) NOT NULL,   -- 'push', 'email', 'sms', 'in_app'
    is_enabled         BOOLEAN DEFAULT true,
    PRIMARY KEY (user_id, notification_type, channel)
);
-- Reusable message templates with variable substitution
CREATE TABLE notification_templates (
    id             BIGSERIAL PRIMARY KEY,
    name           VARCHAR(100) UNIQUE NOT NULL,
    channel        VARCHAR(20) NOT NULL,
    subject        VARCHAR(255),           -- email subject line only
    body_template  TEXT NOT NULL,          -- "Hi {{first_name}}, your order {{order_id}} shipped!"
    type           VARCHAR(50) NOT NULL,   -- 'transactional', 'promotional', 'security'
    version        INT DEFAULT 1,
    is_active      BOOLEAN DEFAULT true
);
The Cassandra event log stores one row per delivery attempt:
notification_events (Cassandra):
    notification_id  UUID       -- partition key
    user_id          BIGINT     -- matches users.id in PostgreSQL; secondary index for per-user queries
    channel          TEXT       -- 'push', 'email', 'sms'
    status           TEXT       -- 'queued', 'sent', 'delivered', 'failed', 'suppressed'
    provider         TEXT       -- 'apns', 'fcm', 'sendgrid', 'twilio'
    provider_msg_id  TEXT       -- the provider's own message ID (for deduplication)
    idempotency_key  TEXT
    created_at       TIMESTAMP  -- clustering column, so each delivery attempt is its own row
    error_code       TEXT       -- '410', '429', '500', etc.
Why Cassandra for the event log? The event log is append-only and write-heavy — it grows continuously at 125 GB/day. Cassandra is designed for exactly this pattern: high write throughput with horizontal scaling by partition key (the field that determines which node stores a given row). PostgreSQL can store this data, but at this write volume it would require significant resources that would compete with the latency-sensitive preference and device queries that need fast reads. Cassandra lets us scale writes independently of the transactional PostgreSQL workload.
Step 4: Deep Dive — The Message Queue Pipeline
The message broker is what separates a system that tries to deliver notifications from one that guarantees delivery attempts. This section explains how the pipeline works and why every design decision matters.
The Notification Delivery Pipeline
Each notification travels through a queue before reaching a worker, which renders the template and calls the provider. The queue is the durability guarantee: messages survive worker crashes, provider outages, and deployment restarts. Each channel gets its own Kafka topic so a failure in one channel cannot block another.
Template Rendering
Templates decouple message content from sending logic. A worker receiving { template_id: 'order_shipped_push', variables: { first_name: 'Alice', order_id: 'ORD-123' } } follows these steps:
- Fetch the template from Redis cache — Redis is an in-memory key-value store that returns data in under a millisecond, far faster than a database query. On a cache miss, fall back to the database and populate the cache for subsequent requests.
- Fetch the user's locale and timezone
- Select the matching locale variant if available (en-US, fr-FR, ja-JP)
- Substitute variables: "Hi {{first_name}}, your order {{order_id}} shipped!" → "Hi Alice, your order ORD-123 shipped!"
- Format for channel: trim to ≤160 characters for SMS, build full HTML for email, validate ≤4 KB for push
- Send to the provider
Storing templates in the database — rather than hard-coding them in application code — lets content teams update message copy, add locale variants, and run A/B tests without a code deployment or engineering review. Templates also carry a version field so you can roll back to a known-good version immediately if a change causes unexpected rendering behavior in production.
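The substitution and channel-formatting steps can be sketched as follows. This is an illustrative sketch rather than the renderer of any particular library: the names `render`, `format_for_channel`, and `MissingVariableError` are hypothetical, and the push check is simplified to the text size, whereas the real limit applies to the whole payload.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
SMS_MAX_CHARS = 160
PUSH_MAX_BYTES = 4096


class MissingVariableError(ValueError):
    """Raised when a template references a variable the caller did not send."""


def render(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder, failing fast if one is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in variables})
    if missing:
        raise MissingVariableError(
            "missing template variables: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


def format_for_channel(text: str, channel: str) -> str:
    """Apply the per-channel size rules from the steps above."""
    if channel == "sms":
        return text[:SMS_MAX_CHARS]
    if channel == "push" and len(text.encode("utf-8")) > PUSH_MAX_BYTES:
        raise ValueError("push payload exceeds 4 KB")
    return text


if __name__ == "__main__":
    body = "Hi {{first_name}}, your order {{order_id}} shipped!"
    print(render(body, {"first_name": "Alice", "order_id": "ORD-123"}))
```

Raising on a missing variable, instead of leaving the raw placeholder in place, is what lets the API layer reject a bad request before it is ever enqueued.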
In-App Notifications and WebSocket Delivery
Push, email, and SMS all deliver notifications to users who are outside the app. In-app notifications work differently — they are delivered in real time to users who are actively using the app right now, and they use WebSockets instead of a third-party provider.
Why WebSocket and not HTTP polling? With regular HTTP, the client must ask ("is there anything new?") and the server responds — the client has to keep asking on a timer (polling). WebSocket flips this: it establishes a persistent, bidirectional connection, so the server can push data to the client at any moment without waiting to be asked. For real-time notifications that should appear within a second of the triggering event, WebSocket is the right tool.
How in-app delivery works:
- When a user opens the app, the client establishes a WebSocket connection to a WebSocket server instance.
- The WebSocket server records the active connection in Redis: SET ws:user:{user_id} {server_id}. This is necessary because you run multiple WebSocket server instances in production, so a notification event needs to reach the specific instance that holds a given user's connection.
- When the In-App Worker receives a notification event from Kafka, it looks up ws:user:{user_id} in Redis. If the key exists, the user is online: the worker forwards the event to the correct WebSocket server, which delivers it in real time.
- If the key does not exist, the user is offline: the notification is stored in the event log with status pending. When the user opens the app, the client fetches recent pending notifications via a standard REST API call, which is the fan-out on read pattern discussed in Step 7.
This two-path design (real-time WebSocket for active users, REST fetch for returning users) means in-app notifications work correctly regardless of whether the user is currently in the app.
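The routing decision made by the In-App Worker can be sketched with plain dictionaries standing in for Redis, the WebSocket servers, and the pending store. All names here (`PresenceRegistry`, `route_in_app`, the `servers` and `pending` maps) are hypothetical and exist only to show the two paths.

```python
class PresenceRegistry:
    """In-memory stand-in for the Redis keys ws:user:{user_id} -> server_id."""

    def __init__(self):
        self._connections = {}

    def connect(self, user_id, server_id):
        self._connections[f"ws:user:{user_id}"] = server_id

    def disconnect(self, user_id):
        self._connections.pop(f"ws:user:{user_id}", None)

    def server_for(self, user_id):
        return self._connections.get(f"ws:user:{user_id}")


def route_in_app(event, presence, servers, pending):
    """Forward to the user's WebSocket server if online, else park as pending."""
    server_id = presence.server_for(event["user_id"])
    if server_id is not None and server_id in servers:
        servers[server_id].append(event)  # that server pushes it over the socket
        return "forwarded"
    pending.setdefault(event["user_id"], []).append(event)  # fetched later via REST
    return "pending"


if __name__ == "__main__":
    presence = PresenceRegistry()
    presence.connect(42, "ws-2")
    servers, pending = {"ws-1": [], "ws-2": []}, {}
    print(route_in_app({"user_id": 42, "body": "hi"}, presence, servers, pending))
    print(route_in_app({"user_id": 7, "body": "hi"}, presence, servers, pending))
```

One detail the sketch leaves out is key expiry: if a WebSocket server dies without removing its keys, the registry would keep pointing at it, so a real deployment would likely set a TTL on each key and refresh it while the connection is alive.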
Step 5: Deep Dive — Push Notifications and Why They Fail
This section answers the question at the heart of this case study: why do some notifications arrive late, or not at all?
Push Notification Delivery: APNs and FCM
Push notifications travel through a two-hop path: your server calls APNs (Apple) or FCM (Google), which then delivers to the device. Your server never has a direct connection to the device. This indirection enables offline delivery — but it also introduces failure modes that are invisible until you know where to look.
Why Notifications Fail: The Complete Failure Map
Understanding failure modes is what separates a system that works in testing from one that holds up in production.
| Failure Mode | Root Cause | Provider Signal | Fix |
|---|---|---|---|
| Stale device token | App was uninstalled or the device was reset | APNs: 410 Gone. FCM: UNREGISTERED (NotRegistered in the legacy API) | Delete the token from device_tokens immediately; do not retry with the same token |
| Provider rate limiting | Sending requests faster than the provider's per-project quota allows | HTTP 429 Too Many Requests | Exponential backoff with jitter; use separate API keys for critical vs. promotional traffic to protect critical quota |
| Provider outage | APNs, SendGrid, or Twilio is experiencing an incident | HTTP 5xx errors on provider calls | Circuit breaker: stop sending after N consecutive failures, route to a backup provider (e.g., switch from SendGrid to Mailgun), probe periodically for recovery |
| Oversized payload | Push notification payload exceeds the 4 KB limit | APNs 413 with PayloadTooLarge | Validate payload size before enqueueing; send a 'content-available' push that tells the app to fetch the full content from your API |
| Worker crash mid-send | The worker process dies after calling the provider but before ACKing Kafka | Kafka re-delivers the message to another worker | Idempotency key prevents the duplicate delivery from reaching the user (covered in Step 6) |
| Device offline | User's device has no network connection | Provider returns 200 (message queued) | APNs and FCM handle this automatically; set an expiry time so time-sensitive notifications are discarded if not delivered within a window |
| DND / quiet hours | Notification arrives at 3am in the user's local timezone | Notification delivered but user missed it | Check user timezone and DND preferences before enqueueing; delay to the next allowed window rather than sending and being ignored |
| Template variable missing | A required variable like {{order_id}} is absent from the payload | Message renders as raw placeholder text | Validate that all required template variables are present at the API layer — fail fast with a 400 error rather than enqueuing a broken message |
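The circuit breaker named in the provider-outage row can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class and function names are hypothetical, the thresholds are arbitrary, and the half-open state lets every caller probe once the timeout has elapsed, not just one.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: block calls until the recovery timeout has passed, then
        # let requests through again as probes.
        return self._clock() - self._opened_at >= self._recovery_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # (re)open and restart the timer


def choose_provider(primary_breaker, primary="sendgrid", backup="mailgun"):
    """Route to the backup provider while the primary's circuit is open."""
    return primary if primary_breaker.allow_request() else backup
```

A worker would call `record_failure()` on each 5xx response and `record_success()` on each 2xx response. While the circuit is open, traffic goes to the backup provider without waiting on timeouts from the failing one.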
Step 6: Deep Dive — Reliability: At-Least-Once Delivery and Idempotency
The message broker's guarantee — at-least-once delivery — is both the system's greatest reliability feature and its most dangerous source of bugs if unhandled.
At-least-once delivery means a message is not considered done until a consumer explicitly acknowledges it, which in Kafka means committing the message's offset. If a worker crashes between sending a notification and committing the offset, the broker treats the message as unprocessed and the next worker assigned to that partition receives it again. This is correct behavior for reliability, but it means the same notification could be processed more than once, producing a duplicate delivery to the user. This is the core tension: the very mechanism that prevents silent drops also introduces the risk of duplicates.
Idempotency: Preventing Duplicate Notifications
The idempotency key pattern uses Redis as a short-lived deduplication checkpoint. Before calling any provider, the worker performs an atomic check-and-set: using Redis's SET NX (set if not exists) command, it either claims the notification ID as its own or discovers another worker already handled it. If it has already been sent, the worker skips the provider call and acknowledges the Kafka message silently. If the worker crashes after the provider call but before acknowledging, Redis still holds the 'sent' marker, so the re-delivered event is safely suppressed on the next attempt.
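A minimal sketch of the check-and-set follows. It uses an in-memory class as a stand-in for Redis so that it runs without a server; `InMemoryStore.set_nx` is meant to mirror the behavior of Redis `SET key value NX EX ttl`, and the key format, the 24-hour TTL, and the function name `deliver_once` are assumptions made for illustration.

```python
import time


class InMemoryStore:
    """Stand-in for Redis: set_nx succeeds only if the key is absent or expired."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set_nx(self, key, value, ttl_seconds) -> bool:
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # another worker already claimed this key
        self._data[key] = (value, now + ttl_seconds)
        return True

    def delete(self, key):
        self._data.pop(key, None)


def deliver_once(store, notification_id, channel, send) -> str:
    """Call `send` at most once per (notification_id, channel) within the TTL."""
    key = f"notif:sent:{notification_id}:{channel}"
    if not store.set_nx(key, "sent", ttl_seconds=86_400):
        return "skipped_duplicate"  # re-delivered event: just acknowledge it
    try:
        send()
    except Exception:
        store.delete(key)  # release the claim so a retry can try again
        raise
    return "sent"
```

This sketch releases the claim when the provider call raises, but a hard crash between the claim and the provider call would leave the marker set with nothing sent. One way to narrow that window is to write a short-lived "processing" marker first and upgrade it to "sent" only after the provider responds.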
The Dead Letter Queue (DLQ)
After a configurable number of retries (typically 3–5), a notification that keeps failing is moved to a Dead Letter Queue (DLQ) — a separate Kafka topic or queue where permanently failed messages are parked for human inspection.
The critical rule: never silently discard a failed notification. A dropped notification has no audit trail and cannot be replayed. A DLQ message carries the full original payload, can be inspected by an on-call engineer to diagnose the root cause, and can be manually replayed once the underlying issue is resolved. The difference between routing to a DLQ and silently deleting the message is the difference between a diagnosable incident and an unexplained gap in the delivery logs.
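Retry exhaustion and the hand-off to the DLQ can be sketched together. This is an illustration under assumptions: the function names are hypothetical, the delay uses the "full jitter" variant of exponential backoff, the DLQ is a plain list, and the retries sleep in-process, where a real worker would more likely re-enqueue the message with a delay.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def process_with_retries(message, send, dlq, max_retries=5, sleep=time.sleep):
    """Try `send`, retry with backoff, and park the message in the DLQ on exhaustion."""
    last_error = None
    for attempt in range(max_retries + 1):  # one initial try plus max_retries
        try:
            send(message)
            return "sent"
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    # Never drop silently: keep the full payload and the last error for replay.
    dlq.append({"payload": message,
                "error": repr(last_error),
                "attempts": max_retries + 1})
    return "dead_lettered"
```

The jitter matters most after a provider outage: without it, every worker that failed at the same moment would retry at the same moment too.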
Step 7: Deep Dive — Fan-Out for Mass Notifications
The fan-out problem is: how do you efficiently deliver one notification event to millions of recipients? When an e-commerce platform sends a flash sale alert to all opted-in users, it may need to dispatch 10 million push notifications in under 5 minutes. The naive approach — query all users in a loop, send one by one — would not come close. With an average provider round-trip of 50 ms, a single thread sending 10 million notifications sequentially would take roughly 139 hours. Fan-out is about designing around this bottleneck.
Fan-Out Strategies for Mass Notifications
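Fan-out amplifies one notification event into millions of individual deliveries. Three patterns come up: fan-out on write (create one delivery job per recipient at send time), fan-out on read (store the event once and let each client fetch it the next time it opens the app), and pub/sub (publish once to a topic that subscribed devices receive, with no per-user record). The right strategy depends on two questions: do you need per-user delivery tracking, and what is the ratio of writes to reads? Production systems typically use all three patterns simultaneously for different channels and notification types.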
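The flash-sale numbers from this step can be checked directly, along with the batching that a fan-out-on-write job would use to turn one campaign into many queue messages. The function names and the batch size are illustrative assumptions; the 100 notifications/second per worker figure is the one assumed in Step 2.

```python
import math
from typing import Iterable, Iterator, List


def sequential_hours(recipients: int, round_trip_seconds: float) -> float:
    """Time for a single thread to send every notification one after another."""
    return recipients * round_trip_seconds / 3600


def required_workers(recipients: int, deadline_seconds: float,
                     per_worker_rate: float) -> int:
    """Workers needed to finish the blast within the deadline."""
    return math.ceil(recipients / deadline_seconds / per_worker_rate)


def batches(user_ids: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Split the recipient list into batch jobs, one queue message per batch."""
    batch: List[int] = []
    for user_id in user_ids:
        batch.append(user_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    print(round(sequential_hours(10_000_000, 0.05)))   # single-threaded hours
    print(required_workers(10_000_000, 300, 100))       # workers for 5 minutes
    print(sum(1 for _ in batches(range(10_000_000), 1_000)))  # batch jobs
```

Under these assumptions the 5-minute target needs a few hundred push workers running in parallel, which is why the fan-out job only enqueues batches and leaves the provider calls to the auto-scaled channel workers.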
Step 8: Trade-offs and Production Reality
Every design decision in this system involves a trade-off. Here is a consolidated view of the key choices and when you would decide differently.
| Decision | Choice Made | What We Gave Up | Choose Differently When... |
|---|---|---|---|
| Delivery guarantee | At-least-once delivery + application-level idempotency keys | Exactly-once semantics at the broker layer, which add transactional overhead and still do not cover side effects outside the broker, such as the call to a push or SMS provider | Never for notifications: at-least-once plus idempotency achieves the same user-visible result with far less complexity |
| Queue topology | One Kafka topic per channel; two priority tiers (critical vs. promotional) | Simpler single-queue setup with a channel type field | Start with a single queue if your team is new to Kafka; add channel isolation only when a channel failure has actually caused cross-channel impact in production |
| Fan-out strategy | Hybrid: pub/sub for push blasts, fan-out on write for email and SMS, fan-out on read for in-app | Operational simplicity of a single uniform strategy across all channels | Use fan-out on write for all channels if per-user delivery tracking is a regulatory requirement (e.g., financial notifications that must be audited) |
| Event log database | Cassandra for the append-only, high-write event log | SQL ACID guarantees and simpler ad-hoc querying for debugging | Use PostgreSQL for the event log if daily notification volume is under ~10 million/day; migrate when write throughput or storage growth becomes a bottleneck |
| Template storage | Database-backed templates with runtime updates | Compile-time type checking and validation of template variables (possible when templates live in code) | Keep templates in code if they are simple, change infrequently, and you want build-time guarantees that all variable references are valid |
| Provider redundancy | Circuit breaker with a single active provider per channel, failover to backup | Cost and maintenance of keeping a secondary provider account active and configured | Set up an active-active multi-provider setup for each channel if your SLA requires notifications to survive a primary provider outage without any manual intervention |
| Retry behavior | Exponential backoff with jitter; max 5 retries; move to DLQ on exhaustion | Faster initial re-delivery attempts (a flat 1-second retry interval would respond quicker after transient failures) | Reduce max retries to 3 for SMS where providers charge per attempt; increase the retry window for email where eventual delivery after an hour is better than no delivery at all |
What this design intentionally excludes
Production notification systems at major companies include additional layers that are out of scope for this case study:
- Engagement analytics — open rates, click-through rates, and unsubscribe rates via provider webhooks, requiring a separate analytics ingestion pipeline
- A/B testing on templates — routing a percentage of users to template variant B and measuring engagement to determine the winning message copy
- Notification digesting — grouping multiple notifications (e.g., "5 new messages") into a single summary to reduce notification fatigue and OS-level throttling
- Compliance tooling — one-click unsubscribe links in marketing emails (required by CAN-SPAM and GDPR), consent audit trails, and data deletion cascades that remove notification history
- Multi-region delivery — routing push requests to the geographically closest APNs or FCM endpoint to reduce round-trip latency for global users
Each of these adds real complexity, but none fundamentally changes the queue-and-worker architecture described here.
Summary
| Component | Design Decision | Key Reasoning |
|---|---|---|
| Notification Service | Stateless API gateway: validates, enriches, enqueues, returns 202 Accepted immediately | Callers must never wait for delivery; 202 signals accepted-for-processing without implying the notification has been delivered |
| Message Broker | Kafka with one topic per channel and two priority tiers (critical and promotional) | Channel isolation prevents cross-channel failures; priority tiers protect OTP and security alerts from being delayed behind promotional campaign bursts |
| Channel Workers | Independent, auto-scaled consumer groups, one per channel | Worker failures are contained to one channel; each channel's worker count scales independently based on its queue depth |
| At-least-once + Idempotency | Kafka at-least-once delivery with a Redis idempotency key per notification | Guarantees that delivery attempts survive worker crashes without sending duplicate messages to users |
| Retry strategy | Exponential backoff with jitter; max 5 retries; Dead Letter Queue on exhaustion | Prevents thundering herd when a provider recovers from an outage; DLQ preserves all failed messages for diagnosis and manual replay |
| Fan-out | Hybrid: pub/sub for push blasts, fan-out on write (async) for email and SMS | Pure pub/sub is cheapest but provides no per-user tracking; fan-out on write gives a complete delivery audit trail at the cost of write amplification |
| Event Log | Cassandra append-only table, one row per delivery attempt | Cassandra sustains high append throughput at 125 GB/day without impacting the latency of the transactional PostgreSQL queries for preferences and device tokens |
| Stale token pruning | Delete device tokens immediately on APNs 410 Gone or FCM UNREGISTERED (NotRegistered in the legacy API) | Stale tokens accumulate silently: a token that was valid yesterday may become invalid tonight when a user upgrades their phone. Pruning on first failure is the only reliable mechanism |
| Templates | Stored in database with variable substitution, locale variants, and versioning | Runtime updates without code deployments; content teams manage copy independently; version rollback is possible when a template change causes unexpected behavior |
The notification system is the queue-and-worker pattern in its most real-world form. Every design decision here — the message broker topology, the idempotency key, the fan-out strategy, the Dead Letter Queue — reappears in payment processing pipelines, event-driven microservices, and asynchronous job systems. The specific providers (APNs, FCM, SendGrid, Twilio) are interchangeable; the architectural pattern is universal.
The central insight: the queue is the reliability guarantee. Without it, a provider outage means lost notifications. With it, notifications accumulate during the outage and drain when the provider recovers — invisibly to the triggering service, and mostly invisibly to the user. This is the answer to why some notifications arrive late: a temporary queue backlog, not a permanent failure. And the answer to why some never arrive at all: a stale device token, a silent rate limit, or an unhandled error that was discarded without a retry.