Notification System

A notification system delivers timely alerts to users across multiple channels: a push notification on your phone's lock screen, an email confirming a purchase, or an SMS with a one-time login code. Behind these seemingly simple messages lies one of the most reliability-demanding problems in distributed systems.

You interact with notification systems every day: the "your order has shipped" email from Amazon, the "someone liked your post" badge on Instagram, the security code text from your bank. What makes these systems deceptively hard is a combination of three factors: scale (billions of notifications per day across a global user base), channel diversity (push, email, and SMS each have their own protocol, rate limits, and failure modes), and reliability requirements (important notifications must eventually arrive even when devices are offline, providers are down, or the sending service crashes mid-delivery).

This case study answers a question every developer eventually faces: why do some notifications arrive late, or not at all? The answer reveals the core architectural decisions behind every production notification system.

Like the other case studies in this section, this one follows the same framework:

  • Clarify constraints (Steps 1–2) — What does the system do, and how much traffic must it handle?
  • High-level design (Step 3) — What are the major components and how do they connect?
  • Deep dives (Steps 4–7) — How do the trickiest parts actually work?
  • Trade-offs (Step 8) — What did we give up, and when would we choose differently?

Step 1: Clarify Requirements

Before designing anything, pin down exactly what the system needs to do and how well it needs to do it.

Functional Requirements

These describe what the system does.

| Feature | Description | Priority |
| --- | --- | --- |
| Send push notifications | Deliver alerts to iOS devices via Apple Push Notification service (APNs) and Android devices via Firebase Cloud Messaging (FCM) | Core |
| Send emails | Deliver transactional and marketing emails via providers like SendGrid | Core |
| Send SMS | Deliver text messages via providers like Twilio for OTP codes and alerts | Core |
| In-app notifications | Show real-time alerts inside the app via WebSockets when users are active | Core |
| User preferences | Let users opt in or out per notification type and per channel; enforce Do-Not-Disturb hours | Core |
| Delivery tracking | Record the status of every notification: queued, sent, delivered, or failed | Core |
| Notification templates | Store reusable message templates with variable substitution (Hi {{first_name}}) | Core |
| Rate limiting | Cap how many notifications a user receives per day or week, per channel | Core |
| Scheduled notifications | Send a notification at a specific future time, such as a daily digest at 9am local time | Optional |

Non-Functional Requirements

These describe how well the system works.

| Property | Requirement | Why It Matters |
| --- | --- | --- |
| Reliability | At-least-once delivery: every notification must be attempted even if a server crashes mid-send | A missed security alert or password reset is a trust-breaking failure; silent drops are worse than delays |
| Latency | Transactional notifications (OTP, security alerts) delivered within seconds; promotional within minutes | Users abandon password reset flows if the SMS code takes more than 30 seconds to arrive |
| Scalability | Handle bursts from under 100 to over 100,000 notifications/second during marketing campaigns | A promotional blast to all users must not crash the system or delay critical alerts |
| Availability | 99.9%+ uptime on the notification pipeline; failures must be queued, not dropped | A downed notification pipeline means every downstream service loses its alerting capability simultaneously |
| Idempotency | Sending the same notification twice must not deliver it twice to the user | At-least-once delivery makes duplicates possible; idempotency prevents them from reaching the user |
| Observability | Full audit trail per notification: when it was sent, which provider was used, and what result was returned | Debugging 'I never received my OTP' requires a complete per-notification event log |

The central tension in this system: reliability and latency pull in opposite directions. The most reliable delivery mechanism — store in a queue, retry until success — introduces latency. The lowest-latency mechanism — a direct synchronous API call — has no retry and fails silently if the provider is down. The architecture must resolve this tension differently for different notification types: transactional notifications (OTPs, security alerts) need a fast, high-priority path while promotional campaigns can tolerate minutes of delay in exchange for guaranteed eventual delivery.

Step 2: Back-of-the-Envelope Estimation

| Metric | Calculation | Result |
| --- | --- | --- |
| Daily active users (DAU) | Assumed | 50 million |
| Notifications per user per day | | 5 on average across push, email, and in-app |
| Total notifications per day | 50M × 5 | 250 million/day |
| Average throughput | 250M ÷ 86,400 seconds/day | ~2,900 notifications/second |
| Peak throughput (3× average) | 2,900 × 3 (a marketing campaign blast triggers a sudden spike) | ~8,700 notifications/second |
| Storage per notification log row | | ~500 bytes (IDs, status codes, timestamps, payload hash) |
| Log storage per day | 250M × 500 bytes | ~125 GB/day |
| Log storage per 30 days | 125 GB × 30 | ~3.75 TB/month |
| Push workers needed at peak | 1 worker handles ~100/sec (APNs HTTP/2 with ~5 concurrent in-flight requests per connection at 50 ms round-trip each) | ~90 push workers at peak; auto-scale on queue depth |

The key insight from the math: 8,700 notifications/second is not uniformly distributed. Marketing campaigns generate sudden bursts — traffic can spike from near zero to tens of thousands per second in minutes. This load profile is hostile to any architecture that calls provider APIs synchronously in the request path. The message queue is not an optimization; it is the foundational requirement that makes the system possible.
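The estimates above can be recomputed in a few lines. This is only a sketch of the arithmetic; the constant names are illustrative, and the unrounded results differ slightly from the table, which rounds the average to 2,900 before multiplying.

```python
import math

# Inputs taken from the estimation table.
DAU = 50_000_000
NOTIFICATIONS_PER_USER = 5
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 3
BYTES_PER_LOG_ROW = 500
WORKER_RATE_PER_SEC = 100  # ~5 in-flight requests / 0.05 s round-trip

per_day = DAU * NOTIFICATIONS_PER_USER
avg_per_sec = per_day / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * PEAK_MULTIPLIER
log_gb_per_day = per_day * BYTES_PER_LOG_ROW / 1e9
log_tb_per_30_days = log_gb_per_day * 30 / 1000
push_workers_at_peak = math.ceil(peak_per_sec / WORKER_RATE_PER_SEC)

print(f"{per_day:,} notifications/day")
print(f"{avg_per_sec:,.0f}/s average, {peak_per_sec:,.0f}/s peak")
print(f"{log_gb_per_day:.0f} GB/day, {log_tb_per_30_days:.2f} TB per 30 days")
print(f"{push_workers_at_peak} push workers at peak")
```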

Step 3: High-Level Design

The fundamental design principle is to decouple generation from delivery: the service that triggers a notification should not wait for it to be sent. An order service, an auth service, or a scheduler fires a notification request and immediately moves on. The slow, retry-heavy work of actually delivering it happens asynchronously, downstream, completely invisible to the caller.


What each component does:

  • Notification Service — The single entry point for all notification requests. It validates the request, fetches user contact info (device tokens, email address, phone number), queries the User Preference Service to check for opt-outs and DND windows, then publishes an event to the appropriate Kafka topic. It returns 202 Accepted immediately — delivery is asynchronous.
  • User Preference Service — Answers the question: "Should this notification be sent to this user on this channel?" It stores per-user, per-channel opt-in/out flags, Do-Not-Disturb windows, and per-category rate limit counters backed by Redis.
  • Message Broker (Kafka) — The backbone of reliability. Kafka is a distributed, append-only log: events are written to disk and retained for a configured retention period, regardless of whether they have been read. Each consumer group tracks its own progress by committing offsets. Workers are organized into consumer groups, so if a worker crashes before committing, Kafka reassigns its partitions and another worker in the same group resumes from the last committed offset, reprocessing any uncommitted messages. A separate Kafka topic per channel means an email provider outage cannot block SMS delivery.
  • Channel Workers — Independently scaled consumer groups, one per channel. Each worker reads an event, renders the message template with user-specific variables, calls the third-party provider API, and writes the result to the event log.
  • Notification Event Log (Cassandra) — An append-only audit table that records every delivery attempt with its status, provider, and timestamps. Cassandra is a distributed database optimized for high write throughput — it can sustain hundreds of thousands of writes per second across many nodes without degrading read performance. This event log powers debugging ("why didn't User A receive their OTP?"), analytics, and delivery confirmation.
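The preference check performed before publishing can be sketched as a small pure function. This is a minimal illustration, not the service's actual logic: the `Preferences` class, the `decide` function, and the rule that critical notifications bypass Do-Not-Disturb are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class Preferences:
    # (notification_type, channel) pairs the user has opted out of
    opted_out: set = field(default_factory=set)
    dnd_start: Optional[time] = None  # user's local time, e.g. 22:00
    dnd_end: Optional[time] = None    # user's local time, e.g. 07:00


def in_dnd_window(prefs: Preferences, local_now: time) -> bool:
    if prefs.dnd_start is None or prefs.dnd_end is None:
        return False
    if prefs.dnd_start <= prefs.dnd_end:
        return prefs.dnd_start <= local_now < prefs.dnd_end
    # The window wraps past midnight (e.g. 22:00 to 07:00).
    return local_now >= prefs.dnd_start or local_now < prefs.dnd_end


def decide(prefs, notification_type, channel, local_now, critical=False):
    """Return 'suppress', 'delay', or 'send' for one user and channel."""
    if (notification_type, channel) in prefs.opted_out:
        return "suppress"
    # Assumption: critical notifications (OTP, security) ignore DND.
    if in_dnd_window(prefs, local_now) and not critical:
        return "delay"
    return "send"
```

The wrap-around branch matters in practice: most quiet-hour windows start in the evening and end the next morning, so a plain `start <= now < end` comparison would never match.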

API Design

| Endpoint | Method | Request Body | Response |
| --- | --- | --- | --- |
| POST /api/v1/notifications | POST | { request_id, template_id, channels[], recipient: { user_id }, variables, priority } | 202 Accepted { notification_id, status: 'queued' } |
| GET /api/v1/notifications/{id} | GET | (none) | { notification_id, status, channel, provider, sent_at, delivered_at } |
| PUT /api/v1/users/{id}/preferences | PUT | { preferences[], dnd_start, dnd_end, timezone } | 200 OK |
| POST /api/v1/users/{id}/devices | POST | { token, platform: 'ios' / 'android' / 'web', app_version } | 201 Created |

Why 202 Accepted and not 200 OK? The 202 status code means "the request was received and will be processed, but processing is not yet complete." This is the correct HTTP response for asynchronous operations. Returning 200 OK would falsely imply the notification was delivered at the moment the API responded.

Why request_id in the body? This is the caller's idempotency key. If the upstream service retries the request due to a network timeout, the same request_id ensures the notification is not enqueued and delivered twice. We explore idempotency in detail in Step 6.
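A minimal sketch of the accept path may make the two answers above concrete. Everything here is in-memory and hypothetical: plain dictionaries stand in for the Kafka topics and for the store that remembers request IDs, and the topic naming scheme (`notifications.<channel>.<priority>`) is an assumption rather than part of the design.

```python
import uuid

VALID_CHANNELS = {"push", "email", "sms", "in_app"}
REQUIRED_FIELDS = {"request_id", "template_id", "channels", "recipient"}


class NotificationService:
    """In-memory stand-in: dicts replace Kafka topics and the dedupe store."""

    def __init__(self):
        self.topics = {}    # hypothetical topic name -> published events
        self.accepted = {}  # request_id -> notification_id already issued

    def accept(self, request):
        """Return (http_status, body) for POST /api/v1/notifications."""
        missing = REQUIRED_FIELDS - set(request)
        if missing:
            return 400, {"error": f"missing fields: {sorted(missing)}"}
        unknown = set(request["channels"]) - VALID_CHANNELS
        if unknown:
            return 400, {"error": f"unknown channels: {sorted(unknown)}"}

        request_id = request["request_id"]
        if request_id in self.accepted:
            # Caller retry: give the same answer, publish nothing new.
            return 202, {"notification_id": self.accepted[request_id],
                         "status": "queued"}

        notification_id = str(uuid.uuid4())
        self.accepted[request_id] = notification_id
        priority = request.get("priority", "promotional")
        for channel in request["channels"]:
            topic = f"notifications.{channel}.{priority}"  # assumed naming
            self.topics.setdefault(topic, []).append(
                {"notification_id": notification_id, "channel": channel, **request}
            )
        return 202, {"notification_id": notification_id, "status": "queued"}
```

Calling `accept` twice with the same `request_id` returns the same `notification_id` both times and leaves exactly one event on the topic, which is the behavior a retrying caller needs.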

Database Schema

The system uses two databases with distinct roles: PostgreSQL for structured configuration and preference data, and Cassandra for the high-volume, append-only event log.

-- Device tokens for push notifications
CREATE TABLE device_tokens (
  id          BIGSERIAL PRIMARY KEY,
  user_id     BIGINT REFERENCES users(id) ON DELETE CASCADE,
  token       VARCHAR(512) NOT NULL,
  platform    VARCHAR(10) NOT NULL,   -- 'ios', 'android', 'web'
  is_active   BOOLEAN DEFAULT true,
  last_seen   TIMESTAMPTZ,
  UNIQUE(user_id, token)
);

-- Per-user, per-channel notification preferences
CREATE TABLE notification_preferences (
  user_id           BIGINT REFERENCES users(id) ON DELETE CASCADE,
  notification_type VARCHAR(100) NOT NULL, -- 'order_update', 'promo', 'security_alert'
  channel           VARCHAR(20) NOT NULL,  -- 'push', 'email', 'sms', 'in_app'
  is_enabled        BOOLEAN DEFAULT true,
  PRIMARY KEY (user_id, notification_type, channel)
);

-- Reusable message templates with variable substitution
CREATE TABLE notification_templates (
  id            BIGSERIAL PRIMARY KEY,
  name          VARCHAR(100) UNIQUE NOT NULL,
  channel       VARCHAR(20) NOT NULL,
  subject       VARCHAR(255),             -- email subject line only
  body_template TEXT NOT NULL,            -- "Hi {{first_name}}, your order {{order_id}} shipped!"
  type          VARCHAR(50) NOT NULL,     -- 'transactional', 'promotional', 'security'
  version       INT DEFAULT 1,
  is_active     BOOLEAN DEFAULT true
);

The Cassandra event log stores one row per delivery attempt:

notification_events (Cassandra):
  notification_id   UUID       -- partition key
  user_id           BIGINT     -- matches users.id in PostgreSQL; secondary index for per-user queries
  channel           TEXT       -- 'push', 'email', 'sms'
  status            TEXT       -- 'queued', 'sent', 'delivered', 'failed', 'suppressed'
  provider          TEXT       -- 'apns', 'fcm', 'sendgrid', 'twilio'
  provider_msg_id   TEXT       -- the provider's own message ID (for deduplication)
  idempotency_key   TEXT
  created_at        TIMESTAMP  -- clustering column, so each delivery attempt is its own row
  error_code        TEXT       -- '410', '429', '500', etc.

Why Cassandra for the event log? The event log is append-only and write-heavy — it grows continuously at 125 GB/day. Cassandra is designed for exactly this pattern: high write throughput with horizontal scaling by partition key (the field that determines which node stores a given row). PostgreSQL can store this data, but at this write volume it would require significant resources that would compete with the latency-sensitive preference and device queries that need fast reads. Cassandra lets us scale writes independently of the transactional PostgreSQL workload.

Step 4: Deep Dive — The Message Queue Pipeline

The message broker is what separates a system that tries to deliver notifications from one that guarantees delivery attempts. This section explains how the pipeline works and why every design decision matters.

The Notification Delivery Pipeline

Each notification travels through a queue before reaching a worker, which renders the template and calls the provider. The queue is the durability guarantee: messages survive worker crashes, provider outages, and deployment restarts. Each channel gets its own Kafka topic so a failure in one channel cannot block another.


Template Rendering

Templates decouple message content from sending logic. A worker receiving { template_id: 'order_shipped_push', variables: { first_name: 'Alice', order_id: 'ORD-123' } } follows these steps:

  1. Fetch the template from Redis cache — Redis is an in-memory key-value store that returns data in under a millisecond, far faster than a database query. On a cache miss, fall back to the database and populate the cache for subsequent requests.
  2. Fetch the user's locale and timezone
  3. Select the matching locale variant if available (en-US, fr-FR, ja-JP)
  4. Substitute variables: "Hi {{first_name}}, your order {{order_id}} shipped!" → "Hi Alice, your order ORD-123 shipped!"
  5. Format for channel: trim to ≤160 characters for SMS, build full HTML for email, validate ≤4 KB for push
  6. Send to the provider

Storing templates in the database — rather than hard-coding them in application code — lets content teams update message copy, add locale variants, and run A/B tests without a code deployment or engineering review. Templates also carry a version field so you can roll back to a known-good version immediately if a change causes unexpected rendering behavior in production.
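The substitution and channel-formatting steps can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the push check measures only the rendered text (a real check would measure the full serialized payload), and truncating SMS at 160 characters ignores multi-segment messages and non-GSM encodings.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
SMS_MAX_CHARS = 160
PUSH_MAX_BYTES = 4096


def render(body_template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder; fail fast if one is missing."""
    required = set(PLACEHOLDER.findall(body_template))
    missing = required - set(variables)
    if missing:
        raise ValueError(f"missing template variables: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), body_template)


def format_for_channel(text: str, channel: str) -> str:
    """Apply the per-channel size rules from step 5."""
    if channel == "sms":
        return text[:SMS_MAX_CHARS]
    if channel == "push" and len(text.encode("utf-8")) > PUSH_MAX_BYTES:
        raise ValueError("push payload exceeds 4 KB")
    return text


message = render(
    "Hi {{first_name}}, your order {{order_id}} shipped!",
    {"first_name": "Alice", "order_id": "ORD-123"},
)
print(message)  # Hi Alice, your order ORD-123 shipped!
```

Raising on a missing variable, rather than leaving the raw placeholder in place, keeps a broken message from ever reaching a user.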

In-App Notifications and WebSocket Delivery

Push, email, and SMS all deliver notifications to users who are outside the app. In-app notifications work differently — they are delivered in real time to users who are actively using the app right now, and they use WebSockets instead of a third-party provider.

Why WebSocket and not HTTP polling? With regular HTTP, the client must ask ("is there anything new?") and the server responds — the client has to keep asking on a timer (polling). WebSocket flips this: it establishes a persistent, bidirectional connection, so the server can push data to the client at any moment without waiting to be asked. For real-time notifications that should appear within a second of the triggering event, WebSocket is the right tool.

How in-app delivery works:

  1. When a user opens the app, the client establishes a WebSocket connection to a WebSocket server instance.
  2. The WebSocket server records the active connection in Redis: SET ws:user:{user_id} {server_id}. This is necessary because you run multiple WebSocket server instances in production — a notification event needs to reach the specific instance that holds a given user's connection.
  3. When the In-App Worker receives a notification event from Kafka, it looks up ws:user:{user_id} in Redis. If the key exists, the user is online — the worker forwards the event to the correct WebSocket server, which delivers it in real time.
  4. If the key does not exist, the user is offline — the notification is stored in the event log with status pending. When the user opens the app, the client fetches recent pending notifications via a standard REST API call, which is the fan-out on read pattern discussed in Step 7.

This two-path design (real-time WebSocket for active users, REST fetch for returning users) means in-app notifications work correctly regardless of whether the user is currently in the app.
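The routing decision in steps 2 to 4 can be sketched with in-memory stand-ins. This is an illustration only: the `InAppRouter` class is hypothetical, a dictionary replaces the Redis connection registry, and lists replace both the WebSocket servers and the event log.

```python
class InAppRouter:
    """Routes an in-app event to a live connection or stores it as pending."""

    def __init__(self):
        self.connections = {}     # stands in for Redis: ws:user:{id} -> server_id
        self.server_inboxes = {}  # server_id -> events forwarded to that server
        self.pending = {}         # user_id -> notifications awaiting a REST fetch

    def on_connect(self, user_id, server_id):
        # Equivalent of: SET ws:user:{user_id} {server_id}
        self.connections[f"ws:user:{user_id}"] = server_id

    def on_disconnect(self, user_id):
        self.connections.pop(f"ws:user:{user_id}", None)

    def deliver(self, user_id, event):
        server_id = self.connections.get(f"ws:user:{user_id}")
        if server_id is not None:
            # Online: hand the event to the server holding the connection.
            self.server_inboxes.setdefault(server_id, []).append((user_id, event))
            return "forwarded"
        # Offline: keep it until the client fetches on next app open.
        self.pending.setdefault(user_id, []).append(event)
        return "pending"

    def fetch_pending(self, user_id):
        return self.pending.pop(user_id, [])
```

A production registry would also need to expire stale keys when a WebSocket server dies without running its disconnect handler; that concern is left out of this sketch.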

Step 5: Deep Dive — Push Notifications and Why They Fail

This section answers the question at the heart of this case study: why do some notifications arrive late, or not at all?

Push Notification Delivery: APNs and FCM

Push notifications travel through a two-hop path: your server calls APNs (Apple) or FCM (Google), which then delivers to the device. Your server never has a direct connection to the device. This indirection enables offline delivery — but it also introduces failure modes that are invisible until you know where to look.


Why Notifications Fail: The Complete Failure Map

Understanding failure modes is what separates a system that works in testing from one that holds up in production.

| Failure Mode | Root Cause | Provider Signal | Fix |
| --- | --- | --- | --- |
| Stale device token | App was uninstalled or the device was reset | APNs: 410 Gone. FCM: NotRegistered | Delete the token from device_tokens immediately; do not retry with the same token |
| Provider rate limiting | Sending requests faster than the provider's per-project quota allows | HTTP 429 Too Many Requests | Exponential backoff with jitter; use separate API keys for critical vs. promotional traffic to protect critical quota |
| Provider outage | APNs, SendGrid, or Twilio is experiencing an incident | HTTP 5xx errors on provider calls | Circuit breaker: stop sending after N consecutive failures, route to a backup provider (e.g., switch from SendGrid to Mailgun), probe periodically for recovery |
| Oversized payload | Push notification payload exceeds the 4 KB limit | APNs 413 with PayloadTooLarge | Validate payload size before enqueueing; send a 'content-available' push that tells the app to fetch the full content from your API |
| Worker crash mid-send | The worker process dies after calling the provider but before ACKing Kafka | Kafka re-delivers the message to another worker | Idempotency key prevents the duplicate delivery from reaching the user (covered in Step 6) |
| Device offline | User's device has no network connection | Provider returns 200 (message queued) | APNs and FCM handle this automatically; set an expiry time so time-sensitive notifications are discarded if not delivered within a window |
| DND / quiet hours | Notification arrives at 3am in the user's local timezone | Notification delivered but user missed it | Check user timezone and DND preferences before enqueueing; delay to the next allowed window rather than sending and being ignored |
| Template variable missing | A required variable like {{order_id}} is absent from the payload | Message renders as raw placeholder text | Validate that all required template variables are present at the API layer; fail fast with a 400 error rather than enqueuing a broken message |
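Two of the fixes in the table lend themselves to a short sketch: mapping a provider's HTTP status to an action, and computing an exponential backoff delay with jitter. Both functions are hypothetical helpers written for illustration; the "full jitter" variant shown here (a uniform random delay up to the exponential ceiling) is one common choice, and the base and cap values are assumptions.

```python
import random


def classify(status_code: int) -> str:
    """Map a provider HTTP status to what the worker should do next."""
    if 200 <= status_code < 300:
        return "done"
    if status_code == 410:
        return "delete_token"  # stale token: never retry with it
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"         # rate limit or provider-side failure
    return "fail"              # other 4xx: retrying will not help


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


for attempt in range(5):
    ceiling = min(60.0, 1.0 * 2 ** attempt)
    print(f"attempt {attempt}: wait up to {ceiling:.0f}s, "
          f"e.g. {backoff_delay(attempt):.2f}s")
```

The jitter is what prevents a thundering herd: without it, every worker that failed at the same moment would retry at the same moment too.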

Step 6: Deep Dive — Reliability: At-Least-Once Delivery and Idempotency

The message broker's guarantee — at-least-once delivery — is both the system's greatest reliability feature and its most dangerous source of bugs if unhandled.

At-least-once delivery means a message counts as processed only after a consumer explicitly acknowledges it (in Kafka, by committing its offset). If a worker crashes between sending a notification and committing, the broker treats the message as unprocessed and re-delivers it to another available worker. This is correct behavior for reliability, but it means the same notification could be processed more than once, producing a duplicate delivery to the user. This is the core tension: the very mechanism that prevents silent drops also introduces the risk of duplicates.

Idempotency: Preventing Duplicate Notifications

The idempotency key pattern uses Redis as a short-lived deduplication checkpoint. Before calling any provider, the worker performs an atomic check-and-set: using Redis's SET NX (set if not exists) command, it either claims the notification ID as its own or discovers another worker already handled it. If it has already been sent, the worker skips the provider call and acknowledges the Kafka message silently. If the worker crashes after the provider call but before acknowledging, Redis still holds the 'sent' marker, so the re-delivered event is safely suppressed on the next attempt.
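The check-and-set step can be sketched without a running Redis. In this illustration the `FakeRedis` class is a dictionary-backed stand-in that mimics only the behavior of `SET key value NX`; the key format and the `process` function are assumptions made for the example.

```python
class FakeRedis:
    """Dict-backed stand-in for the single Redis operation the pattern needs."""

    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        """Mimic SET key value NX: write only if absent, report whether we wrote."""
        if key in self.store:
            return False
        self.store[key] = value
        return True


def process(event, redis, send):
    """Call the provider at most once per notification_id."""
    key = f"notif:sent:{event['notification_id']}"  # hypothetical key format
    if not redis.set_nx(key, "claimed"):
        return "skipped"  # another attempt already claimed this notification
    send(event)
    return "sent"


delivered = []
redis = FakeRedis()
event = {"notification_id": "n-1"}
print(process(event, redis, delivered.append))  # sent
print(process(event, redis, delivered.append))  # skipped (re-delivery)
```

One caveat of claiming the key before the provider call: if the worker crashes after the claim but before the call, the re-delivered event is suppressed even though nothing was sent. A production version would typically give the claim a short expiry (Redis supports `SET key value NX EX seconds`) or record distinct "claimed" and "sent" states so an abandoned claim can be retried.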


The Dead Letter Queue (DLQ)

After a configurable number of retries (typically 3–5), a notification that keeps failing is moved to a Dead Letter Queue (DLQ) — a separate Kafka topic or queue where permanently failed messages are parked for human inspection.

The critical rule: never silently discard a failed notification. A dropped notification has no audit trail and cannot be replayed. A DLQ message carries the full original payload, can be inspected by an on-call engineer to diagnose the root cause, and can be manually replayed once the underlying issue is resolved. The difference between routing to a DLQ and silently deleting the message is the difference between a diagnosable incident and an unexplained gap in the delivery logs.
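A sketch of the retry-then-park rule, with plain lists standing in for the retry topic and the DLQ. The `handle` function, the `attempts` and `last_error` fields, and the limit of 5 are assumptions chosen for the example.

```python
MAX_ATTEMPTS = 5


def handle(event, send, retry_queue, dlq):
    """Try one delivery; on failure, requeue or park the full payload."""
    attempts = event.get("attempts", 0) + 1
    try:
        send(event)
        return "sent"
    except Exception as exc:
        # Keep the original payload intact and record why it failed.
        failed = {**event, "attempts": attempts, "last_error": str(exc)}
        if attempts >= MAX_ATTEMPTS:
            dlq.append(failed)  # parked for inspection and manual replay
            return "dead_lettered"
        retry_queue.append(failed)
        return "retry"
```

Because the DLQ entry carries the complete original event plus the attempt count and last error, an engineer can diagnose the failure and replay the message unchanged once the cause is fixed.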

Step 7: Deep Dive — Fan-Out for Mass Notifications

The fan-out problem is: how do you efficiently deliver one notification event to millions of recipients? When an e-commerce platform sends a flash sale alert to all opted-in users, it may need to dispatch 10 million push notifications in under 5 minutes. The naive approach — query all users in a loop, send one by one — would not come close. With an average provider round-trip of 50 ms, a single thread sending 10 million notifications sequentially would take roughly 139 hours. Fan-out is about designing around this bottleneck.
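The arithmetic behind that estimate, and the concurrency needed to meet the 5-minute target instead, can be worked through directly. This is a sketch: the `chunks` helper and the batch size in the usage line are illustrative, not part of the design.

```python
RECIPIENTS = 10_000_000
ROUND_TRIP_S = 0.05      # 50 ms per provider call
DEADLINE_S = 5 * 60      # 5 minutes

# One thread, one call at a time.
sequential_hours = RECIPIENTS * ROUND_TRIP_S / 3600

# Rate needed to finish by the deadline, and (by Little's law:
# in-flight = rate x latency) how many calls must be in flight at once.
required_rate = RECIPIENTS / DEADLINE_S
in_flight = required_rate * ROUND_TRIP_S


def chunks(ids, size):
    """Split a recipient list into batches that workers can take in parallel."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


print(f"sequential: {sequential_hours:.1f} hours")
print(f"required: {required_rate:,.0f}/s with ~{in_flight:,.0f} calls in flight")
print(list(chunks(list(range(10)), 4)))
```

Roughly 1,700 concurrent provider calls are needed to hit the deadline, which is why the recipient list has to be split into batches and spread across many workers.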

Fan-Out Strategies for Mass Notifications

Fan-out amplifies one notification event into millions of individual deliveries. The right strategy depends on two questions: do you need per-user delivery tracking, and what is the ratio of writes to reads? Production systems typically use all three patterns (pub/sub, fan-out on write, and fan-out on read) simultaneously for different channels and notification types.


Step 8: Trade-offs and Production Reality

Every design decision in this system involves a trade-off. Here is a consolidated view of the key choices and when you would decide differently.

| Decision | Choice Made | What We Gave Up | Choose Differently When... |
| --- | --- | --- | --- |
| Delivery guarantee | At-least-once delivery + application-level idempotency keys | Exactly-once semantics, which eliminate duplicates at the broker layer but add 2–3x latency overhead | Never for notifications: at-least-once plus idempotency achieves the same user-visible result with far less complexity |
| Queue topology | One Kafka topic per channel; two priority tiers (critical vs. promotional) | Simpler single-queue setup with a channel type field | Start with a single queue if your team is new to Kafka; add channel isolation only when a channel failure has actually caused cross-channel impact in production |
| Fan-out strategy | Hybrid: pub/sub for push blasts, fan-out on write for email and SMS, fan-out on read for in-app | Operational simplicity of a single uniform strategy across all channels | Use fan-out on write for all channels if per-user delivery tracking is a regulatory requirement (e.g., financial notifications that must be audited) |
| Event log database | Cassandra for the append-only, high-write event log | SQL ACID guarantees and simpler ad-hoc querying for debugging | Use PostgreSQL for the event log if daily notification volume is under ~10 million/day; migrate when write throughput or storage growth becomes a bottleneck |
| Template storage | Database-backed templates with runtime updates | Compile-time type checking and validation of template variables (possible when templates live in code) | Keep templates in code if they are simple, change infrequently, and you want build-time guarantees that all variable references are valid |
| Provider redundancy | Circuit breaker with a single active provider per channel, failover to backup | Cost and maintenance of keeping a secondary provider account active and configured | Set up an active-active multi-provider setup for each channel if your SLA requires notifications to survive a primary provider outage without any manual intervention |
| Retry behavior | Exponential backoff with jitter; max 5 retries; move to DLQ on exhaustion | Faster initial re-delivery attempts (a flat 1-second retry interval would respond quicker after transient failures) | Reduce max retries to 3 for SMS where providers charge per attempt; increase the retry window for email where eventual delivery after an hour is better than no delivery at all |

What this design intentionally excludes

Production notification systems at major companies include additional layers that are out of scope for this case study:

  • Engagement analytics — open rates, click-through rates, and unsubscribe rates via provider webhooks, requiring a separate analytics ingestion pipeline
  • A/B testing on templates — routing a percentage of users to template variant B and measuring engagement to determine the winning message copy
  • Notification digesting — grouping multiple notifications (e.g., "5 new messages") into a single summary to reduce notification fatigue and OS-level throttling
  • Compliance tooling — one-click unsubscribe links in marketing emails (required by CAN-SPAM and GDPR), consent audit trails, and data deletion cascades that remove notification history
  • Multi-region delivery — routing push requests to the geographically closest APNs or FCM endpoint to reduce round-trip latency for global users

Each of these adds real complexity, but none fundamentally changes the queue-and-worker architecture described here.

Summary

| Component | Design Decision | Key Reasoning |
| --- | --- | --- |
| Notification Service | Stateless API gateway: validates, enriches, enqueues, returns 202 Accepted immediately | Callers must never wait for delivery; 202 signals accepted-for-processing without implying the notification has been delivered |
| Message Broker | Kafka with one topic per channel and two priority tiers (critical and promotional) | Channel isolation prevents cross-channel failures; priority tiers protect OTP and security alerts from being delayed behind promotional campaign bursts |
| Channel Workers | Independent, auto-scaled consumer groups, one per channel | Worker failures are contained to one channel; each channel's worker count scales independently based on its queue depth |
| At-least-once + Idempotency | Kafka at-least-once delivery with a Redis idempotency key per notification | Guarantees that delivery attempts survive worker crashes without sending duplicate messages to users |
| Retry strategy | Exponential backoff with jitter; max 5 retries; Dead Letter Queue on exhaustion | Prevents thundering herd when a provider recovers from an outage; DLQ preserves all failed messages for diagnosis and manual replay |
| Fan-out | Hybrid: pub/sub for push blasts, fan-out on write (async) for email and SMS | Pure pub/sub is cheapest but provides no per-user tracking; fan-out on write gives a complete delivery audit trail at the cost of write amplification |
| Event Log | Cassandra append-only table, one row per delivery attempt | Cassandra sustains high append throughput at 125 GB/day without impacting the latency of the transactional PostgreSQL queries for preferences and device tokens |
| Stale token pruning | Delete device tokens immediately on APNs 410 Gone or FCM NotRegistered | Stale tokens accumulate silently: a token that was valid yesterday may become invalid tonight when a user upgrades their phone. Pruning on first failure is the only reliable mechanism |
| Templates | Stored in database with variable substitution, locale variants, and versioning | Runtime updates without code deployments; content teams manage copy independently; version rollback is possible when a template change causes unexpected behavior |

The notification system is the queue-and-worker pattern applied to a concrete, everyday problem. Every design decision here (the message broker topology, the idempotency key, the fan-out strategy, the Dead Letter Queue) reappears in payment processing pipelines, event-driven microservices, and asynchronous job systems. The specific providers (APNs, FCM, SendGrid, Twilio) are interchangeable; the architectural pattern is universal.

The central insight: the queue is the reliability guarantee. Without it, a provider outage means lost notifications. With it, notifications accumulate during the outage and drain when the provider recovers — invisibly to the triggering service, and mostly invisibly to the user. This is the answer to why some notifications arrive late: a temporary queue backlog, not a permanent failure. And the answer to why some never arrive at all: a stale device token, a silent rate limit, or an unhandled error that was discarded without a retry.
