Event-Driven Design

In the previous section, you learned that asynchronous communication decouples services: a producer publishes a message to a broker, the broker holds it, and a consumer processes it later — independently of the producer. Event-Driven Design takes that idea further. It elevates events to a core architectural concern, not just an implementation detail.

Instead of services calling each other directly, every significant state change in the system is published as an event: OrderPlaced, PaymentFailed, UserSignedUp. Any part of the system that cares about that event can subscribe and react — without the producer knowing or caring who is listening.

This section covers the three main patterns and technologies you will encounter:

  • Pub/Sub (Redis Pub/Sub) — lightweight, fire-and-forget fan-out
  • Message Queues (RabbitMQ) — reliable task delivery with smart routing
  • Event Streams (Kafka) — durable, replayable, high-throughput event logs

And the essential safety mechanism that applies to all of them:

  • Dead Letter Queues (DLQ) — a safety net for messages that could not be processed

What Is an Event?

An event is an immutable record of something that happened in your system. Three properties distinguish events from regular function calls or API requests:

  1. Past tense — events describe what happened, not what should happen. OrderPlaced, not PlaceOrder. PaymentFailed, not FailPayment. This naming convention signals that the producer is reporting a fact, not issuing a command.
  2. Fire-and-forget — the producer publishes the event and moves on. It does not wait for a response and does not know who, if anyone, is consuming the event.
  3. Immutable — once published, an event cannot be modified. Consumers react to it, but they cannot change it.

| Aspect | Direct API Call | Event |
| --- | --- | --- |
| Direction | Request → Response (bidirectional) | One-way: producer publishes, consumers react |
| Coupling | Producer must know about the consumer | Producer is unaware of its consumers |
| Timing | Synchronous or near-synchronous | Asynchronous: consumer reacts when ready |
| Naming | Imperative: CreateOrder, SendEmail | Past tense: OrderCreated, EmailQueued |
| Fan-out | Must call each service explicitly | Any number of consumers subscribe independently |
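
To make the three properties of an event concrete, here is a minimal sketch that models an event as a frozen dataclass. The field names (order_id, total_cents, event_id, occurred_at) are assumptions chosen for illustration, not a required schema.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)  # frozen=True makes attribute assignment raise an error
class OrderPlaced:
    """Past-tense name: the producer reports a fact, it does not issue a command."""

    order_id: str
    total_cents: int
    # Hypothetical metadata fields, filled in when the event is created.
    event_id: str = field(default_factory=lambda: str(uuid4()))
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


event = OrderPlaced(order_id="o-123", total_cents=4999)

try:
    event.total_cents = 0  # a consumer trying to rewrite history
except FrozenInstanceError:
    print("events are immutable")
```

A frozen dataclass only enforces immutability inside one Python process. Across services, immutability is a convention that the broker and the consumers have to respect.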

The Three Players: Producers, Brokers, and Consumers

Every event-driven system has three roles:

  • A producer detects a state change and publishes an event to a broker.
  • A broker (also called a message bus or event bus) receives, routes, stores, and delivers events to the appropriate consumers.
  • A consumer subscribes to the events it cares about and reacts to them.

The broker is the critical piece — it is what fully decouples producers from consumers. Without a broker, a producer would need to maintain a list of every consumer and call each one directly. With a broker, the producer simply publishes; the broker handles delivery. Adding a new consumer requires no changes to the producer at all.
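
The decoupling described above can be sketched with a toy in-process broker. This is an illustration only, assuming a hypothetical InMemoryBroker class with synchronous callbacks; a real broker runs as a separate service and delivers over the network.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy broker: maps event names to subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._subscribers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        # The producer calls only this method; it never sees the handler list.
        for handler in self._subscribers[event_name]:
            handler(payload)


broker = InMemoryBroker()
received: list[str] = []

broker.subscribe("OrderPlaced", lambda e: received.append(f"email:{e['order_id']}"))
broker.subscribe("OrderPlaced", lambda e: received.append(f"inventory:{e['order_id']}"))

# Adding a consumer later is one more subscribe call; the producer is unchanged.
broker.subscribe("OrderPlaced", lambda e: received.append(f"fraud:{e['order_id']}"))

# Producer side: publish once, the broker fans out.
broker.publish("OrderPlaced", {"order_id": "o-123"})
print(received)
```

The producer code is the single publish call at the bottom. It stays the same whether two consumers or twenty are subscribed.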

Event-Driven Architecture: The Core Pattern

A single OrderPlaced event from the Order Service fans out to three independent consumers. None of them are aware of each other, and the Order Service doesn't know they exist. Adding a new consumer (e.g., a Fraud Detection service) requires zero changes to the producer.


Pub/Sub: Fast Fan-Out with Redis

The simplest form of event-driven design is Pub/Sub (Publish/Subscribe). Publishers push messages to named channels; every currently-subscribed subscriber receives the message instantly.

Redis Pub/Sub is a common lightweight implementation. Because Redis keeps its data in memory, delivery is extremely fast: latency is typically sub-millisecond, and throughput on the order of one million messages per second is often quoted, although real numbers depend on hardware, payload size, and the number of subscribers. This speed comes with a fundamental trade-off: Redis Pub/Sub has no persistence. If a subscriber is offline when a message is published, that subscriber never receives it. Redis Pub/Sub does not buffer, retry, or store messages in any form.

Redis Pub/Sub: Live Fan-Out

Redis Pub/Sub works like a live radio broadcast. The publisher transmits on a channel; all active subscribers tuned to that channel receive the message in real time. If no subscriber is listening, the message is permanently lost, and a subscriber that disconnects for even a moment misses everything published while it was away. Redis Pub/Sub does not store messages.
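
The radio-broadcast behavior can be modeled in a few lines. The sketch below is a toy in-memory channel, not the Redis client API; the EphemeralChannel class and its method names are assumptions made for illustration. Its publish method returns the number of receivers, which loosely mirrors how the Redis PUBLISH command reports how many clients received a message.

```python
class EphemeralChannel:
    """Toy model of a fire-and-forget channel: nothing is buffered."""

    def __init__(self) -> None:
        self._inboxes: dict[str, list[str]] = {}

    def subscribe(self, name: str) -> list[str]:
        # A new subscription always starts with an empty inbox.
        self._inboxes[name] = []
        return self._inboxes[name]

    def unsubscribe(self, name: str) -> None:
        self._inboxes.pop(name, None)

    def publish(self, message: str) -> int:
        # Deliver only to subscribers connected right now; report how many.
        for inbox in self._inboxes.values():
            inbox.append(message)
        return len(self._inboxes)


channel = EphemeralChannel()

delivered_to = channel.publish("price=100")        # nobody is listening yet
first_session = channel.subscribe("dashboard")
channel.publish("price=101")                       # received
channel.unsubscribe("dashboard")                   # brief disconnect
channel.publish("price=102")                       # missed for good
second_session = channel.subscribe("dashboard")    # reconnect
channel.publish("price=103")                       # received

print(delivered_to, first_session, second_session)
```

Nothing in the model lets the dashboard ask for price=100 or price=102 after the fact, which is the trade-off to weigh before choosing this pattern.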


A note on Redis Streams: Redis 5.0 introduced Redis Streams, a separate data structure that adds persistence, consumer groups, and at-least-once delivery to Redis messaging. If you need the simplicity of Redis plus durability and replay, Redis Streams is the right choice — think of it as Redis's answer to Kafka-style event streaming, though at a much smaller scale. The key distinction: Redis Pub/Sub is push-based and ephemeral; Redis Streams is pull-based and durable.

Message Queues: Reliable Task Routing with RabbitMQ

When you need reliable delivery, meaning a guarantee that every message will be processed even if a consumer crashes mid-processing, a traditional message queue is the right tool. RabbitMQ is one of the most widely used open-source message brokers and implements the AMQP (Advanced Message Queuing Protocol) standard.

RabbitMQ's key architectural insight is the separation between exchanges (which receive and route messages) and queues (which store and deliver them). Producers publish to exchanges; consumers read from queues. The exchange applies routing logic to decide which queue or queues receive each message. This separation means you can change routing rules without touching any producer or consumer code.

RabbitMQ: Exchanges and Queues

RabbitMQ's routing layer sits between producers and queues. Producers publish to an exchange with a routing key; the exchange applies its routing rules and forwards the message to matching queues. This lets you build flexible routing topologies without putting any routing logic in the producer.


The four exchange types explained:

  • Direct exchange: Routes messages whose routing key exactly matches a queue's binding key. Use this for precise task routing — for example, routing routing_key=email only to the email queue.
  • Fanout exchange: Broadcasts every message to all bound queues, ignoring routing keys entirely. Use this for fan-out scenarios: send one event to every consumer simultaneously.
  • Topic exchange: Routes based on pattern matching with wildcards (* matches exactly one word, # matches zero or more words). For example, the pattern logs.* matches logs.error and logs.info but not logs.error.critical. Use this for flexible, hierarchical routing.
  • Headers exchange: Routes based on message header attributes rather than the routing key. Rarely used in practice — the Topic exchange covers most routing needs more simply.
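
The routing rules above can be sketched in plain Python. This is a simplified model of the matching semantics, not RabbitMQ's implementation; the function names and the (queue, binding key) tuple layout are assumptions, and the headers exchange is left out.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Topic-exchange matching: '*' is exactly one word, '#' is zero or more."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)  # pattern used up: key must be used up too
        if p[i] == "#":
            # Let '#' absorb 0, 1, 2, ... of the remaining words.
            return any(match(i + 1, nxt) for nxt in range(j, len(k) + 1))
        if j < len(k) and p[i] in ("*", k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(exchange_type: str, bindings: list[tuple[str, str]], routing_key: str) -> list[str]:
    """Return the queues that receive a message; bindings are (queue, binding_key)."""
    if exchange_type == "fanout":
        return [queue for queue, _ in bindings]  # routing key ignored
    if exchange_type == "direct":
        return [queue for queue, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [queue for queue, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")


bindings = [("errors", "logs.error"), ("one_level", "logs.*"), ("everything", "logs.#")]
print(route("direct", bindings, "logs.error"))
print(route("topic", bindings, "logs.error.critical"))
print(route("fanout", bindings, "ignored"))
```

With these bindings, the direct exchange delivers logs.error only to the errors queue, while the topic exchange delivers logs.error.critical only to the queue bound with logs.#.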

Event Streams: Durable and Replayable with Kafka

Where Redis Pub/Sub loses messages when subscribers are offline and RabbitMQ deletes messages after acknowledgment, Apache Kafka takes a fundamentally different approach: events are written to a persistent, append-only log and are not deleted when consumers read them. Consumers read from the log at their own pace, and any consumer can go back and re-read historical events at any point.

This design enables two capabilities that traditional message queues either lack or support only with extra setup:

  1. Multiple independent consumer groups can each read the full stream independently, as if each had its own private copy of every event — without interfering with one another.
  2. Replay — any consumer can rewind to any point in the log and reprocess historical events. This is invaluable for debugging, backfill jobs, and recovering from bugs in consumer logic.

Topics, Partitions, and Offsets

A Kafka topic is a named, ordered log of events. Topics are split into partitions for parallelism — each partition is an independent, ordered sequence of events stored durably on disk. A Kafka cluster distributes partitions across its brokers, so different partitions of the same topic can be handled by different machines. Events within a single partition are strictly ordered; across different partitions, there is no global ordering guarantee.

Each event in a partition is assigned a sequential offset — a number that permanently identifies its position in that partition's log. Consumers track which offset they have processed up to. Unlike RabbitMQ, Kafka does not delete a message when a consumer reads it. Messages are retained for a configurable period (default: 7 days; can be set to indefinite).
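
Partitions and offsets can be pictured with a small in-memory model. The sketch below is not the Kafka client API: the Topic class is hypothetical, and zlib.crc32 stands in for the hash that Kafka's default partitioner applies to the message key.

```python
import zlib


class Topic:
    """Toy partitioned log: each partition is an append-only list."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> tuple[int, int]:
        # Same key -> same partition, so events for one key stay in order.
        # crc32 is a stand-in hash chosen because it is stable across runs.
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log

    def read(self, partition: int, offset: int) -> list[str]:
        # Reading never removes anything, so any offset can be read again.
        return self.partitions[partition][offset:]


orders = Topic("orders", num_partitions=3)
p1, o1 = orders.append("customer-42", "OrderPlaced")
p2, o2 = orders.append("customer-42", "OrderPaid")
p3, o3 = orders.append("customer-42", "OrderShipped")

latest = orders.read(p1, offset=2)   # only the newest event
replayed = orders.read(p1, offset=0)  # rewind and reprocess from the start
print(latest, replayed)
```

Replay falls out of the data structure: because read only slices the list, rewinding is just a matter of asking for a smaller offset.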

Kafka: Partitioned, Durable Event Log

A Kafka topic is divided into partitions, each an ordered, append-only log stored on disk. Every message has a permanent offset. Consumer groups independently track their own read position — meaning multiple services can consume the same stream without interfering with each other. Messages are not deleted after being read.


Consumer Groups: Load Balancing and Fan-out in One

Kafka's consumer group mechanism is what makes it uniquely powerful:

  • Within a consumer group, Kafka assigns each partition to exactly one consumer. This distributes load across multiple workers — similar to how a traditional task queue works.
  • Across different consumer groups, Kafka delivers the full stream to each group independently. This provides fan-out — similar to Pub/Sub.

Kafka unifies both patterns in a single system. One topic can simultaneously act as a work queue for a group of email workers and broadcast every event to a dozen independent services — with no reconfiguration of the topic or the producer required.
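
Both behaviors can be sketched with a simplified model. The round-robin assignment below is an assumption made for clarity (Kafka ships several assignment strategies), and GroupOffsets is a hypothetical class that stands in for the committed offsets Kafka tracks per group and partition.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin: every partition goes to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


class GroupOffsets:
    """Each (group, partition) pair has its own committed offset."""

    def __init__(self) -> None:
        self._committed: dict[tuple[str, int], int] = {}

    def poll(self, group: str, partition: int, log: list[str]) -> list[str]:
        start = self._committed.get((group, partition), 0)
        records = log[start:]
        self._committed[(group, partition)] = len(log)  # commit after reading
        return records


# Load balancing inside one group: four partitions shared by two workers.
assignment = assign_partitions([0, 1, 2, 3], ["worker-a", "worker-b"])

# Fan-out across groups: each group reads the same log from its own position.
log = ["OrderPlaced#1", "OrderPlaced#2", "OrderPlaced#3"]
offsets = GroupOffsets()
email_batch = offsets.poll("email-workers", 0, log)
email_again = offsets.poll("email-workers", 0, log)  # nothing new for this group
analytics_batch = offsets.poll("analytics", 0, log)  # still gets the full stream

print(assignment, email_again, analytics_batch)
```

The email-workers group advancing its offset has no effect on what the analytics group sees, which is why one topic can serve both patterns at once.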

Choosing Your Broker: Redis vs. RabbitMQ vs. Kafka

| Property | Redis Pub/Sub | RabbitMQ | Apache Kafka |
| --- | --- | --- | --- |
| Persistence | None (in-memory only) | Optional (disk) | Always (disk) |
| Delivery guarantee | At-most-once (fire-and-forget) | At-most-once or at-least-once (with acknowledgments) | At-least-once by default; exactly-once with idempotent producers and transactions |
| Message replay | No | No | Yes (configurable retention) |
| Ordering | Best-effort | FIFO per queue | Strict within a partition |
| Throughput (rough, workload-dependent) | ~1M msg/s | ~50K–100K msg/s | 100K–1M+ msg/s |
| Routing | Channel name only | Powerful (4 exchange types) | Topic + partition key |
| Consumer model | Push (broadcast) | Push or pull | Pull (consumer groups) |
| Operational complexity | Very low | Moderate | High |
| Best for | Ephemeral real-time signals | Reliable task queues and routing | High-throughput, durable, multi-consumer streams |

Decision rule of thumb:

  • You need fast, ephemeral fan-out and already use Redis → Redis Pub/Sub
  • You need reliable task processing (each task handled by one worker, with routing by type) → RabbitMQ or AWS SQS
  • You need multiple independent consumers, message replay, or very high throughput → Kafka or AWS Kinesis

When in doubt, start with a managed queue (AWS SQS, Google Cloud Pub/Sub) before self-hosting RabbitMQ or Kafka. Managed services remove most of the operational burden and handle scaling for you. Take on the complexity of self-hosted infrastructure only when you have a concrete requirement that managed services cannot meet.

Dead Letter Queues: Your Safety Net for Failed Messages

In any event-driven system, some messages will fail to process. The consumer could crash, the message could be malformed, a database it depends on could be temporarily unavailable, or a schema change could make an older message format invalid. If you do nothing, the message either disappears silently or enters an infinite retry loop that blocks every healthy message behind it.

A Dead Letter Queue (DLQ) is the solution: a dedicated queue (or topic, in Kafka's terminology) where messages are automatically routed after they exceed the maximum number of delivery attempts. Instead of being lost or looping forever, the message is preserved in the DLQ for investigation and eventual replay once the root cause is fixed.
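
The retry-then-park behavior can be sketched as a small consumer loop. This is an illustration under simplifying assumptions: MAX_ATTEMPTS, drain, and handle_order are hypothetical names, and retries happen inline, whereas a real broker would redeliver the message between attempts.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # hypothetical retry budget


def drain(queue: deque, handler: Callable[[dict], None], dlq: list) -> list:
    """Process every message; park the ones that keep failing in the DLQ."""
    processed = []
    while queue:
        message = queue.popleft()
        last_error = ""
        for _attempt in range(MAX_ATTEMPTS):
            try:
                handler(message)
                processed.append(message)
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = repr(exc)
        else:
            # for/else: reached only when no attempt succeeded.
            dlq.append(
                {"message": message, "error": last_error, "attempts": MAX_ATTEMPTS}
            )
    return processed


def handle_order(message: dict) -> None:
    if "order_id" not in message:
        raise ValueError("missing order_id")


queue = deque([{"order_id": "o-1"}, {"oops": True}, {"order_id": "o-2"}])
dead_letters: list = []
done = drain(queue, handle_order, dead_letters)

print(done)
print(dead_letters)
```

The malformed message ends up in dead_letters together with its last error, and o-2 is still processed instead of being stuck behind it.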

Dead Letter Queue: Handling Poison Messages

A 'poison message' is one that cannot be processed successfully — perhaps because it's malformed, references data that no longer exists, or triggers a bug in the consumer. Without a DLQ, the broker keeps redelivering the poison message, blocking every healthy message behind it. The DLQ isolates the failure, lets normal processing continue, and gives you a safe place to investigate.


Common reasons messages end up in a DLQ:

  1. Malformed payload — the message body does not match the expected schema (a required field is missing, or the format changed between a producer version and a consumer version).
  2. Max retries exceeded — the consumer threw an exception N times; the broker gave up and moved the message to the DLQ.
  3. Message TTL expired — the message sat in the queue longer than its configured time-to-live before any consumer read it.
  4. Business logic failure — the message references an entity that no longer exists (e.g., a CustomerAddressUpdated event for a customer who has since been deleted).
  5. Consumer bug — a code defect causes the consumer to crash on a specific pattern of input that was not caught in testing.

DLQ implementation by broker:

  • AWS SQS: Configure a redrive policy on the source queue specifying maxReceiveCount and the DLQ ARN. SQS automatically moves messages to the DLQ after maxReceiveCount failed delivery attempts.
  • RabbitMQ: Add x-dead-letter-exchange (and optionally x-dead-letter-routing-key) arguments to the source queue definition. RabbitMQ then automatically routes messages to the specified dead-letter exchange when they are rejected or negatively acknowledged with requeue set to false, when their TTL expires, or when the queue exceeds its length limit.
  • Kafka: The broker and the plain consumer API have no built-in DLQ. Implement it in consumer code: catch processing exceptions, publish the failed message to a dedicated dead-letter topic (e.g., order.placed.dlq), commit the offset, and continue processing. After fixing the bug, replay the DLQ topic.
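
For the two brokers that support dead-lettering through configuration, the settings look roughly like the fragments below. This is a sketch that only builds the data structures: the ARN, exchange name, and routing key are hypothetical placeholders, and no AWS or RabbitMQ client is called.

```python
import json

# Hypothetical ARN, for illustration only.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"

# SQS: the redrive policy is a JSON string stored in the source queue's
# "RedrivePolicy" attribute.
sqs_attributes = {
    "RedrivePolicy": json.dumps(
        {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "5"}
    )
}

# RabbitMQ: optional arguments supplied when the source queue is declared.
# The exchange and routing-key names are assumptions.
rabbitmq_queue_arguments = {
    "x-dead-letter-exchange": "orders.dlx",
    "x-dead-letter-routing-key": "orders.failed",
}

print(sqs_attributes["RedrivePolicy"])
print(rabbitmq_queue_arguments)
```

With boto3, a dictionary like sqs_attributes is typically passed as the Attributes argument of set_queue_attributes. With pika, rabbitmq_queue_arguments would go in the arguments parameter of queue_declare.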

Event-Driven Design for AI and Agentic Systems

Event-driven architecture becomes especially important when building systems that involve AI agents. Agents often have long-running, unpredictable execution times — a reasoning step might take 2 seconds or 2 minutes. Coupling agent execution to synchronous API calls creates fragile systems that time out, cascade failures, and cannot scale the individual steps independently.

Event-Driven Multi-Agent Pipeline

An AI agent pipeline built on events instead of direct calls. Each agent publishes its output as an event; downstream agents subscribe and react independently. Adding a new agent (e.g., a Fact Checker) requires zero changes to existing agents — it simply subscribes to the relevant topic.


Why events solve the AI agent connection problem: In a direct-call architecture with 10 agents, you potentially need up to 45 point-to-point connections (n×(n-1)/2 for a fully connected graph). Every agent must know the address, protocol, and schema of every other agent it might call. Add an 11th agent and you may need 10 new connections. With an event broker, every agent connects to exactly one thing — the broker — and publishes or subscribes to named topics. Adding a new agent adds exactly one connection to the broker, not connections to every other agent.
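
The arithmetic behind those numbers is easy to check. The helper names below are made up for this sketch.

```python
def direct_connections(agents: int) -> int:
    """Links in a fully connected graph: n * (n - 1) / 2."""
    return agents * (agents - 1) // 2


def broker_connections(agents: int) -> int:
    """With a broker, each agent holds one connection: to the broker."""
    return agents


for n in (10, 11):
    print(n, direct_connections(n), broker_connections(n))

# Cost of adding the 11th agent under each design.
extra_direct = direct_connections(11) - direct_connections(10)
extra_broker = broker_connections(11) - broker_connections(10)
print(extra_direct, extra_broker)
```

The gap widens quickly: the direct-call count grows quadratically with the number of agents, while the broker count grows linearly.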

What AI Agents Get Wrong

AI Agents and Event-Driven Architecture

AI agents understand Pub/Sub and message queues conceptually, but they default to the simplest possible implementation — often Redis Pub/Sub regardless of whether messages need to survive consumer downtime. They also rarely implement Dead Letter Queues unless explicitly asked.


Summary

| Concept | Key Takeaway |
| --- | --- |
| Event-Driven Design | Events are immutable records of what happened, named in the past tense. Producers publish events; consumers subscribe and react independently. The broker fully decouples them. |
| Redis Pub/Sub | Sub-millisecond, fire-and-forget fan-out. At-most-once delivery: if a subscriber is offline, the message is permanently lost. Best for ephemeral real-time signals where dropped messages are acceptable. |
| RabbitMQ | Reliable task queues with smart routing via exchanges (Direct, Fanout, Topic, Headers). At-least-once delivery with acknowledgments. Messages are deleted after consumption, so there is no replay. |
| Kafka | Durable, append-only event log. Messages are retained for a configurable period. Consumer groups provide both load balancing (within a group) and fan-out (across groups). Built for high-throughput, multi-consumer, replayable event streams. |
| Consumer groups | In Kafka, consumer groups balance load within the group and allow independent fan-out across groups: one topic, many independent consumers, each tracking their own offset. |
| Dead Letter Queues | Your safety net for messages that fail processing after N retries. Always configure a DLQ in production; monitor its depth; implement redrive after fixing the root cause. |
| AI agent pipelines | EDA solves the connection explosion in multi-agent systems: O(n) broker connections vs. O(n²) direct calls. Each agent publishes its output as an event; downstream agents subscribe and react independently. |
| AI agent default | AI agents default to Redis Pub/Sub regardless of durability needs, and omit DLQs unless asked. Always specify your delivery guarantees and explicitly request DLQ configuration. |
