The outbox pattern: reliable event publishing in microservices

gabriel

You only notice unreliable event publishing after it hurts you.

An order exists in the database, but downstream services never hear about it. Inventory is wrong. Billing never fires. Support escalates. Engineering digs through logs and finds the uncomfortable truth: the write succeeded, but the event did not. Somewhere between a database commit and a message broker publish, reality split in two.

That gap is not a bug. It is a design flaw.

The outbox pattern exists because modern microservices routinely lie to themselves about reliability. They assume that writing to a database and publishing an event are “close enough” to atomic. They are not. The outbox pattern fixes this by redefining what it means to publish an event. Instead of sending messages directly to a broker, your service records the event as data, inside the same transaction as the business change. Only then does a separate process publish it outward.

This article is a practitioner’s guide to the outbox pattern. Not the idealized version. The one that survives crashes, retries, partial failures, and the uncomfortable math of throughput and backlog. If you run event-driven systems in production, this pattern is not optional. It is inevitable.

Why dual writes fail, even when they “usually work”

The classic microservice write path looks harmless:

  1. Begin database transaction

  2. Write business data

  3. Commit

  4. Publish event to Kafka or another broker

In a clean diagram, this works. In production, it fails in boring, repeatable ways.

The service can crash after the database commit but before the publish call completes. The broker can accept the message but drop the acknowledgement, causing retries and duplicates. Pods can be terminated mid-flight during deploys. Network partitions can turn synchronous publishes into probabilistic guesses.

These are not edge cases. They are expected behavior in distributed systems.
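The crash window is easy to see in code. A minimal sketch, using SQLite and a plain list standing in for the broker (the table, function, and flag names here are illustrative, not from any particular codebase):

```python
import sqlite3

def create_order(conn, broker_publish, order_id, crash_before_publish=False):
    """Naive dual write: the DB commit and the broker publish are independent."""
    with conn:  # transaction commits on exiting the block
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    # Crash window: if the process dies here, the order row survives
    # but the event is never published -- silent inconsistency.
    if crash_before_publish:
        raise RuntimeError("process killed after commit, before publish")
    broker_publish({"type": "OrderCreated", "order_id": order_id})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
published = []
try:
    create_order(conn, published.append, "o-1", crash_before_publish=True)
except RuntimeError:
    pass  # the "crash" happened after the commit
rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the simulated crash, `rows` is 1 and `published` is empty: the database and the event stream now disagree, and nothing in the system records that an event was ever owed.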

Multiple architecture guides from cloud providers describe this exact failure mode as the dual-write problem, where two independent systems cannot be updated atomically without distributed transactions, which most teams wisely avoid. The result is silent inconsistency, the worst kind to debug.

The outbox pattern works because it eliminates the dual write. You write once, to your database. Everything else becomes a downstream projection of that fact.


What experienced practitioners actually agree on

After reviewing canonical microservices guidance, vendor documentation, and real-world implementations, a few points are remarkably consistent.

Chris Richardson, founder of Microservices.io, describes the transactional outbox as a rule, not an optimization. Business state and events must be written in the same transaction, then published asynchronously by a separate component. The publisher is not part of the request path. That separation is the entire point.

Debezium’s engineering documentation emphasizes that the outbox exists to preserve causal consistency. If the database says something happened, the event stream must eventually say the same thing, even if failures occur between commit and publish.

Microsoft’s cloud architecture guidance frames outbox as a reliability boundary. Databases provide transactional guarantees. Message brokers do not. You anchor truth in the database, then stream outward using change feeds or background publishers.

Taken together, the consensus is clear. The outbox pattern does not promise exactly-once delivery. It promises something more useful: no lost intent. If your database committed a change, the system will eventually publish that fact, or you will have a durable record explaining why it did not.

How the outbox pattern works in practice

At its core, the outbox pattern adds one table and one responsibility.

The outbox table

The outbox table lives in the same database as your service’s business data. It stores events as rows, not messages in flight.

A typical outbox row includes:

  • A unique event ID

  • Aggregate or entity ID, such as an order ID

  • Event type

  • Serialized payload

  • Timestamp

  • Correlation or trace ID

  • Publish status or published timestamp

Crucially, this row is written inside the same database transaction as the business update.
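That single-transaction write is the heart of the pattern. A minimal sketch using SQLite for illustration; the table layout and column names follow the list above but are assumptions, not a prescribed schema:

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER);
CREATE TABLE outbox (
    event_id     TEXT PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    payload      TEXT NOT NULL,
    created_at   TEXT NOT NULL,
    published_at TEXT              -- NULL until the relay confirms publish
);
""")

def place_order(conn, order_id, total_cents):
    """Business write and outbox write commit (or roll back) together."""
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:  # one transaction covers both inserts
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_id, event_type, payload, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderCreated", payload,
             datetime.now(timezone.utc).isoformat()),
        )

place_order(conn, "o-42", 1999)
pending = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE published_at IS NULL").fetchone()[0]
```

If either insert fails, both roll back. There is no state in which the order exists without a pending event row.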

The relay or publisher

A separate process, sometimes part of the same service and sometimes external, reads pending outbox rows and publishes them to the message broker. Once published, it marks the row as sent or deletes it.

If the service crashes, the rows remain. If the broker is down, the rows accumulate. If publishing partially succeeds, retries happen from a known state.

This is the entire pattern. Simple in concept, subtle in execution.
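A polling relay pass can be sketched in a few lines. This is an illustrative, self-contained version (SQLite again, with a list standing in for the broker client); note that the row is marked only after the publish call returns, so a crash in between re-delivers the event rather than losing it:

```python
import sqlite3

def relay_once(conn, publish, batch_size=100):
    """One polling pass: read pending outbox rows, publish, then mark as sent."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        publish(payload)  # may raise; the row then stays pending for retry
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') "
                "WHERE event_id = ?", (event_id,))
    return len(rows)

conn = sqlite3.connect(":memory:")
with conn:
    conn.execute(
        "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT, published_at TEXT)")
    conn.executemany("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                     [("e-1", "a"), ("e-2", "b")])
sent = []
relay_once(conn, sent.append)
remaining = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE published_at IS NULL").fetchone()[0]
```

Because "publish" and "mark sent" are two steps, a crash between them produces a duplicate on the next pass. That is the at-least-once behavior discussed below, and it is the correct trade.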


Choosing a relay strategy without overthinking it

There are three common ways teams implement the publishing side. None are universally correct.

| Approach | Description | Strengths | Tradeoffs |
| --- | --- | --- | --- |
| Polling relay | Periodically queries the outbox table | Simple, portable, fast to ship | Polling load, tuning required |
| CDC-based relay | Uses change data capture to stream rows | High throughput, low app code | Operational complexity |
| Framework-managed outbox | Messaging framework persists messages | Strong ergonomics | Vendor coupling |

If you already operate Kafka Connect or similar infrastructure, CDC-based outbox routing can be elegant and scalable. If you do not, polling is often the fastest way to correctness. Many teams start with polling and evolve later.

The mistake is not choosing the “wrong” option. The mistake is pretending the problem does not exist.

How to implement outbox safely, step by step

Step 1: Design the outbox schema like a product

Outbox tables are operational infrastructure. Treat them that way.

Include explicit identifiers, timestamps, and correlation fields. Avoid opaque blobs without metadata. You will need to debug, replay, and audit these rows under pressure.

If your broker uses partitioning for ordering, store the partition key explicitly. Do not reconstruct it later.

Step 2: Accept that duplicates will happen, and design for them

Outbox prevents message loss. It does not eliminate duplicates.

Retries, crashes, and ambiguous acknowledgements all produce at-least-once delivery. This is normal.

Practical systems handle this by:

  • Including a stable event ID in every message

  • Making consumers idempotent

  • Optionally maintaining an inbox or deduplication table on the consumer side
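The inbox approach combines the last two points: record the event ID in the same transaction as the business effect, and let a primary-key constraint reject replays. A minimal sketch, with illustrative table and function names:

```python
import sqlite3

def handle_event(conn, event_id, apply_change):
    """Idempotent consumer: the inbox insert and the state change share one
    transaction. A duplicate delivery violates the primary key, rolls back,
    and is safely skipped."""
    try:
        with conn:
            conn.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
            apply_change()  # business effect happens at most once per event_id
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: acknowledge without re-applying

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
applied = []
first = handle_event(conn, "e-1", lambda: applied.append("decrement stock"))
second = handle_event(conn, "e-1", lambda: applied.append("decrement stock"))
```

Here `first` is True, `second` is False, and the stock decrement ran exactly once even though the event arrived twice.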

If you rely on broker-level “exactly once” features, treat them as an optimization, not a guarantee.

Step 3: Build the relay with backpressure and observability

A production-ready relay does more than loop and publish.

It batches rows to reduce overhead. It limits in-flight publishes to protect the broker. It retries with jitter. It exposes metrics like backlog size, publish latency, and failure counts.
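The jittered retry piece, at least, is small. One common scheme ("full jitter": each delay is drawn uniformly from zero up to an exponentially growing cap) can be sketched as follows; the base, cap, and attempt count are placeholder values to tune per system:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)].
    Randomization prevents a thundering herd when many relays retry at once."""
    return [random.uniform(0, min(cap, base * (2 ** n))) for n in range(attempts)]

delays = backoff_delays()
```

Sleeping for these delays between publish attempts spreads retries out in time, which matters most when the broker is recovering and every relay in the fleet wants to flush its backlog simultaneously.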

If your relay fails silently, you have simply moved the problem.

Step 4: Practice failure before failure finds you

Before shipping, simulate:

  • A crash between commit and publish

  • A crash after publish but before marking the row

  • Broker unavailability for extended periods

  • Consumer-side duplicate handling


If these scenarios are not boring, your system is not ready.

A concrete sizing example with real numbers

Consider an Orders service processing 500 writes per second at peak. Each event payload averages 1 KB.

That produces roughly:

  • 500 KB per second

  • 30 MB per minute

  • 1.8 GB per hour

If you retain outbox rows for 24 hours before cleanup, you are storing over 40 GB of data.

This surprises teams the first time they measure it.

Two implications matter:

  • Retention policies are not optional. Clean up aggressively once rows are published and safe to discard or archive.

  • Relay throughput determines recovery time. If writes arrive at 500 per second and the broker is down for 10 minutes, you accumulate 300,000 rows. A relay publishing 1,000 events per second clears that backlog in about five minutes if traffic has quieted; if peak writes continue, the net drain rate is only 500 rows per second and recovery takes about ten.
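This arithmetic is worth keeping in executable form in a runbook. A small sketch using the numbers from this example (decimal units, matching the 1.8 GB per hour figure above):

```python
def outbox_storage_gb(writes_per_sec, payload_kb, retention_hours):
    """Storage consumed if rows are retained this long before cleanup."""
    gb_per_hour = writes_per_sec * payload_kb * 3600 / 1_000_000
    return gb_per_hour * retention_hours

def drain_minutes(backlog_rows, publish_per_sec, ingest_per_sec=0):
    """Minutes to clear a backlog while new writes may still be arriving."""
    net_rate = publish_per_sec - ingest_per_sec  # rows cleared per second
    return backlog_rows / net_rate / 60

storage = outbox_storage_gb(500, 1, 24)   # ~43 GB at 24-hour retention
backlog = 10 * 60 * 500                   # 10-minute outage at 500 writes/s
quiet = drain_minutes(backlog, 1000)      # traffic has died down
peak = drain_minutes(backlog, 1000, 500)  # still at peak load
```

Plugging in your own rates before an incident, rather than during one, is the entire value of this exercise.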

These numbers belong in your runbooks. Reliability is math, not hope.

Common questions teams ask too late

Do I still need idempotent consumers if I use outbox?

Yes. Outbox guarantees durability of intent, not uniqueness of delivery.

Should outbox rows be deleted or updated after publish?

Most teams either delete rows after successful publish or move them to a cold archive with a TTL. Leaving them indefinitely is rarely worth the cost.

Is polling “good enough”?

Often, yes. Polling is predictable, easy to reason about, and sufficient for many workloads. CDC shines at very high throughput or when you already operate the tooling.

Does this only apply to relational databases?

No. Systems like Cosmos DB implement the same idea using transactional batches and change feeds. The pattern is conceptual, not relational.

Honest takeaway

The outbox pattern adds complexity. There is no way around that. You add a table, a relay, operational metrics, and failure modes you now own.

What you remove is worse. You eliminate silent data loss. You replace guesswork with durable state. You gain the ability to replay, reason, and recover.

If you build event-driven microservices long enough, you will implement the outbox pattern. The only real decision is whether you do it deliberately, or under incident pressure at 2 a.m.

With over a decade of distinguished experience in news journalism, Gabriel has established herself as a masterful journalist. She brings insightful conversation and deep tech knowledge to Technori.