Your app does not fail “real-time” when the median request gets slow. It fails when one user sees a cursor freeze, a bid arrives late, a dashboard skips a market move, or a multiplayer action lands after the moment has passed.
A real-time application is any system where users expect state changes to arrive almost immediately, usually through WebSockets, server-sent events, WebRTC data channels, or pub/sub messaging. Scaling one means keeping that expectation intact as users, rooms, devices, regions, and message volume grow.
The trick is not “add more servers.” That works for stateless HTTP. Real-time systems are different because connections are long-lived, state is hot, ordering matters, and queues can silently turn a healthy-looking system into a latency casino.
Start by Treating Latency as a Budget, Not a Vibe
Gil Tene, CTO of Azul Systems, has spent years warning engineers that tail latency, not average latency, defines user experience. His point is brutal but useful: if your p99 is bad, real users are living there, especially in distributed systems where one screen often depends on many calls.
Netflix’s concurrency-limits project takes a similar view from the operational side. Instead of only limiting requests per second, it uses TCP-style congestion ideas to detect how much concurrency a service can handle before latency degrades. Martin Thompson, low-latency systems engineer, popularized “mechanical sympathy,” the idea that predictable memory access, batching, and ownership patterns matter when you are chasing consistent latency.
Together, they suggest a boring but powerful rule: scale real-time apps by controlling queues, locality, fan-out, and overload. Not by hoping autoscaling catches up.
Design the Connection Layer Separately From the Work Layer
Your WebSocket servers should not also be your business logic, database writer, analytics processor, notification engine, and retry machine. That is how latency becomes unpredictable.
Use a thin connection tier. Its job is to terminate connections, authenticate clients, route messages, track presence, and apply backpressure. Push heavier work into queues, streams, workers, or dedicated services.
AWS API Gateway WebSocket APIs, for example, publish clear quotas around connection rate and frame size, including default limits around new connections per second and WebSocket frame sizes. That kind of limit should shape your design before launch day, not during an incident.
Cloudflare Durable Objects offer another pattern: colocate stateful coordination close to users, while keeping idle connections lightweight. That is useful when your app has many idle-but-connected clients, like chat, collaboration, or presence-heavy products.
Shard by the Thing Users Actually Share
Do not shard randomly first. Shard by the unit of coordination.
For chat, that unit is often a room. For multiplayer, it is a match. For collaborative docs, it is a document. For trading, it might be an instrument or a topic. This keeps ordering local and avoids cross-node gossip for every message.
Here is the mental model:
| Workload | Bad shard key | Better shard key |
|---|---|---|
| Chat | user_id | room_id |
| Live docs | request_id | document_id |
| Multiplayer | region only | match_id plus region |
| Market data | connection_id | symbol/topic |
Worked example: say you have 100,000 connected users across 10,000 chat rooms. If one viral room has 8,000 users, CPU is not your first bottleneck. Fan-out is.
One 500-byte message becomes roughly 4 MB of outbound traffic for that room, before protocol overhead. At 20 messages per second, that one room can push around 80 MB/s. You do not fix that with random load balancing. You isolate the hot room, cap send buffers, compress carefully, and decide whether every client needs every event.
Kill Queues Before They Become Latency Debt
Queues are not bad. Hidden queues are bad.
Every real-time system has queues: socket send buffers, broker partitions, worker pools, database pools, retry queues, CDN edges, and client reconnect storms. When producers outrun consumers, latency grows even if the CPU looks fine.
Kafka-style systems make this visible through consumer lag. Slow consumers and throughput spikes are usually the first warning signs. Redis also ships latency monitoring tools because a cache that is “usually fast” can still create ugly p99 spikes.
Here’s how to control it:
- Set per-client send buffer limits.
- Drop low-value events under pressure.
- Coalesce updates, especially presence and cursor movement.
- Use idempotent message IDs for safe retries.
- Alert on p95 and p99, not averages.
The uncomfortable truth: sometimes the correct real-time behavior is to send less.
Scale With Backpressure, Not Just Autoscaling
Autoscaling helps, but it reacts after the load appears. Kubernetes HPA can scale workloads based on CPU, memory, or custom metrics, but infrastructure scaling alone is too slow to protect a sub-100 ms interaction loop.
Real-time apps need immediate local defenses: concurrency limits, admission control, per-topic rate limits, priority queues, and fast failure. Netflix’s concurrency-limits library is a good example of the pattern: measure latency, infer saturation, and reduce concurrency before the service melts.
The best production setup usually has both layers. Backpressure protects the system in milliseconds. Autoscaling adds capacity over seconds or minutes.
Put Users Near State, or Put State Near Users
Global real-time apps suffer when every event crosses an ocean.
Modern real-time platforms increasingly focus on edge routing, regional presence, failover, and message ordering because consistency matters more than one flashy benchmark number.
For your app, that means picking one of three paths:
Use regional rooms when collaboration is mostly local. Use global pub/sub when users truly span regions. Use edge state when coordination needs to stay physically close to clients.
No one really escapes physics. You can only choose where the speed-of-light tax shows up.
Measure the User-Visible Path
Do not stop at server metrics. Measure publish-to-deliver latency from the sender’s client to the receiver’s client. Add timestamps at the client, edge, broker, worker, and receiving client. Track reconnect time, missed messages, duplicate messages, and time spent in each queue.
Your dashboard should answer one question fast: “Where did the event wait?”
Real-time scaling becomes much less mysterious when you can see whether latency came from the broker, database, fan-out node, mobile network, GC pause, slow consumer, or reconnect storm.
FAQ
Should I use WebSockets, SSE, or WebRTC?
Use WebSockets for bidirectional app messaging, SSE for one-way server updates, and WebRTC data channels when peer-to-peer ultra-low-latency interaction matters. Most SaaS collaboration, chat, dashboards, and notifications fit WebSockets.
Is Kafka good for real-time apps?
Kafka is excellent for durable event pipelines, but it is not automatically a low-latency fan-out layer for connected clients. Use it behind the real-time edge, then bridge processed events into a pub/sub or connection layer.
What is a good latency target?
Pick by product behavior. Chat can often tolerate hundreds of milliseconds. Collaborative editing feels worse above roughly 100 to 200 ms. Games, auctions, and trading systems need stricter budgets. The key is to define p95 and p99 targets, not just median.
Honest Takeaway
Scaling real-time applications is mostly a fight against hidden queues. Keep connection handling thin, shard by coordination unit, apply backpressure early, measure tail latency, and move state closer to users where it matters.
The boring architecture wins: fewer cross-region hops, fewer surprise fan-outs, fewer unbounded buffers, and fewer services pretending averages tell the truth.
