Heavy traffic does not usually break your system all at once. It starts politely. A few requests take longer. Then queues form. Then retries multiply. Then one slow backend turns into a small distributed stampede.
Load balancing is the practice of spreading incoming requests across multiple servers, instances, regions, or services so no single part becomes the bottleneck. Under heavy traffic, the goal is not just “share the load.” The real goal is to keep tail latency, especially p95 and p99, from exploding.
Why Round Robin Stops Working When Traffic Gets Spiky
Round robin is fine when every request costs roughly the same, and every backend is equally healthy. That is rarely true in production.
One request might hit cache and finish in 8 ms. Another might call three downstream services and take 900 ms. If you distribute both evenly, you can still overload one server with slow in-flight work while another sits relatively free.
That is why latency-aware systems often use least outstanding requests, least request, or weighted routing instead of plain round robin.
| Technique | Best for | Latency benefit |
|---|---|---|
| Least outstanding requests | Uneven request duration | Avoids piling onto busy servers |
| Weighted routing | Mixed backend capacity | Sends more traffic to stronger nodes |
| Consistent hashing | Cache-heavy workloads | Improves cache locality |
| Adaptive concurrency | Overload protection | Caps in-flight requests dynamically |
Use Least-Request Routing Before You Add More Servers
A common mistake is treating latency as a capacity problem first. Sometimes it is. But often you already have enough capacity and are just feeding it badly.
With least-request routing, the balancer asks a better question: “Which backend looks least busy right now?” That matters when backend latency varies. Under load, one slow instance naturally accumulates more in-flight requests. A least-request algorithm notices that and sends new work elsewhere.
Here’s a simple worked example. Imagine 10 servers handling 1,000 requests per second. With round robin, each gets 100 requests per second. But if Server 4 starts taking 5x longer because of GC pauses or a noisy neighbor, its queue grows. At a steady 100 requests per second, a 5x jump in service time means roughly five times as many requests in flight on that server, which can push it past saturation. Least-request routing reduces new traffic to that instance while it recovers, keeping p99 latency lower for everyone else.
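To put rough numbers on that example, here is a small sketch using Little's law (average in-flight requests = arrival rate × time in system). The 40 ms baseline service time and the 8 worker slots per server are assumptions picked for illustration, not figures from any real fleet.

```python
def avg_in_flight(arrival_rps: float, service_time_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return arrival_rps * service_time_s


def max_sustainable_rps(worker_slots: int, service_time_s: float) -> float:
    """Highest arrival rate a server can absorb before its queue starts growing."""
    return worker_slots / service_time_s


# Assumed numbers for illustration only.
ARRIVAL_RPS = 100.0       # round robin share for one of 10 servers
BASE_SERVICE_S = 0.040    # hypothetical healthy service time
WORKER_SLOTS = 8          # hypothetical concurrent requests one server can process

# Healthy server: about 4 requests in flight, well under the 8 slots.
healthy = avg_in_flight(ARRIVAL_RPS, BASE_SERVICE_S)

# Server 4 at 5x service time: would need about 20 slots but has only 8.
degraded = avg_in_flight(ARRIVAL_RPS, BASE_SERVICE_S * 5)

# The degraded server can only keep up with about 40 requests per second.
ceiling = max_sustainable_rps(WORKER_SLOTS, BASE_SERVICE_S * 5)
```

Under these assumptions, round robin keeps sending Server 4 about 100 requests per second while it can only drain about 40, so the backlog grows until something times out.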
This is not magic. If every backend is saturated, least-request routing cannot invent capacity. But it prevents one bad node from quietly becoming the tail-latency tax on the whole fleet.
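A minimal sketch of the selection step, assuming the balancer already tracks an in-flight counter per backend. It uses the "sample a few, pick the least busy" variant rather than scanning the whole fleet. The `Backend` class and `pick_least_request` function are hypothetical names for illustration, not any specific proxy's API.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    in_flight: int = 0  # requests sent to this backend that have not finished yet


def pick_least_request(backends, sample_size=2, rng=random):
    """Sample a few distinct backends and return the one with the fewest in-flight requests."""
    if not backends:
        raise ValueError("no backends to choose from")
    k = min(sample_size, len(backends))
    candidates = rng.sample(backends, k)
    return min(candidates, key=lambda b: b.in_flight)


# Usage: the caller increments in_flight when it dispatches a request
# and decrements it when the response (or a timeout) comes back.
fleet = [Backend("s1", 3), Backend("s2", 4), Backend("s3", 2), Backend("s4", 40)]
target = pick_least_request(fleet)
target.in_flight += 1
```

With two distinct samples, the single busiest backend is never chosen, because it always loses the comparison. That is the effect described above: the slow node stops receiving new work without anyone having to declare it unhealthy.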
Add Health Checks, Outlier Detection, and Fast Ejection
Health checks answer the easy question: “Is this backend alive?” Outlier detection answers the more useful one: “Is this backend technically alive but hurting users?”
Modern load balancers and proxies can eject unhealthy or degraded targets from rotation. That means a backend does not need to fully crash before traffic moves away from it. Repeated 5xx errors, timeout spikes, or abnormal latency can all be treated as signs that the node should be temporarily removed.
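As a rough illustration of passive ejection, the sketch below ejects a backend after a run of consecutive 5xx responses and lets it back in after a cooldown. The threshold of 5 errors and the 30-second ejection window are assumed values, and `OutlierDetector` is a hypothetical class. Production proxies typically add more signals, such as latency and success rate.

```python
from dataclasses import dataclass


@dataclass
class OutlierDetector:
    max_consecutive_errors: int = 5   # assumed threshold
    ejection_seconds: float = 30.0    # assumed cooldown
    consecutive_errors: int = 0
    ejected_until: float = 0.0        # timestamp after which traffic may return

    def record(self, status_code: int, now: float) -> None:
        """Update the error streak from one response and eject if the streak is too long."""
        if status_code >= 500:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                self.ejected_until = now + self.ejection_seconds
                self.consecutive_errors = 0
        else:
            # Any non-5xx response breaks the streak.
            self.consecutive_errors = 0

    def is_routable(self, now: float) -> bool:
        """True when the backend is not currently ejected."""
        return now >= self.ejected_until
```

The caller passes in `now` (for example from `time.monotonic()`), which keeps the logic easy to test. The balancer would skip any backend whose detector reports `is_routable(now)` as false.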
The practical rule: do not wait for a server to die before you stop trusting it.
Use Consistent Hashing When Cache Locality Matters
For cache-heavy systems, random distribution can sabotage latency. If every request for the same user, object, or session lands on a different backend, your cache hit rate drops. Then your database absorbs the pain.
Consistent hashing routes related requests to the same backend based on a stable key, such as user ID, tenant ID, session ID, or object ID. That can reduce latency by improving cache locality and reducing repeated downstream lookups.
Use it carefully. Hashing can create hot spots when one tenant or key gets much more traffic than others. For multi-tenant SaaS systems, combine consistent hashing with safeguards like per-tenant rate limits, shard splitting, or weighted hashing.
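Here is a small hash ring sketch to make the idea concrete. It assumes string node names and uses 100 virtual nodes per backend to even out the distribution. Both choices are illustrative, as is the `HashRing` class itself.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("ring needs at least one node")
        # Each node owns many points on the ring so load spreads more evenly.
        ring = [(self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)]
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._owners = [n for _, n in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Stable across processes, unlike Python's built-in hash() for strings.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the owner of the first ring point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("tenant-42")  # the same key always maps to the same backend
```

The useful property shows up when a node leaves: only the keys that lived on that node move, so the rest of the fleet keeps its warm cache. The sketch does nothing about hot keys, which is why the safeguards above still matter.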
Cap Concurrency Before Queues Become Latency Debt
Queues are latency in disguise. Once a backend accepts too many concurrent requests, every new request waits behind work that may already be doomed.
Adaptive concurrency tackles this directly. Instead of letting every backend accept unlimited work, the proxy or load balancer adjusts how many requests can be in flight based on observed latency. When latency rises, the system tightens the valve.
This works especially well with timeouts, retries, and circuit breakers. Without concurrency limits, retries can amplify overload. With limits, excess requests fail quickly or get shed before they turn the whole system into wet cement.
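One simple way to sketch an adaptive limit is AIMD (additive increase, multiplicative decrease): grow the limit by one after a fast response, cut it sharply after a slow one. This is a simplified stand-in, not the algorithm any particular proxy uses. The 200 ms latency target and the other defaults are assumptions, and the class is not thread-safe.

```python
class AimdLimiter:
    def __init__(self, target_latency_s=0.2, initial=20, min_limit=1, max_limit=200, backoff=0.5):
        self.target_latency_s = target_latency_s  # assumed latency goal
        self.limit = initial                      # current cap on in-flight requests
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff = backoff                    # multiplicative decrease factor
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit the request if there is room, otherwise shed it immediately."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, latency_s: float) -> None:
        """Record a finished request and adjust the limit from its latency."""
        self.in_flight -= 1
        if latency_s > self.target_latency_s:
            # Slow response: tighten the valve quickly.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Fast response: probe for a little more capacity.
            self.limit = min(self.max_limit, self.limit + 1)
```

A caller would wrap each backend call with `try_acquire()` and `release(latency)`, returning a fast failure such as a 503 when `try_acquire()` is false, so excess work is rejected in microseconds instead of waiting in a queue.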
Route by Geography, Zone, and Capacity, Not Just Availability
Under heavy traffic, “available” is too low a bar. A region can be available and still slow.
Geo-aware and zone-aware load balancing reduce latency by keeping users close to compute, while capacity-aware routing prevents one region from absorbing more traffic than it can handle.
The nuance: cross-region failover protects availability, but it can increase latency. A slower response from another region beats an outage, but it should not become your normal path unless you designed for it.
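A rough sketch of that decision follows: prefer the closest healthy region that still has headroom, and only spill over when it does not. The `Region` fields, the 80% utilization ceiling, and the example numbers are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    rtt_ms: float        # measured round-trip time from the client
    capacity_rps: float  # what the region can serve (assumed known and nonzero)
    current_rps: float
    healthy: bool = True

    @property
    def utilization(self) -> float:
        return self.current_rps / self.capacity_rps


def choose_region(regions, max_utilization=0.8):
    """Pick the nearest healthy region with headroom, else the least loaded healthy one."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    with_headroom = [r for r in healthy if r.utilization < max_utilization]
    if with_headroom:
        return min(with_headroom, key=lambda r: r.rtt_ms)
    # Everything is hot: latency will suffer anyway, so spread the pain.
    return min(healthy, key=lambda r: r.utilization)


regions = [
    Region("eu-west", rtt_ms=12, capacity_rps=1000, current_rps=950),
    Region("eu-central", rtt_ms=25, capacity_rps=1000, current_rps=400),
    Region("us-east", rtt_ms=90, capacity_rps=2000, current_rps=500),
]
chosen = choose_region(regions)  # eu-central: closest region under the ceiling
```

Note that the far region is only picked when nothing closer has room, which keeps cross-region traffic an exception rather than the default path.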
Build a Latency-First Load Balancing Playbook
Start with metrics. Track p50, p95, p99, active connections, queue depth, retry rate, backend saturation, and per-target error rates. Average latency is a comfort blanket. Tail latency is where users feel the fire.
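To see why the average hides the problem, here is a small nearest-rank percentile sketch over a made-up sample: 98 requests at 10 ms and 2 requests at 2,000 ms. The numbers are invented for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: mostly fast, with two very slow requests.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)   # 49.8
p50_ms = percentile(latencies_ms, 50)     # 10
p95_ms = percentile(latencies_ms, 95)     # 10
p99_ms = percentile(latencies_ms, 99)     # 2000
```

The mean of about 50 ms describes no request that actually happened, while p99 shows that 1 in 50 users waited two seconds.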
Then apply techniques in this order:
- Replace round robin with least-request or least-outstanding-requests.
- Add active health checks and passive outlier detection.
- Set timeouts, retry budgets, and circuit breakers.
- Add adaptive concurrency or explicit in-flight request caps.
- Use consistent hashing only where locality improves performance.
The key idea is simple: a load balancer is not a traffic sprinkler. It is the control plane for user experience.
FAQ
What load-balancing algorithm gives the lowest latency?
Usually least-request or least-outstanding-requests routing for dynamic workloads. For cache-heavy workloads, consistent hashing may win.
Should I use retries under heavy traffic?
Yes, but only with budgets, jitter, and timeouts. Unlimited retries create more load exactly when your system has the least room.
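A minimal sketch of those two controls, assuming a retry budget expressed as a fraction of regular requests (10% by default here) and exponential backoff with full jitter. `RetryBudget` and `backoff_delay` are hypothetical names. A production budget would usually count over a sliding time window rather than forever.

```python
import random


class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of regular requests."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio   # assumed: at most 10% extra load from retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        """Spend one unit of budget if available, otherwise refuse the retry."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt, base_s=0.05, cap_s=2.0, rng=random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))
```

When `can_retry()` returns false, the caller gives up and surfaces the error instead of adding load. The jitter spreads the retries that are allowed so they do not arrive in synchronized waves.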
Is horizontal scaling enough?
Not always. More servers help only if traffic distribution, health checks, concurrency limits, and downstream capacity keep up.
Honest Takeaway
The fastest systems do not just add more machines. They route smarter, shed earlier, and treat p99 latency as a first-class reliability signal.
Start with least-request routing and real health-aware ejection. Then add adaptive concurrency and retry discipline. That combination usually cuts more latency under heavy traffic than another panicked autoscaling rule.

