You don’t notice latency until you do. Everything feels instantaneous, right up until your API starts returning in 800ms instead of 80ms, your checkout flow drops conversions, or your distributed system begins behaving like a set of loosely coordinated guesses.
At its core, network latency is the time it takes for data to travel from one point to another, measured in milliseconds. But in modern architectures, especially cloud-native, microservices-heavy systems, latency is no longer just a “network problem.” It is a system behavior that emerges from dozens of small decisions, routing paths, protocols, and dependencies.
If you’re building anything beyond a monolith, understanding latency is no longer optional. It is the difference between a system that scales cleanly and one that slowly collapses under its own complexity.
What Experts Are Actually Seeing in Production
We dug into recent talks, engineering blogs, and postmortems from companies operating at scale, and a consistent theme emerged.
Cindy Sridharan, distributed systems engineer, has repeatedly pointed out that engineers underestimate how latency compounds across service boundaries. A single 50ms hop becomes hundreds of milliseconds once you chain 10 services together. The system feels slow not because any one component is bad, but because everything is slightly imperfect.
Brendan Gregg, performance engineer at Netflix, emphasizes that most latency is not where engineers expect it. Teams often blame the network, but real bottlenecks show up in kernel queues, TCP retransmits, or even DNS lookups. In other words, what looks like “network latency” is often systemic.
Charity Majors, Honeycomb co-founder, has argued that modern observability reveals a harsh truth. Tail latency, not averages, is what users experience. The slowest 1 percent of requests define your reliability more than the median.
Put together, these perspectives suggest something uncomfortable but useful. Latency is not a single metric you optimize. It is a distributed property of your architecture, and small inefficiencies multiply quickly.
Latency Is Not Just Distance, It Is Physics Plus Decisions
Let’s start with the basics. Latency has a physical floor: data cannot travel faster than light, and in fiber it moves at roughly two-thirds of that, about 200,000 km per second.
A quick back-of-the-envelope example:
- New York to London is about 5,500 km
- Round-trip is 11,000 km
- Minimum latency is roughly 55ms
That is your theoretical best case. In reality, routing inefficiencies, switching, and congestion push that higher.
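The arithmetic above is easy to script. A minimal sketch, using the rough figures from the example (200,000 km/s in fiber, 5,500 km New York to London):

```python
# Rough propagation-delay floor for a fiber link.
FIBER_SPEED_KM_PER_S = 200_000  # light in fiber travels at ~2/3 of c

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical best-case round-trip time in milliseconds."""
    round_trip_km = distance_km * 2
    return round_trip_km / FIBER_SPEED_KM_PER_S * 1000

print(min_rtt_ms(5_500))  # New York -> London floor: 55.0 ms
```

Real paths add routing detours, switching, and queuing on top of this floor, so measured RTTs are always higher.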
But in modern systems, distance is often not the dominant factor. Instead, latency is shaped by layers:
- Network propagation
- Transmission delays
- Queuing delays
- Processing delays
Most engineers optimize the first one. The last three are where the real gains are.
This is similar to how search engines evaluate relevance beyond keywords; they look at relationships between components, not just isolated signals. Latency behaves the same way. It is contextual, not isolated.
Why Latency Explodes in Modern Architectures
Monoliths had many problems, but latency was not usually one of them. Function calls were in-process. Data stayed local.
Microservices changed that.
Now, a single user request might look like this:
- API Gateway → Auth Service → User Service → Payment Service → Inventory Service → Notification Service
Each hop introduces:
- Network round-trip
- Serialization and deserialization
- Retry logic
- Load balancer decisions
Even if each service adds just 20ms, six services add 120ms. Add retries and tail latency, and you are suddenly at 300ms plus.
This is why distributed systems engineers talk about “latency budgets.” If your SLA is 200ms, you must allocate that budget across every service in the chain.
Without that discipline, latency grows invisibly.
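One way to make that discipline concrete is to write the budget down and check measurements against it. A hypothetical sketch (the service names and per-hop numbers are illustrative, summing to a 200ms SLA):

```python
# Allocate a latency budget across a request path and flag overruns.
SLA_MS = 200

# Hypothetical per-hop budgets; they must sum to <= SLA_MS.
budget = {
    "gateway": 10, "auth": 30, "user": 40,
    "payment": 60, "inventory": 40, "notify": 20,
}

def over_budget(measured: dict) -> list:
    """Return the hops whose measured latency exceeds their budget."""
    return [hop for hop, ms in measured.items() if ms > budget.get(hop, 0)]

measured = {"gateway": 8, "auth": 45, "user": 35,
            "payment": 55, "inventory": 38, "notify": 12}
print(over_budget(measured))  # -> ['auth']
```

The point is not the tooling; it is that every hop has an explicit number someone can be paged about.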
The Mechanics That Actually Drive Latency
Let’s get concrete. When you measure latency in production, these are the usual suspects.
TCP Handshakes and Connection Setup
Every new connection requires:
- SYN
- SYN-ACK
- ACK
That is one full round trip before any data flows. A TLS handshake adds at least one more round trip on top: two in TLS 1.2, one in TLS 1.3.
If your service opens new connections for every request, you incur this cost repeatedly.
Serialization Overhead
JSON is human-readable. It is also slow.
Switching from JSON to Protobuf or MessagePack can significantly reduce payload size and parsing time. At scale, this matters.
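Protobuf and MessagePack are external libraries, but the size gap is visible even with the standard library's `struct` packing a fixed schema (the record fields here are made up for illustration):

```python
import json
import struct

record = {"user_id": 42, "balance_cents": 19_999, "active": True}

as_json = json.dumps(record).encode()
# Fixed binary schema: two unsigned 32-bit ints plus a bool flag.
as_binary = struct.pack(
    "<II?", record["user_id"], record["balance_cents"], record["active"]
)

# The JSON payload is several times larger than the 9-byte binary form.
print(len(as_json), len(as_binary))
```

Schema-based formats win because they do not repeat field names in every payload; the tradeoff is that both sides must agree on the schema.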
DNS Resolution
Surprisingly, DNS can add tens of milliseconds if not cached properly. In high-throughput systems, this becomes a hidden tax.
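A minimal sketch of in-process resolver caching (a real version would wrap `socket.getaddrinfo`; the stand-in resolver here just counts lookups so the cache effect is visible):

```python
import functools

lookup_count = 0

@functools.lru_cache(maxsize=1024)
def resolve(hostname: str) -> str:
    """Cached resolver. A real one would call socket.getaddrinfo here."""
    global lookup_count
    lookup_count += 1
    return "10.0.0.1"  # stand-in for a real lookup result

for _ in range(1000):
    resolve("api.example.com")  # only the first call pays the lookup cost

print(lookup_count)  # -> 1
```

One caveat: `lru_cache` has no expiry, while real DNS records carry TTLs that a production cache must honor.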
Queuing and Contention
Latency spikes often come from queues:
- Thread pools
- Database connections
- Message brokers
This is where tail latency emerges. Most requests are fast. A few wait in line and become slow.
Retries and Timeouts
Retries are essential for resilience. But they also multiply latency if not bounded.
A call with a 100ms timeout and three retries can consume 400ms before anything fails back to the caller.
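Bounding retries with an overall deadline keeps the worst case predictable. A hypothetical sketch, where `call` stands in for the real RPC:

```python
import time

def call_with_deadline(call, deadline_s: float, per_try_timeout_s: float = 0.1):
    """Retry until success or until the overall deadline expires."""
    start = time.monotonic()
    while True:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        try:
            # Never wait longer than the time left in the budget.
            return call(timeout=min(per_try_timeout_s, remaining))
        except TimeoutError:
            continue  # retry; the loop re-checks the deadline first
```

The key design choice is that the deadline caps total waiting regardless of how many attempts happen, so retries can never multiply latency past the budget.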
How to Actually Reduce Latency in Practice
Here’s where theory meets engineering tradeoffs. There is no single fix. You need layered strategies.
1. Collapse Unnecessary Network Hops
Start by mapping your request path.
Ask yourself:
- Can two services be merged?
- Can data be cached upstream?
- Can you precompute responses?
Pro tip: Many teams discover that 20 to 30 percent of service calls are avoidable.
2. Reuse Connections Aggressively
Use connection pooling and keep-alive.
This avoids repeated TCP and TLS handshakes. In high-QPS systems, this alone can cut latency by double-digit percentages.
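Most HTTP clients do this for you via a session or pool object, but the mechanism is simple enough to sketch generically: hand out idle connections instead of opening new ones. The counter below stands in for the handshake cost:

```python
import collections

class ConnectionPool:
    """Toy pool: reuse idle connections instead of re-paying setup cost."""

    def __init__(self, factory, max_idle: int = 10):
        self._factory = factory              # creates a new connection (expensive)
        self._idle = collections.deque(maxlen=max_idle)

    def acquire(self):
        return self._idle.popleft() if self._idle else self._factory()

    def release(self, conn):
        self._idle.append(conn)              # keep it warm for the next caller

created = 0
def open_conn():
    global created
    created += 1
    return object()  # stand-in for a real socket plus TLS handshake

pool = ConnectionPool(open_conn)
for _ in range(100):
    conn = pool.acquire()
    pool.release(conn)

print(created)  # -> 1: one handshake served 100 requests
```

A real pool also needs health checks and idle timeouts, since servers close stale keep-alive connections.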
3. Move Compute Closer to Data
Instead of:
- App → DB → App → Cache
Try:
- App → Cache (with precomputed data)
Or even push logic into the database when appropriate.
Reducing round-trip queries matters more than optimizing single queries.
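With SQLite from the standard library standing in for a remote database, the round-trip argument looks like this: one batched statement instead of N separate calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")

rows = [(i, f"event-{i}") for i in range(1000)]

# One batched call instead of 1000 individual INSERT round trips.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # -> 1000
```

Against a remote database, each avoided statement is an avoided network round trip, which is why batching usually beats micro-optimizing individual queries.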
4. Introduce Smart Caching Layers
Not all caching is equal.
- Edge caching reduces geographic latency
- Application caching reduces compute latency
- Database caching reduces I/O latency
A practical approach:
- Cache read-heavy endpoints with TTL
- Use write-through or write-behind strategies
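A minimal TTL cache for a read-heavy endpoint might look like the sketch below; production caches also need eviction limits and stampede protection, which are omitted here:

```python
import time

class TTLCache:
    """Cache entries for ttl_s seconds, then recompute on next access."""

    def __init__(self, ttl_s: float):
        self._ttl = ttl_s
        self._store = {}  # key -> (value, expiry timestamp)

    def get_or_compute(self, key, compute):
        value, expires = self._store.get(key, (None, 0.0))
        if time.monotonic() < expires:
            return value                      # fresh hit: skip the work
        value = compute()                     # miss or expired: recompute
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

calls = 0
def expensive():
    global calls
    calls += 1
    return "payload"

cache = TTLCache(ttl_s=60)
for _ in range(50):
    cache.get_or_compute("/api/products", expensive)

print(calls)  # -> 1: forty-nine requests never touched the backend
```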
5. Measure Tail Latency, Not Averages
Averages lie.
Track:
- P50
- P95
- P99
If your P99 is 10x your P50, you have a systemic issue.
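The standard library can compute these percentiles directly. A sketch with synthetic samples (the bimodal shape is made up to mimic queue-driven tail latency):

```python
import statistics

# Synthetic latencies in ms: most requests fast, a few stuck in a queue.
samples = [20.0] * 950 + [400.0] * 50

# quantiles(n=100) returns the 99 cut points P1..P99.
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# P50 stays at 20ms while P99 lands in the slow mode: the average hides this.
print(p50, p95, p99)
```

This is exactly the pattern averages hide: the mean here is 39ms, which describes almost no actual request.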
This is where observability tools like Honeycomb, Datadog, or OpenTelemetry shine. They let you trace individual slow requests across services.
A Quick Comparison of Latency Optimization Strategies
| Strategy | Impact Level | Complexity | Best Use Case |
|---|---|---|---|
| Connection pooling | High | Low | API-heavy services |
| Caching (edge/app) | Very High | Medium | Read-heavy workloads |
| Service consolidation | High | High | Over-fragmented architectures |
| Protocol optimization | Medium | Medium | High-throughput systems |
| Geographic distribution | Medium | High | Global user bases |
What’s Still Hard and Uncertain
Even with all this, latency remains tricky.
- Cloud networks are opaque. You do not control routing paths.
- Multi-region consistency introduces tradeoffs between latency and correctness.
- Serverless adds cold start latency, which is still unpredictable.
And perhaps most importantly, user perception is nonlinear. A jump from 50ms to 100ms is barely noticeable. A jump from 300ms to 600ms feels broken.
No one has perfectly solved this. The best teams continuously measure, adapt, and simplify.
FAQ: Practical Questions Engineers Ask
What is “good” latency for modern systems?
It depends on the use case. APIs aim for sub-100ms. User-facing apps target under 200ms for responsiveness.
Is latency more important than throughput?
For user experience, yes. A fast system that handles fewer requests often feels better than a slow, high-throughput one.
Should you always use microservices?
Not necessarily. If latency is critical, fewer service boundaries often win.
How do CDNs help with latency?
They move content closer to users, reducing geographic distance and round-trip time.
Honest Takeaway
Latency is not something you fix once. It is something you manage continuously.
The biggest mistake you can make is treating it as a network metric. It is an architectural property. Every service boundary, every retry, every serialization format decision contributes to it.
If you take one idea from this, make it this: map your request path and assign a latency budget to every hop. That single exercise forces clarity. And clarity is what keeps distributed systems fast.

