The Hidden Bottlenecks That Break Microservices in Production

Microservices don’t fail because of one bottleneck. Small latencies, retries, and poor boundaries compound under load and cause cascading failures.

Anant Agarwal

May. 15, 26 · Analysis

Most microservice systems don’t fail because they lack scalability. They fail because they were never designed to behave correctly under high load and stress.

A very common pattern for applications built on a microservices architecture is this: everything runs normally for a long time. The architecture looks clean, services appear healthy, CI/CD tests are green, and monitoring dashboards do not raise alarms.

Then suddenly things start to go off the rails. Latency creeps into request-response paths, incidents become a nightmare to handle, and scaling efforts do not seem to help. You throw more instances at each microservice, but latency still does not improve. Usually, the suspected bottlenecks are not the ones actually causing the problem.

Most teams try to solve the problem the wrong way: more pods, bigger instance sizes, more memory, more CPU. But capacity is not the core issue. The real problem is how these microservices interact when they are under load and not performing as expected.

A Typical Request Flow for an E-Commerce Application

Client -> API Gateway -> Order Service -> Payment Service -> Inventory Service -> Database

Each arrow is a real network call over HTTP. Each call adds latency and can fail or queue under heavy load. Median latency dashboards often look fine, but the median hides the slow requests that users actually notice. At p95 or p99, the story looks completely different.
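To make the gap between the median and the tail concrete, here is a minimal sketch that summarizes a set of latency samples. The sample values are invented for illustration, and `latency_summary` is a hypothetical helper name, not part of any monitoring tool.

```python
import statistics

def latency_summary(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical samples: mostly fast requests with a small slow tail.
samples = [10.0] * 90 + [80.0] * 7 + [400.0] * 3
summary = latency_summary(samples)
print(summary)
```

A dashboard that plots only `p50` for this data would show a healthy service, even though a few requests in every hundred take many times longer.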

Consider simple pseudocode for the order handler:

    Python

    def create_order(request):
        # Each call blocks until the downstream service responds.
        payment = call_payment_service(request.payment_details)
        check_inventory = call_inventory_service(request.items)

        if payment.success and check_inventory.success:
            return persist_order(request)
        else:
            return error_response()

It looks straightforward, but the calls are sequential: each one waits for a response before the next begins. At low traffic, an 8ms payment call and a 12ms inventory call barely register. Under load, those numbers do not stay at 8 and 12.

Each downstream dependency is part of your total response time, whether you account for it or not. When the inventory service slows during a flash sale, your order service slows down too. When the order service slows, your gateway starts queuing requests. Scaling the order service at that point will not help. You are just creating more concurrent callers that hammer a downstream bottleneck.

The Dependency Chain Problem

Even though individual service latency may look harmless, it creates a compounding effect and changes the whole picture.

At median load, a five-service chain might look like this:

    Plain Text
   
   5ms + 8ms + 12ms + 6ms + 10ms = 41ms total

It seems fine. But the same path at p99 looks very different:

    Plain Text
   
   20ms + 50ms + 80ms + 30ms + 60ms = 240ms total

That is before retries come into play. If any service retries a failed request, you multiply the load on the downstream service. That downstream gets slower and slower, causes more timeouts, and then causes more retries. This can take a system from slightly degraded to completely down in under a minute.
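The multiplication is easy to underestimate when retries are stacked at several layers of the chain. A rough worst-case sketch, assuming every layer retries independently and every attempt fails (the function name is hypothetical):

```python
def worst_case_attempts(layers, max_retries):
    """Upper bound on requests reaching the deepest service for one client call.

    Each layer makes 1 original attempt plus up to max_retries retries,
    and each of those attempts triggers the full retry budget below it.
    """
    return (1 + max_retries) ** layers

for layers in (1, 2, 3):
    print(layers, "retrying layer(s):", worst_case_attempts(layers, 3), "attempts")
```

This is an upper bound rather than a prediction, but it shows why a modest retry count at each hop can still overwhelm the service at the bottom of the chain.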

The fix is not about eliminating retries. They are needed. The fix is making them controlled using exponential backoff with jitter:

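    Java

    // Illustrative builder; exact method names depend on the retry library in use.
    RetryPolicy policy = RetryPolicy.builder()
        .maxRetries(3)                          // give up after three retries
        .backoff(100, 1000, ChronoUnit.MILLIS)  // delay grows from 100ms, capped at 1s
        .jitter(0.2)                            // randomize each delay by about 20%
        .build();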

The jitter component is easy to skip, but it actually matters a lot. Without it, all the clients that time out at the same moment retry at the same moment, recreating the spike. Jitter spaces those retries out.
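To show what such a policy computes, here is a small library-free sketch that mirrors the parameters above (3 retries, 100ms base, 1000ms cap, 0.2 jitter). The function name and the choice of symmetric jitter are assumptions for illustration; real retry libraries may apply jitter differently.

```python
import random

def backoff_delays(max_retries=3, base_ms=100, cap_ms=1000, jitter=0.2, rng=random):
    """Return the delay in milliseconds to wait before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        # Exponential growth: base, 2*base, 4*base, ... capped at cap_ms.
        nominal = min(cap_ms, base_ms * (2 ** attempt))
        # Spread each delay uniformly within +/- jitter of its nominal value.
        factor = 1 + rng.uniform(-jitter, jitter)
        delays.append(nominal * factor)
    return delays

print(backoff_delays())
```

Because each client draws its own random factor, clients that failed at the same instant end up retrying at slightly different times.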

When Service Boundaries Start To Work Against You

There is a phase in many microservice projects where splitting things up feels right. The monolith becomes unmanageable, you pull services out, and everything feels cleaner for a while. Problems start to surface later.

A common example is pricing logic. In a product catalog, you might find:

    Plain Text
   
   Order Service -> Pricing Service -> Discount Service -> Tax Service -> Currency Service

That is a four-hop dependency chain for calculating an order total. Each hop is a network call. Total latency for something conceptually simple has now ballooned.

The issue is not that services are small. It is that the boundaries do not reflect how the domain actually works. Pricing, discounts, tax, and currency are not four independent concerns. They are one concern: calculating what something costs. When you split logic that is tightly coupled in reality into separate services, you end up with services that cannot function independently.

A more practical approach looks like this:

    Plain Text

    Order Domain Service
    ├── pricing engine
    ├── discount rules
    ├── tax calculation
    └── currency conversion

That is not a monolith. It just stops pretending that pricing and discounts need to talk to each other over the network.
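As a rough illustration of what collapsing those hops looks like, the whole calculation can be a handful of in-process function steps. The order of operations (discount, then tax, then currency conversion) and all the rates below are assumptions made for the example, not rules from any real pricing domain.

```python
from decimal import Decimal, ROUND_HALF_UP

def order_total(unit_price, quantity, discount_rate, tax_rate, fx_rate):
    """Compute an order total in one process: price, discount, tax, currency."""
    subtotal = Decimal(unit_price) * quantity
    discounted = subtotal * (Decimal(1) - Decimal(discount_rate))
    taxed = discounted * (Decimal(1) + Decimal(tax_rate))
    converted = taxed * Decimal(fx_rate)
    # Round once at the end to two decimal places.
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

total = order_total("20.00", 3, "0.10", "0.08", "0.90")
print(total)
```

Four network round trips become four multiplications, and there is no partial failure to handle between them.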

Stateless Services

Horizontal scaling is straightforward when a service is stateless. Any instance can handle any request. You add instances under load and remove them when traffic drops.

The moment a service stores something locally, scaling becomes complicated. Session data in instance memory means a request landing on a different instance can fail or produce the wrong result. You need sticky sessions, which turn your load balancer into a stateful component. That creates a new bottleneck and a new failure mode.

A better approach is to externalize application state:

    Plain Text

    Service Instance A  ─┐
    Service Instance B  ─┼──> Redis / Ignite cache
    Service Instance C  ─┘

A distributed cache gives every instance access to shared state. The service stays stateless. You can spin up ten instances during peak traffic and wind them down later without coordination.

What teams often get wrong is treating the shared cache as an implementation detail. If Service A writes to a cache key that Service B reads, you now have a dependency outside any API contract. That turns into a debugging nightmare later.
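A minimal sketch of both ideas: instances hold no state of their own, and the cache keys are namespaced and touched only by the service that owns them. `SharedStore` is an in-memory stand-in for an external store such as Redis, and `CartService` is a hypothetical example service.

```python
class SharedStore:
    """In-memory stand-in for an external store such as Redis (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class CartService:
    """Stateless instance: all cart state lives in the shared store."""

    KEY_PREFIX = "cart-service:cart:"  # keys under this prefix belong to this service

    def __init__(self, store):
        self._store = store

    def add_item(self, session_id, item):
        key = self.KEY_PREFIX + session_id
        items = list(self._store.get(key, []))
        items.append(item)
        self._store.set(key, items)
        return items

    def get_cart(self, session_id):
        return list(self._store.get(self.KEY_PREFIX + session_id, []))


store = SharedStore()
instance_a = CartService(store)
instance_b = CartService(store)
instance_a.add_item("session-1", "sku-123")
# A different instance serves the next request and sees the same state.
cart_seen_by_b = instance_b.get_cart("session-1")
```

Other services that need cart data would call the cart service's API rather than reading its keys. The read-modify-write in `add_item` is not atomic, so a real store would need an atomic append or a transaction.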

The Shared Database Trap

Even when services are split at the application layer, sharing a database recreates the same coupling.

    SQL

    -- Both OrderService and ReportingService query this
    SELECT * FROM orders WHERE user_id = ?;

If reporting runs an expensive aggregation while the order service is processing heavy traffic, they compete for the same resources. What looks like an order service latency problem turns out to be a reporting query without the proper index. You end up tuning the wrong thing for a long time.

Each service should own its data store. If two services need the same data, the better answer is usually one service publishing events that the other consumes asynchronously.
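One way to picture the event-based alternative: the reporting side keeps its own read model and updates it from order events, so its aggregations never touch the orders table. The class name and the `userId` field are assumptions for this sketch, and a plain list stands in for the message broker.

```python
from collections import defaultdict

class ReportingReadModel:
    """Reporting's own store, built from events instead of querying the orders table."""

    def __init__(self):
        self.order_count_by_user = defaultdict(int)
        self._seen_order_ids = set()

    def handle(self, event):
        # Ignore unrelated events and duplicate deliveries.
        if event.get("event") != "order_created":
            return
        if event["orderId"] in self._seen_order_ids:
            return
        self._seen_order_ids.add(event["orderId"])
        self.order_count_by_user[event["userId"]] += 1


reporting = ReportingReadModel()
incoming = [
    {"event": "order_created", "orderId": "8821", "userId": "u1"},
    {"event": "order_created", "orderId": "8822", "userId": "u1"},
    {"event": "order_created", "orderId": "8821", "userId": "u1"},  # redelivered
]
for event in incoming:
    reporting.handle(event)
```

The duplicate check matters because most brokers can deliver a message more than once, so consumers should be idempotent.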

Back Pressure

Back pressure is an underutilized idea. A service that is close to capacity should tell its callers to slow down rather than silently queue work until it collapses.

Without back pressure, a slow downstream accepts requests until its thread pool is exhausted. Latency starts to spike, callers pile up, and the effect cascades upstream.

With back pressure, the downstream returns a 429 (Too Many Requests) early, before it saturates. Callers can shed load or fail fast. The system degrades gracefully rather than collapsing.
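A minimal sketch of that behavior, assuming a fixed cap on in-flight requests is an acceptable proxy for "close to capacity". `LoadShedder` is a hypothetical name, and real services often use queue depth or latency as the signal instead.

```python
import threading

class LoadShedder:
    """Reject work once a fixed number of requests are already in flight."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: if no slot is free, refuse instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 429, "Too Many Requests"
        try:
            return 200, work()
        finally:
            self._slots.release()


shedder = LoadShedder(max_in_flight=100)
print(shedder.handle(lambda: "order accepted"))
```

The key design choice is that a rejected request costs almost nothing, whereas a queued request holds memory and a connection while it waits.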

Circuit breakers complement this well:

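    Java

    // Illustrative builder; exact class and method names depend on the library in use.
    CircuitBreaker breaker = CircuitBreaker.builder()
        .failureRateThreshold(50)                         // open at a 50% failure rate
        .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open for 30 seconds
        .slidingWindowSize(10)                            // measured over the last 10 calls
        .build();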

This prevents a struggling service from draining the thread pools of everything calling it. The circuit opens, callers fall back, and the downstream gets breathing room.
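For readers who want to see the mechanics, here is a simplified count-based breaker using the same three settings. It is a sketch, not a replacement for a library: it has no half-open state with limited trial calls, and `SimpleCircuitBreaker` is a hypothetical class.

```python
import time
from collections import deque

class SimpleCircuitBreaker:
    """Count-based breaker: opens when the recent failure rate crosses a threshold."""

    def __init__(self, failure_rate_threshold=50, wait_seconds=30,
                 window_size=10, clock=time.monotonic):
        self.threshold = failure_rate_threshold
        self.wait_seconds = wait_seconds
        self.window = deque(maxlen=window_size)  # True means the call failed
        self.clock = clock
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.wait_seconds:
                return fallback()      # open: fail fast, skip the downstream
            self.opened_at = None      # wait elapsed: allow calls again
            self.window.clear()
        try:
            result = func()
        except Exception:
            self._record(failed=True)
            return fallback()
        self._record(failed=False)
        return result

    def _record(self, failed):
        self.window.append(failed)
        window_full = len(self.window) == self.window.maxlen
        failure_rate = 100 * sum(self.window) / len(self.window)
        if window_full and failure_rate >= self.threshold:
            self.opened_at = self.clock()
```

While the breaker is open, `func` is never invoked, which is what gives the downstream its breathing room.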

Async for Non-Critical Work

Not everything needs to be completed before the user gets a response.

Order emails, audit logs, and analytics do not need to block the user-facing path. Moving this work off the synchronous path takes it out of the request's latency budget.

    JSON

    {
      "event": "order_created",
      "orderId": "8821",
      "traceId": "AAA-BBB-CCC-DDD"
    }

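Publishing is cheap compared with the work it replaces: handing an event to a message broker typically takes far less time than sending the email or writing the audit record inline. Downstream consumers handle it at their own pace. If the email service is slow, order creation is not affected.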
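The shape of this in code, using an in-process queue and a worker thread as stand-ins for a real broker and a separate email service (both are assumptions made to keep the sketch self-contained):

```python
import queue
import threading

order_events = queue.Queue()
sent_emails = []

def email_worker():
    """Consume events at its own pace, off the request path."""
    while True:
        event = order_events.get()
        if event is None:  # sentinel: stop the worker
            order_events.task_done()
            break
        sent_emails.append(f"confirmation for order {event['orderId']}")
        order_events.task_done()

def create_order(order_id, trace_id):
    # Persisting the order would happen here; then publish and return at once.
    order_events.put(
        {"event": "order_created", "orderId": order_id, "traceId": trace_id}
    )
    return {"status": "created", "orderId": order_id}

worker = threading.Thread(target=email_worker, daemon=True)
worker.start()
response = create_order("8821", "AAA-BBB-CCC-DDD")
order_events.join()  # only for the demo: wait until the consumer has caught up
```

`create_order` returns as soon as the event is enqueued. How long the email takes no longer affects the response time.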

Correlation Across Boundaries

One thing matters a lot here: correlation. Every event needs the same trace ID that the originating request used. Without that, you cannot follow a transaction when something goes wrong.

Debugging a problem that spans five services without a shared request ID is painful. You have logs, timestamps, and manual guesswork while hoping clocks are in sync.

The fix is trivial: establish a correlation ID at entry and propagate it everywhere.

    Python

    import uuid
    from functools import wraps

    def with_correlation_id(func):
        @wraps(func)
        def wrapper(request, *args, **kwargs):
            # Reuse the caller's ID if present; otherwise start a new one.
            correlation_id = (
                request.headers.get("X-Correlation-ID")
                or str(uuid.uuid4())
            )
            request.state.correlation_id = correlation_id
            return func(request, *args, **kwargs)
        return wrapper

Every outbound call includes this ID. Every log entry includes it. When an incident happens at 3 a.m., finding all log lines for a failing request becomes a single query instead of an hour of grep.
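One possible way to get the ID into every log line and outbound call without threading it through each function signature is a context variable plus a logging filter. The names `correlation_id_var`, `CorrelationFilter`, and `outbound_headers` are hypothetical; only the standard library is used.

```python
import contextvars
import logging
import uuid

correlation_id_var = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True

def outbound_headers(extra=None):
    """Headers for a downstream call, always carrying the correlation ID."""
    headers = dict(extra or {})
    headers["X-Correlation-ID"] = correlation_id_var.get()
    return headers

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(
    logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# At request entry, set the ID once; everything below picks it up.
correlation_id_var.set(str(uuid.uuid4()))
logger.info("order received")
```

Context variables are isolated per thread and per asyncio task, so concurrent requests do not overwrite each other's IDs.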

Distributed Tracing

Log-based correlation works for linear flows, but it gets harder as call graphs become more complex. Distributed tracing gives you a visual way to see where the time is actually spent:

    Plain Text

    Order Request (241ms total)
    ├── Auth (12ms)
    └── Order handler (229ms)
        ├── Pricing (8ms)
        ├── Payment (180ms)   <-- here
        │   ├── Fraud (40ms)
        │   └── Gateway (140ms)
        ├── Inventory (22ms)
        └── Serialization (9ms)

Without tracing, you know the order took 241ms. With distributed tracing, you know 140ms was spent inside the payment gateway. Teams without this visibility often optimize the wrong service.

Jaeger, Zipkin, and OpenTelemetry give you this with relatively little overhead. OpenTelemetry has become the standard because it works across backends and languages.
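Tracing backends do this analysis for you, but the underlying idea is simple enough to sketch. Taking the Order handler subtree from the trace above as plain nested dictionaries (a made-up representation, not the data model of any tracing tool), the slowest leaf span can be found with a short recursive walk:

```python
def slowest_leaf(span):
    """Return (name, duration_ms) of the slowest leaf span in a trace tree."""
    children = span.get("children", [])
    if not children:
        return span["name"], span["ms"]
    return max((slowest_leaf(child) for child in children), key=lambda leaf: leaf[1])

order_handler = {
    "name": "Order handler", "ms": 229, "children": [
        {"name": "Pricing", "ms": 8},
        {"name": "Payment", "ms": 180, "children": [
            {"name": "Fraud", "ms": 40},
            {"name": "Gateway", "ms": 140},
        ]},
        {"name": "Inventory", "ms": 22},
    ],
}
print(slowest_leaf(order_handler))
```

Looking at leaves rather than parents is what points at the payment gateway instead of the order handler that merely waits on it.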

Chaos Testing

Most development environments are too predictable. Services start fast, calls succeed, and resources are available. Production is nothing like that.

Chaos testing introduces those conditions deliberately so you can see how the system responds before a real incident. A simple starting point is latency injection:

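    YAML

    # Istio fault injection
    # The hosts and route entries assume the service is addressed as "payment-service".
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: payment-service
    spec:
      hosts:
      - payment-service
      http:
      - fault:
          delay:
            percentage:
              value: 25
            fixedDelay: 300ms
        route:
        - destination:
            host: payment-service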
  

This adds 300ms of delay to 25% of payment calls.

What you are looking for is how the rest of the services react. Does order service latency climb? Do circuit breakers open? Does degradation spread?

The failures you find this way are almost always things you would not have anticipated from reading the code. Thread pools that seemed generous turn out to be undersized under retry pressure. Timeouts set to 30 seconds cause requests to queue long after the user has given up and retried.
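Where a service mesh is not available, a similar experiment can be approximated in application code or in a test harness. The wrapper below is a hypothetical sketch that delays a fraction of calls; the defaults mirror the 25% and 300ms used above.

```python
import random
import time

def with_injected_latency(func, probability=0.25, delay_seconds=0.3,
                          rng=random, sleep=time.sleep):
    """Wrap func so that a fraction of calls are delayed before running."""
    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapper

def call_payment_service(amount):
    # Placeholder for a real client call.
    return {"charged": amount}

slow_payment = with_injected_latency(call_payment_service)
```

The `rng` and `sleep` parameters exist so that tests can make the behavior deterministic and avoid real waiting.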

Chaos Monkey goes further by randomly terminating instances during business hours. It sounds aggressive, but the point is simple: if your system cannot handle an unexpected termination during normal traffic, it definitely will not handle one well at 3 a.m.

Standardizing Behavior

One pattern that rarely gets mentioned as a scaling concern is inconsistency. When every service handles timeouts differently, the system becomes much harder to reason about.

If Service A allows 30 seconds for a call to Service B, but Service B’s own downstream calls time out after 5 seconds, then unless B fails fast and propagates that error, A can spend up to 25 more seconds waiting for a result that will never come. Those threads are tied up. At scale, that becomes a meaningful chunk of your pool doing nothing.
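One way to keep timeouts aligned across a chain is to derive each downstream timeout from the time the original caller has left. A minimal sketch, where the `Deadline` class is hypothetical and not taken from any library:

```python
import time

class Deadline:
    """Tracks how much of a request's time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, max_call_seconds):
        """Timeout for a downstream call: never longer than the budget left."""
        return min(max_call_seconds, self.remaining())


deadline = Deadline(budget_seconds=2.0)
payment_timeout = deadline.timeout_for_call(max_call_seconds=5.0)
```

With this approach a downstream call is never given more time than the upstream caller is still willing to wait, so no thread waits on work whose result will be discarded.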

Timeouts should be set based on what callers actually need:

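    Java

    HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMillis(500))   // limit on establishing the connection
        .build();

    // The URI is a placeholder; HttpRequest.Builder requires one before build().
    HttpRequest request = HttpRequest.newBuilder(URI.create("http://payment-service/payments"))
        .timeout(Duration.ofMillis(2000))         // limit on waiting for the response
        .header("X-Correlation-ID", correlationId)
        .GET()
        .build();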
  

The same applies to health checks. A health check that only returns 200 OK tells you almost nothing about whether the service can actually handle traffic. A meaningful health check verifies dependencies:

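    Python

    # Assuming FastAPI, which the decorator and JSONResponse suggest.
    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    app = FastAPI()

    # check_database, check_cache, and check_payment_service are assumed to be
    # async helpers that each return a dict such as {"status": "ok"}.
    @app.get("/health")
    async def health_check():
        checks = {
            "database": await check_database(),
            "cache": await check_cache(),
            "payment_service": await check_payment_service()
        }

        all_healthy = all(c["status"] == "ok" for c in checks.values())

        return JSONResponse(
            content={"status": "ok" if all_healthy else "degraded", "checks": checks},
            status_code=200 if all_healthy else 503
        )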
  

When the load balancer gets a 503, it stops sending traffic to that instance. Without that, it will happily keep routing traffic to an instance that cannot reach its database.
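A health endpoint that checks dependencies should not hang when one of them does. The helper below is a sketch of how an individual check might be bounded by a timeout; `run_check` and the probe functions are hypothetical names, and the returned dictionaries follow the `{"status": ...}` shape used above.

```python
import asyncio

async def run_check(probe, timeout_seconds=1.0):
    """Run one dependency probe, treating slowness or errors as unhealthy."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout_seconds)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return {"status": "timeout"}
    except Exception as exc:
        return {"status": "error", "detail": str(exc)}

async def ping_database():
    # Placeholder for a real query such as SELECT 1.
    await asyncio.sleep(0)

print(asyncio.run(run_check(ping_database)))
```

Keeping the timeout shorter than the load balancer's own health check timeout means the endpoint reports a slow dependency itself instead of simply timing out.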

What Scales

After enough time working on distributed systems, the patterns become consistent. The systems that hold up under load are not necessarily the most impressive architecturally. They are the ones where a few important things were done consistently and early during the design phase.

Service boundaries are drawn around business domains, not around technical layers. Pricing and discounts live together because they belong together in reality.

Services should stay stateless. Local state creates invisible coupling. Externalizing the state adds a dependency, but it makes the system predictable when you scale.

Communication between services is minimized. Every call is both a latency point and a failure point. Calls that can be collapsed, cached, or made async should be.

Behavior is consistent. Retries, timeouts, error formats, and correlation IDs are standardized. When an incident happens, teams can follow a request without needing to know the conventions of each individual service.

Failures are tested regularly with realistic traffic. The difference between graceful degradation and collapse is often whether the team has already seen what happens when things go wrong.

None of this is exotic. What separates systems that scale from systems that do not is mostly discipline: applying these ideas before they become urgent.

Closing Thoughts

Microservices do not necessarily make things simpler. They move complexity from inside services to the space between them. That complexity is manageable, but only if you take these principles and standards seriously early.

Teams that end up in trouble usually are not the ones that made obviously bad architectural decisions. They are the ones who built services that worked fine alone and then discovered that the hard part was not the services themselves but how they behave together when traffic is real and something is slow.

Follow requests across boundaries. Understand where time is spent. Make failure observable. The rest becomes easier.

Opinions expressed by DZone contributors are their own.