Performance Resources

DZone's Featured Performance Resources

Cloud Cost Optimization Was Hard; AI Cost Optimization Will Be Worse.

By Raghava Dittakavi


For the last decade, cloud cost optimization has been one of the most painful disciplines in enterprise technology. Every CTO, CIO, Head of Engineering, platform leader, and FinOps team knows the story. The cloud made infrastructure faster, more flexible, and more scalable. But it also created a new problem: spending became easy and often went unnoticed.

- An engineer could launch compute in minutes.
- A team could overprovision storage without realizing it.
- A forgotten environment could quietly burn money for months.
- A poorly tagged workload could make cost ownership almost impossible to identify.

That was the first era of cloud financial discipline. We learned to manage it through rightsizing, tagging, reserved instances, savings plans, autoscaling, storage lifecycle policies, unit economics, chargeback, showback, and FinOps governance. It was difficult. But compared to AI, traditional cloud cost optimization may look simple. AI is introducing a new cost model that most enterprises are not ready for. And the companies that fail to understand this early will not just overspend. They will struggle to prove AI ROI.

The Cloud Cost Problem Was Mostly Infrastructure Visibility

Traditional cloud cost problems were usually tied to infrastructure waste:

- Oversized compute
- Idle resources
- Unused storage
- Over-retention of logs
- Poor environment hygiene
- Lack of ownership
- Weak forecasting
- No accountability between engineering and finance

These problems were hard, but they were measurable (and with the right discipline, they are solvable; I have seen the benefits personally).

- You could look at CPU utilization.
- You could identify unattached volumes.
- You could review storage growth.
- You could analyze I/O patterns.
- You could map spend to teams, products, environments, and customers.

Cloud costs were complex, but at least the cost drivers were relatively visible. AI changes that. AI cost is not just infrastructure cost.
- It is usage cost.
- It is token cost.
- It is GPU cost.
- It is data cost.
- It is experimentation cost.
- It is model-selection cost.
- It is agent-loop cost.
- It is observability cost.
- It is governance cost.
- It is the cost of mistakes made by systems that can now act, not just respond.

That is a very different engineering-to-financial problem.

The AI Cost Curve Will Surprise Many Enterprises

The FinOps Foundation’s 2026 State of FinOps research shows how quickly this shift is happening: 98% of surveyed organizations now manage AI spend, up from 31% two years earlier, and AI cost management is now the number-one skill set FinOps teams need to develop. That is the beginning of a new operating discipline.

Gartner has also forecast that worldwide AI spending will reach $2.5 trillion in 2026, with AI-optimized servers growing sharply as enterprises and technology providers build the foundation for AI adoption. McKinsey has estimated that the AI data center buildout alone could require $5.2 trillion in investment by 2030 to meet projected demand.

These numbers matter because they point to a simple reality: AI is not just a software feature. AI is becoming an infrastructure economy, and every infrastructure economy eventually faces a cost discipline problem.

Why AI Cost Optimization Is Harder Than Cloud Cost Optimization

Cloud cost optimization was mostly about resource efficiency. AI cost optimization is about decision efficiency. That distinction matters.

In traditional cloud, the question was: “Are we using the right amount of infrastructure for this workload?”

In AI, the question becomes: “Are we using the right model, with the right context, for the right task, at the right level of reasoning, with the right data, at the right cost, for the right business outcome?”

That is much harder.
A simple AI feature can create hidden cost multipliers:

- A long prompt increases input tokens.
- A long answer increases output tokens.
- A large context window increases cost.
- A reasoning model may consume more compute.
- An agent may call multiple tools.
- A failed agent may retry repeatedly.
- A RAG workflow may increase vector database and storage costs.
- A poorly designed workflow may call a premium model when a smaller model would work.
- A high-volume internal assistant may become expensive before anyone connects usage to business value.

This is where many organizations will get hurt: not because AI does not work, but because AI works just well enough to spread quickly before the cost model is mature.

The Real Risk Is Not AI Spend. It Is Unmeasured AI Spend.

Spending money on AI is not the problem; unmeasured AI spend is. A company can justify a high AI bill if it clearly improves revenue, productivity, compliance, reliability, customer experience, or engineering velocity, but many organizations will not have that clarity. They will know the invoice. They will not know the value. That is dangerous.

The next generation of AI governance cannot stop at model safety and data privacy. It must include economic governance. Every serious enterprise AI platform will need answers to questions like:

- Which team is consuming the most AI spend?
- Which product feature is driving the most token usage?
- Which customers are creating the highest AI cost-to-serve?
- Which prompts are inefficient?
- Which agents are looping?
- Which models are overpowered for the task?
- Which workflows should use caching?
- Which workloads need premium models, and which can use smaller models?
- Which AI use cases are producing measurable business value?

Without this visibility, AI becomes another uncontrolled cloud bill, only faster, more abstract, and harder to explain.

The New Discipline: AI FinOps

Cloud FinOps brought engineering, finance, and business teams together to manage cloud value. AI FinOps will need to go further.
It must connect four layers:

- Infrastructure economics. GPU usage, compute utilization, storage, networking, inference endpoints, vector databases, model hosting, and cloud-native scaling.
- Token economics. Input tokens, output tokens, context windows, prompt size, reasoning depth, retry behavior, and agentic tool calls.
- Application economics. Cost per workflow, cost per customer, cost per ticket, cost per deployment, cost per document processed, cost per support case, or cost per transaction.
- Business economics. Revenue impact, productivity gain, risk reduction, cycle-time reduction, customer experience improvement, and operational leverage.

The companies that master AI FinOps will not be the ones that simply reduce AI spend. They will be the ones that understand which AI spend deserves to grow. That is the maturity shift. Cost optimization should not mean “spend less.” It should mean “spend intelligently.”

The Mistake: Treating AI Cost Like a Vendor Invoice Problem

Many companies will initially treat AI cost management as a procurement problem. They will negotiate model pricing. They will compare vendors. They will look for cheaper tokens. They will cap usage. They will ask finance to control the bill. That will help, but it will not be enough.

The biggest AI cost decisions are not made in procurement, but in architecture. They are made when engineering teams decide:

- Which model to use
- How much context to send
- Whether to cache responses
- How agents should retry
- How much history to include
- How retrieval should work
- How evaluation should gate changes
- How observability should track usage
- How workflows should fail safely

AWS’s Generative AI Lens also frames cost optimization as an architectural discipline, not just a billing exercise. This is the correct direction. AI cost optimization must shift left. It has to be designed into the platform.
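To make the token-economics layer concrete, here is a minimal sketch of how context size, call count, and retries multiply the cost of a single workflow run. The prices, token counts, and the names `ModelPrice`, `call_cost`, and `workflow_cost` are all hypothetical and chosen only for illustration; real pricing differs by vendor and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical per-1K-token prices in USD (not real vendor pricing)."""
    input_per_1k: float
    output_per_1k: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: input and output tokens are billed separately."""
    return (input_tokens / 1000) * price.input_per_1k + (
        output_tokens / 1000
    ) * price.output_per_1k


def workflow_cost(
    price: ModelPrice,
    input_tokens: int,
    output_tokens: int,
    calls_per_run: int = 1,
    retry_rate: float = 0.0,
) -> float:
    """Expected cost of one workflow run.

    calls_per_run: model calls the workflow (or agent loop) makes per run.
    retry_rate: average fraction of calls that are repeated once.
    """
    expected_calls = calls_per_run * (1 + retry_rate)
    return expected_calls * call_cost(price, input_tokens, output_tokens)


# Illustrative prices only.
premium = ModelPrice(input_per_1k=0.01, output_per_1k=0.03)
small = ModelPrice(input_per_1k=0.001, output_per_1k=0.002)

# A single short prompt, an agent loop with a large context, and the same
# loop routed to a smaller model.
single = workflow_cost(premium, input_tokens=2000, output_tokens=500)
agentic = workflow_cost(premium, 8000, 500, calls_per_run=6, retry_rate=0.5)
routed = workflow_cost(small, 8000, 500, calls_per_run=6, retry_rate=0.5)

print(f"single call:  ${single:.3f}")
print(f"agent loop:   ${agentic:.3f}")
print(f"routed small: ${routed:.3f}")
```

The point of the sketch is that none of the inputs is a procurement variable: context size, calls per run, retry behavior, and model choice are all set in the architecture.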
The Next Executive Question

For years, executives asked: “What is our cloud spend?” Then the better question became: “What is our cloud spend per product, customer, environment, and business outcome?” Now AI forces a new question: “What is our AI cost per decision, per workflow, per customer, and per unit of business value?”

This question will separate mature AI organizations from experimental ones, because AI adoption without cost intelligence is not transformation. It is uncontrolled automation.

What Leaders Should Do Now

Enterprises do not need to slow down AI adoption, but they do need to stop pretending AI cost can be managed later. The right move is to build the financial control plane early. Start with five actions:

1. Tag and attribute AI usage from day one. Every AI call should be connected to a team, product, environment, use case, and business owner.
2. Measure unit economics. Do not only track total AI spend. Track cost per workflow, per user, per transaction, per ticket, and per successful outcome.
3. Create model-routing standards. Not every task needs the most powerful model. A mature platform should route work across premium models, smaller models, open-source models, cached responses, and deterministic automation.
4. Monitor agent behavior. Agentic systems need cost guardrails. Tool calls, retries, loops, memory usage, and context expansion must be observable.
5. Connect AI spend to business value. If a use case cannot show measurable value, it should not receive unlimited scale.

This is not about slowing innovation. It is about preventing AI from becoming the next uncontrolled infrastructure wave.

The Future Belongs to Economically Intelligent AI Platforms

The first era of cloud rewarded companies that could move fast. The second era rewarded companies that could move fast and control cost. The AI era will reward companies that can move fast, control cost, measure value, and govern autonomous systems. That is a much higher bar.
The winners will not be the companies with the most AI pilots. They will be the companies with the strongest AI operating model.

- They will know what to automate.
- They will know what not to automate.
- They will know which models to use.
- They will know where the money is going.
- They will know where AI is creating value.
- They will know when AI is simply creating activity.

Cloud cost optimization was hard because cloud made infrastructure consumption easy. AI cost optimization will be worse because AI makes decision consumption easy, and decisions, at enterprise scale, are far more expensive than servers. The next great discipline in technology leadership will be making AI economically sustainable. That is where AI transformation becomes real.
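The first two of the five actions above (attribute every call, then measure unit economics) can be sketched in a few lines. This is an illustrative sketch only: the record fields, team names, and the function `cost_per_successful_outcome` are assumptions, not part of any particular FinOps tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AICallRecord:
    """One attributed AI call (hypothetical schema)."""
    team: str
    use_case: str
    cost_usd: float
    succeeded: bool  # did the workflow reach its intended business outcome?


def cost_per_successful_outcome(records):
    """Aggregate spend per (team, use_case) and divide by successful outcomes.

    Returns None for a group with spend but no successes, which is exactly
    the case a total-spend dashboard would hide.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for record in records:
        key = (record.team, record.use_case)
        totals[key]["cost"] += record.cost_usd
        totals[key]["successes"] += int(record.succeeded)
    return {
        key: (group["cost"] / group["successes"] if group["successes"] else None)
        for key, group in totals.items()
    }


records = [
    AICallRecord("support", "ticket_summary", 0.50, True),
    AICallRecord("support", "ticket_summary", 0.25, False),
    AICallRecord("support", "ticket_summary", 0.25, True),
    AICallRecord("platform", "code_review", 0.75, False),
]
unit_costs = cost_per_successful_outcome(records)
for key, value in unit_costs.items():
    print(key, value)
```

Failed runs still count toward cost, so the unit cost rises as reliability drops. That is the behavior you want from a metric meant to connect spend to value.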

Observability for AI Agents and Multi-Agent Systems: When Your System Can't Tell You Why It Did That

By Pruthvi Raj Seknametla

The bug report arrived as a customer complaint. An AI agent responsible for managing vendor onboarding had sent a rejection email to a supplier the company had been trying to close for three months. Nobody had authorized it. Nobody had configured it to reject vendors in that category. The agent made the decision autonomously after analyzing a compliance document and cross-referencing it with an internal policy database. By the time the complaint arrived, the reasoning chain that produced the decision had been discarded. The agent had no memory of why it did what it did. The logs showed the action but not the thought.

That story is fictional in its specifics but accurate in its structure. It represents a class of problems that teams deploying AI agents in production are encountering with increasing frequency: the agent performed an action, the output is visible, but the intermediate reasoning (the sequence of context retrievals, model calls, tool invocations, and decisions that led to the output) is absent, incomplete, or stored in a format that makes post hoc investigation nearly impossible. Traditional observability was not designed for systems that reason their way to an action.

Why Agent Observability Is Structurally Different

Conventional service observability is built around a relatively stable model: a request enters a system, passes through a defined set of operations, and produces a response. The execution path may be complex, but it's deterministic and bounded. You can instrument each step, correlate the signals with a trace ID, and reconstruct exactly what happened for any given request.

AI agents break this model in at least three ways.

First, the execution path is not determined at design time; it emerges from the agent's reasoning. An agent deciding which tools to call, in what order, based on what it reads in a retrieved document, is making structural decisions at runtime that a static trace can't fully capture.
The spans exist, but the semantic reason a particular branch was taken lives inside a model call that returned natural language, which most tracing systems treat as an opaque blob.

Second, agent systems frequently involve state that persists across requests (memory stores, retrieved context, and conversation history), which means the behavior of the system at time T is partially determined by things that happened at times T-1 through T-n. Debugging a poor decision often requires reconstructing not just the current request but the accumulated state that shaped it. Most observability stacks are not built for this.

Third, multi-agent systems introduce the problem of causal attribution across agent boundaries. When Agent A passes a task to Agent B, which delegates a subtask to Agent C, which calls a tool that returns erroneous data, and that incorrect data propagates back up the chain to produce a wrong output from Agent A, the causal chain is real but fragmented across three separate execution contexts. Without deliberate design, you'll have three separate traces with no shared context that links them.

The Minimum Viable Agent Trace

The starting point for any serious agent observability implementation is defining what the minimum viable trace looks like for a single agent execution. In practice, this means capturing five things that standard OpenTelemetry spans don't cover by default.

The first is the full prompt context: not just the user message but the complete input to each model call, including the system prompt, retrieved documents, tool outputs injected into the context, and the conversation history. This context is verbose and costly to store, but you need it to understand the model's reasoning. Sampling helps here: store full prompt context for a percentage of executions, prioritizing those that result in high-stakes actions or errors.

The second is the model's reasoning output before tool calls.
If your agent framework supports it, capture the chain-of-thought or scratchpad output: the model's intermediate reasoning before it decides to call a tool or produce a final answer. This is the closest thing to a stack trace for a reasoning system. Without it, you can see that a tool was called but not why.

The third is tool-call provenance: for each tool invocation, record not just the inputs and outputs but which part of the reasoning chain triggered it.

The fourth is the agent's decision points: moments where the agent chose between multiple possible actions.

The fifth is cross-agent delegation context: when one agent hands off to another, the receiving agent's trace must carry a reference to the delegating agent's trace ID.

Python

    # Minimal agent span instrumentation using OpenTelemetry.
    # store_prompt_context, call_model, and extract_tool_calls are
    # application-provided helpers that are not shown here.
    import hashlib
    import json

    from opentelemetry import trace

    tracer = trace.get_tracer('agent.core')

    def traced_model_call(agent_id, prompt_context, step_label):
        with tracer.start_as_current_span(f'agent.model_call.{step_label}') as span:
            span.set_attribute('agent.id', agent_id)
            span.set_attribute('agent.step', step_label)

            # Record a hash and length instead of the prompt itself, for cardinality control.
            # The built-in hash() is salted per process for strings, so a SHA-256 digest
            # is used to keep the value comparable across executions.
            serialized = json.dumps(prompt_context, sort_keys=True, default=str)
            prompt_hash = hashlib.sha256(serialized.encode('utf-8')).hexdigest()
            span.set_attribute('agent.prompt_hash', prompt_hash)
            span.set_attribute('agent.prompt_len', len(serialized))

            # Full prompt stored separately in blob storage, keyed by trace+span ID
            span_ctx = span.get_span_context()
            store_prompt_context(
                trace_id=format(span_ctx.trace_id, '032x'),
                span_id=format(span_ctx.span_id, '016x'),
                context=prompt_context,
            )

            response = call_model(prompt_context)
            span.set_attribute('agent.output_len', len(response))
            # Span attributes must be primitives or sequences of primitives,
            # so extract_tool_calls should return e.g. a list of tool names.
            span.set_attribute('agent.tool_calls', extract_tool_calls(response))
            return response

The pattern above separates high-cardinality content (the full prompt) from the trace span itself, storing it in blob storage keyed by trace and span IDs. This keeps the tracing backend manageable while preserving the ability to retrieve full context for any specific execution.
The prompt hash lets you detect when two executions were given identical contexts. That is useful for finding cases where the same input produced different outputs, which is a diagnostic signal in itself.

Multi-Agent Correlation: The Delegation Chain Problem

Here's where things got genuinely complicated in a system I was involved with: we had three agents (a planning agent, a research agent, and a writing agent) that collaborated on generating reports. Each was instrumented individually and produced clean traces. But when a report came out wrong, reconstructing which agent's decision caused the problem required manually cross-referencing three separate trace trees, none of which had a shared parent.

The fix was implementing what we called a "workflow ID": a UUID generated at the entry point of any multi-agent task and propagated explicitly to every agent that participated in that task, regardless of how many hops away from the origin they were. This workflow ID was added as a span attribute on every agent span and as a field in every log line produced during the task. With it, querying all spans and logs associated with a single end-to-end agent workflow became a single filter, not a manual correlation exercise.
Python

    # Propagating workflow context across agent boundaries
    from dataclasses import dataclass

    from opentelemetry import trace

    @dataclass
    class AgentWorkflowContext:
        workflow_id: str       # stable across all agents in a task
        parent_agent: str      # which agent delegated this task
        delegation_depth: int  # how many hops from the origin agent

    def delegate_to_agent(target_agent, task, wf_ctx: AgentWorkflowContext, current_agent: str):
        # current_agent is the name of the agent doing the delegating. It becomes the
        # child's parent_agent; reusing wf_ctx.parent_agent would never advance the chain.
        child_ctx = AgentWorkflowContext(
            workflow_id=wf_ctx.workflow_id,  # same ID propagates
            parent_agent=current_agent,
            delegation_depth=wf_ctx.delegation_depth + 1,
        )
        # Record the hand-off on the delegating agent's active span.
        span = trace.get_current_span()
        span.set_attribute('workflow.id', child_ctx.workflow_id)
        span.set_attribute('workflow.depth', child_ctx.delegation_depth)
        span.set_attribute('workflow.parent_agent', child_ctx.parent_agent)
        return target_agent.run(task, child_ctx)

The delegation depth attribute turned out to be more useful than expected. In one debugging session, seeing that a particular tool call was happening at delegation depth 4 (four hops from the original request) immediately flagged that the agent system had gone significantly deeper into a recursive subtask chain than intended. Without that attribute, the trace looked like any other tool call.

Semantic Logging: What Happened vs. Why

Standard logging captures what happened. For agent systems, you also need to capture the agent's stated reasoning at key decision points. This doesn't require exotic infrastructure; it requires a logging discipline that treats the model's reasoning output as a first-class log field rather than as data to be discarded after use.

In practice, this means that when an agent produces a reasoning step leading to a significant action (such as calling an external tool, delegating to another agent, producing a final output, or deciding to abandon a task), the full reasoning text should be logged alongside the action. Tag it with the workflow ID, the agent ID, and a decision type label.
This produces a semantic audit trail that lets you answer the question, "Why did the agent do X?" without having to reconstruct it from indirect evidence.

The objection is storage cost, and it's legitimate. Reasoning outputs from LLMs are verbose. Storing them for every execution at scale is expensive. The practical answer is tiered retention: store full reasoning logs for executions that result in errors, for high-stakes actions (anything that sends an external communication, modifies a record, or triggers a financial transaction), and for a random sample of normal executions for baseline calibration. For the rest, store only the decision label and the action taken. This keeps costs manageable while preserving investigative capability for the cases that matter.

What I'd Do Differently

In hindsight, the single most important decision to make before deploying an agent in production is defining what a 'high-stakes action' means for that specific agent and ensuring those actions always produce full semantic logs regardless of cost. Initially, we did not define per-action logging requirements; we logged every action type the same way. As a result, when an agent took an unexpected external action, we had no reasoning log to explain it.

I'd also invest earlier in a replay capability: the ability to take a logged prompt context and re-run the agent over it with a modified model or prompt configuration to verify that a fix actually changes the behavior that caused a problem. Without a replay capability, any changes you make are based on hope rather than verification. With it, you can verify that the reasoning path actually differs before deploying.

When should you not build this level of observability? If you're prototyping or running an agent in a low-stakes, easily reversible context, the overhead of full semantic logging and workflow ID propagation is probably premature. Build it before you go to production with consequential actions, not after.
The cost of retrofitting it once an unexplained agent decision has already caused a real problem is significantly higher than building it in from the start.

Key Takeaways

- Standard distributed tracing captures what happened in agent systems but not why. Semantic logging of reasoning outputs at decision points is the missing layer; treat it as first-class infrastructure, not optional verbosity.
- Propagate a workflow ID across all agents in a multi-agent task. Without it, correlating signals across agent boundaries requires manual effort that fails under incident pressure.
- Separate high-cardinality prompt content from trace spans. Store the full prompt context in blob storage keyed by trace and span ID, and reference it from the span. This preserves investigative capability without bloating your tracing backend.
- Define high-stakes actions before deployment and ensure they always generate complete semantic logs. The executions you most need to investigate are exactly the ones where missing reasoning context hurts most.

Conclusion

Observability for AI agents is not a solved problem. The tooling ecosystem is immature, the standards are still forming, and most teams are improvising solutions on top of infrastructure designed for deterministic services. That's not a reason to skip it; it's a reason to be deliberate about what you build, because the defaults will leave you blind at exactly the wrong moment.

The deeper challenge is that agent observability isn't just a technical problem. It's also an accountability problem. When an AI agent takes a consequential action, someone needs to be able to answer the question of why, not just for debugging purposes, but for the humans affected by the decision and for the organization responsible for the system. A vendor who received a rejection email deserves a better answer than "the agent decided that." The infrastructure to produce that answer has to be designed in, not bolted on.
The open question I keep returning to: as agent systems become more capable and their reasoning chains longer and more complex, at what point does the volume and opacity of their decision-making exceed our practical ability to observe and understand it? We may be building systems that are genuinely difficult to audit, not because of missing tooling but because of fundamental limits on human comprehension of long reasoning chains. What does accountability look like then?
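As a companion to the semantic logging and tiered retention discussion in this article, here is a minimal sketch of how the retention decision could be expressed in code. The action names in `HIGH_STAKES_ACTIONS`, the record fields, and the 5% sample rate are assumptions for illustration; each agent needs its own definition of "high-stakes."

```python
import json
import logging
import random

logger = logging.getLogger("agent.semantic")

# Hypothetical action names; define this set per agent before deployment.
HIGH_STAKES_ACTIONS = {"send_external_email", "modify_record", "financial_transaction"}


def should_keep_full_reasoning(action, had_error, sample_rate=0.05, rng=random.random):
    """Tiered retention: always keep errors and high-stakes actions,
    and keep a random sample of everything else for baseline calibration."""
    if had_error or action in HIGH_STAKES_ACTIONS:
        return True
    return rng() < sample_rate


def build_decision_log(workflow_id, agent_id, decision_type, action, reasoning,
                       had_error=False, sample_rate=0.05, rng=random.random):
    """Build one semantic log record; reasoning text is attached only when retained."""
    record = {
        "workflow_id": workflow_id,
        "agent_id": agent_id,
        "decision_type": decision_type,
        "action": action,
        "had_error": had_error,
    }
    if should_keep_full_reasoning(action, had_error, sample_rate, rng):
        record["reasoning"] = reasoning
    return record


def log_decision(**fields):
    """Emit the record as one JSON log line and return it."""
    record = build_decision_log(**fields)
    logger.info(json.dumps(record))
    return record
```

Passing `rng` as a parameter keeps the sampling testable. In production you might prefer a deterministic sample keyed on the workflow ID so that every agent in a workflow makes the same retention decision.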

Does 100% Code Coverage Mean Tested?

By Stelios Manioudakis


Debugging and Performance Tuning in Pega Using PAL, Tracer, and Clipboard

By Anil guntupalli

Scaling Teams, Scaling Systems: Unlocking Developer Productivity With Platform Engineering

By Ammar Husain


Differential Flamegraphs in Java in Jeffrey Microscope

In the first article, we got started with Jeffrey Microscope and learned to read a single flamegraph: the timeseries, search, tooltips, and the allocation and wall-clock variants. This time we build directly on that foundation and tackle one of Jeffrey's most powerful features for real-world performance work: the differential flamegraph, which compares two recordings and shows you precisely what changed between them.

A single flamegraph tells you where your application spends its time. But the questions that matter most in practice are comparative:

- Did my optimization actually help?
- What did this refactor make slower?
- Where did the extra allocations come from?

Staring at two flamegraphs side by side and trying to spot the difference by eye is slow and error-prone: the graphs are large, and the interesting change is often a few frames buried deep in the stack. Jeffrey Microscope's differential flamegraph solves this by overlaying two recordings into a single graph and coloring every frame by how it changed:

- Red – where the primary profile spends more than the baseline (a regression).
- Green – where it spends less (an improvement).
- Deeper shades – brand-new and fully-removed frames, called out distinctly.

In this article, we'll take the two recordings from the previous post (the optimized direct serialization path and the garbage-heavy DOM path), set one as a secondary profile, and let the differential view pinpoint exactly which methods account for the difference.

We start exactly where the first article left off. Open the optimized recording, jeffrey-persons-direct-serde-cpu.jfr.lz4, and head to the Visualization tab. This is our primary profile, the same CPU flamegraph we explored last time. On its own, it shows where the direct serialization path spends its time, but to turn it into a comparison we need a second recording to diff it against. That's what the Secondary Profile slot in the top bar is for; it is currently marked NOT SET.
In the next step we'll point it at the DOM-based recording and unlock the Differential view in the sidebar.

Supported Event Types

With the secondary set, the Differential page mirrors the Primary one (a card per event type), but each card now shows both sides at once. The value on the left is the baseline (the secondary profile), the value on the right is the primary, and the badge is the relative change from one to the other: a red +N% means the primary has more of that event than the baseline (grew), a green −N% means it has less (shrank). This lets you gauge the overall shift before opening a single graph, and tell whether the change is a rounding-error wobble or a real regression worth investigating.

Jeffrey supports differential flamegraphs for every sample-based event it can render normally:

- Execution Samples – total CPU work. More samples means more time spent on-CPU (37.3K → 39.7K, +6.4% here).
- Wall-Clock Samples – elapsed time including waiting and blocking, which can move independently of CPU (5.0M → 4.4M, −12.4%).
- Allocation Samples – memory pressure; switch Use Total Allocation to compare bytes rather than sample count and see the true allocation cost (27.47 GiB → 30.45 GiB, +10.9%).
- CPU-Time Samples and Method Traces – empty here, but diff identically when the recordings contain them.

Each of these numbers is just the headline; the flamegraph below breaks the same delta down frame by frame, so you can see which methods drove it. Click View Flamegraph on the Execution Samples card to open the differential CPU view.

Reading the Differential Flamegraph

Opening the differential view feels familiar (same timeseries, search, and tooltip as a normal flamegraph), but everything now encodes two profiles at once:

- The summary bar at the top reports the totals side by side: baseline 35,472 vs primary 39,668, a net +4,196 (+11.83%) flagged as REGRESSED.
That's the headline: the primary run did more on-CPU work overall.
- The timeseries overlays both recordings as two lines, Primary in blue and Secondary (baseline) in red, so you can see where in time the profiles diverge, not just that they differ.
- The flamegraph colors encode the per-frame change: pale pink/green for frames that shifted a little, and saturated deep red/deep green for frames that exist in only one profile (brand-new work versus work that disappeared entirely).

The payoff is in the last two screenshots. Because the optimized and unoptimized paths run through differently-named classes, the diff renders them as a matched pair: the deep-red EfficientPersonService.getNPersons subtree (new in the primary) sitting right next to the deep-green InefficientPersonService subtree (gone from the primary). You're literally seeing the code swap, top to bottom. And hovering a shared frame quantifies it precisely: the tooltip on PersonController.getNPersons shows baseline 854 → primary 525, an IMPROVED −329 (−38.52%) for that endpoint's own path.

The differential CPU flamegraph overlays both recordings: the timeseries plots the primary (blue) against the secondary baseline (red), and the summary bar reports baseline 35,472 → primary 39,668, a net +4,196 (+11.83%) marked REGRESSED.

The merged flamegraph colors every frame by its change. The shared Tomcat, Coyote, and Spring layers stay mostly pale pink (small shifts), while the summary bar keeps the overall +11.83% delta in view.

The flamegraph also captures the JVM's own threads, not just your request path: the CompileBroker / C2Compiler stacks on the left are JIT compilation, and garbage-collection activity shows up the same way. Comparing them across the two recordings tells you whether either run triggered extra spikes in JIT or GC work, a common hidden cost when one version allocates more or churns more code.
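The deltas shown in the summary bar and tooltips are plain arithmetic on the two sample counts. The sketch below reproduces the two figures quoted above; the function name `frame_delta`, the `UNCHANGED` label, and the handling of a zero baseline are assumptions made for illustration, not Jeffrey's actual implementation.

```python
def frame_delta(baseline: int, primary: int):
    """Return (absolute change, percent change vs. baseline, label).

    Percent change is relative to the baseline (secondary) profile.
    A frame that is absent from the baseline has no meaningful percentage,
    so None is returned for it in that case.
    """
    diff = primary - baseline
    percent = None if baseline == 0 else round(diff / baseline * 100, 2)
    if diff > 0:
        label = "REGRESSED"
    elif diff < 0:
        label = "IMPROVED"
    else:
        label = "UNCHANGED"  # hypothetical label for an exact match
    return diff, percent, label


# Figures quoted in the article: the whole-profile summary bar and the
# PersonController.getNPersons tooltip.
summary = frame_delta(35_472, 39_668)
tooltip = frame_delta(854, 525)
print(summary)
print(tooltip)
```

Note that the whole profile can regress while an individual endpoint improves, which is why the per-frame breakdown matters more than the headline number.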
Deeper into the stack, the two implementations separate out: saturated red columns mark work that is new in the primary profile, while the deep-green columns are paths that existed only in the baseline and disappear in the primary.

The optimized EfficientPersonService path (red, added) sits beside the removed InefficientPersonService path (green). Hovering the shared PersonController.getNPersons frame quantifies the change exactly: baseline 854 → primary 525, an IMPROVED −329 (−38.52%).

Summary

From here, try the same workflow on the Wall-Clock and Allocation differential flamegraphs. The steps are identical, and each reveals a different dimension of the change: time spent waiting, and bytes allocated.

Thank you for reading! To go deeper, visit the Jeffrey pages, or reach out to me directly on LinkedIn. I'd love to hear your feedback. And stay tuned: in the next article, we'll step away from flamegraphs and explore one of Jeffrey's JVM Internals views to dig into what the runtime does under the hood.

By Petr Bouda


Building Evaluation, Cost Governance, and Observability for a Multi-Agent System in Microsoft Foundry

This closes out the series' capstone: the multi-agent customer support system built across Parts 6-9, now hardened with evaluation, cost governance, and observability so it can actually run in production with an on-call rotation behind it, not just in a demo environment.

Continuous Evaluation Pipeline

Evaluation: Measuring Quality Continuously, Not Just at Launch

A one-time eval before launch tells you nothing about drift once real traffic (and real edge cases) starts hitting the system. Set up a continuous evaluation pipeline using a G-Eval-style approach, where a separate model scores production outputs against explicit criteria:

Python

    import json

    eval_criteria = {
        "correctness": "Does the response accurately reflect the order/refund status retrieved from the tools?",
        "escalation_appropriateness": "If the case was ambiguous or high-risk, did the agent escalate to a human rather than resolving it alone?",
        "tone": "Is the response professional and appropriately empathetic given the customer's stated frustration level?",
    }

    def geval_score(response, context, criterion_name, criterion_description, eval_model_client):
        # Literal braces in an f-string must be doubled on both sides.
        prompt = f"""Evaluate the following response against this criterion:
    {criterion_description}

    Context: {context}
    Response: {response}

    Score from 1-5 and give one sentence of reasoning.
    Return JSON: {{"score": int, "reasoning": str}}"""
        result = eval_model_client.complete(prompt)
        return json.loads(result)

    def run_continuous_eval(sample_of_production_traffic, eval_model_client):
        # The eval client is passed in explicitly rather than read from a global.
        scores = {crit: [] for crit in eval_criteria}
        for interaction in sample_of_production_traffic:
            for crit_name, crit_desc in eval_criteria.items():
                result = geval_score(interaction.response, interaction.context,
                                     crit_name, crit_desc, eval_model_client)
                scores[crit_name].append(result["score"])
        # Average per criterion; None when the sample was empty.
        return {crit: (sum(vals) / len(vals) if vals else None)
                for crit, vals in scores.items()}

Sample a percentage of real production traffic daily (not just synthetic test cases) and track these scores over time.
A drop in escalation_appropriateness specifically is the metric most worth alerting on: it's a direct proxy for the system doing something risky without a human check, which is exactly the failure mode the recovery and authorization work in Parts 7 and 9 was designed to prevent.

Cost Governance: PTU vs. Pay-as-You-Go, Decided With Real Math

For a system with predictable, sustained traffic (which a production support system should have), provisioned throughput (PTU) usually beats pay-as-you-go on cost, but the crossover point depends on your actual volume:

Python

def compare_ptu_vs_payg(monthly_token_volume, ptu_monthly_cost, payg_per_1k_tokens):
    # Pay-as-you-go cost scales linearly with tokens; PTU is a flat monthly fee
    payg_monthly_cost = (monthly_token_volume / 1000) * payg_per_1k_tokens
    return {
        "payg_monthly": payg_monthly_cost,
        "ptu_monthly": ptu_monthly_cost,
        "recommendation": "ptu" if ptu_monthly_cost < payg_monthly_cost else "payg",
        # Monthly token volume at which both options cost the same
        "breakeven_tokens": (ptu_monthly_cost / payg_per_1k_tokens) * 1000,
    }

Run this quarterly, not once: traffic volume for a maturing production system tends to grow, and the PTU crossover point is usually reached faster than teams expect once an agent system is handling a meaningful fraction of real support volume.

Chargeback Tagging: Attributing Cost to the Right Owner

With multiple agents (fraud-check, refund, notification) potentially running on shared compute, tag at the project level so cost attribution doesn't require manual reconciliation later:

Python

resource_tags = {
    "business-unit": "customer-support",
    "system": "multi-agent-refund-flow",
    "environment": "production",
    "cost-center": "CC-4471",
}

Apply these consistently at the Azure resource level (not just in application logs) so cost management reports can be filtered directly without a separate reconciliation step. This is the difference between a chargeback model that's usable monthly versus one that requires a data-engineering project every quarter.
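To see why consistent tags matter, here is a minimal sketch of the rollup a chargeback report performs. The row shape (`resource`, `cost`, `tags`) and the function `rollup_cost_by_tag` are assumptions made for illustration; they do not mirror any particular Azure export format.

```python
from collections import defaultdict

def rollup_cost_by_tag(cost_rows, tag_key):
    """Sum cost per value of tag_key; rows missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for row in cost_rows:
        owner = row.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += row["cost"]
    return dict(totals)

# Hypothetical exported cost rows
rows = [
    {"resource": "refund-agent", "cost": 120.0, "tags": {"cost-center": "CC-4471"}},
    {"resource": "fraud-check-agent", "cost": 80.0, "tags": {"cost-center": "CC-4471"}},
    {"resource": "scratch-vm", "cost": 15.5, "tags": {}},
]
print(rollup_cost_by_tag(rows, "cost-center"))
```

Anything that ends up under "untagged" is spend that somebody would have to reconcile by hand, which is the work the tagging policy is meant to eliminate.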
Observability: The On-Call-Ready Dashboard

Pull together the tracing work from Part 2, the authorization logging from Part 9, and the eval scores above into a single dashboard an on-call engineer can actually use at 2 am:

- Request-level trace: which agents were invoked, in what order, with what latency per step (from Part 2's tracing patterns).
- Authorization denials: any agent attempting an action outside its scope (from Part 9). A spike here is a security signal, not just a bug signal.
- Escalation rate: percentage of interactions escalated to a human, tracked against the eval-measured escalation_appropriateness score. A rising escalation rate paired with a falling appropriateness score means the system is escalating things it shouldn't, which is its own kind of problem.
- Cost burn rate: token consumption against the PTU/PAYG budget, with an alert threshold before month-end overage becomes a surprise.

| Dashboard signal | Source | What it indicates |
| --- | --- | --- |
| Request-level trace | Part 2 tracing patterns | Latency and failure location per agent step |
| Authorization denials | Part 9 identity logging | Potential security issue, not just a bug |
| Escalation rate vs. appropriateness score | Eval pipeline + agent logs | Whether the system is escalating correctly |
| Cost burn rate | Azure Cost Management tags | Budget overage risk before month-end |

A Concrete Incident: What the On-Call Runbook Actually Looks Like

All the observability infrastructure above is only as good as the runbook someone follows at 2 am when an alert fires. Here's a worked example tying every prior post together into one incident response flow, using a realistic trigger: the escalation-rate alert from the dashboard fires, showing escalations up 3x over baseline in the last 30 minutes.

Step 1: check the authorization denial log (Part 9). A spike in escalations correlated with a spike in authorization denials usually means an agent is attempting actions outside its scope, possibly a misconfigured deployment, possibly a prompt-injection attempt.
This is checked first because it's the highest-severity possible cause.

Step 2: check the circuit breaker state (Part 7). If a downstream dependency (the fraud-check API, say) is degraded, the circuit breaker should already be routing to human escalation rather than retrying. Confirm it's open and working as designed, not that agents are timing out repeatedly without the breaker engaging.

Step 3: check the eval scores for escalation_appropriateness (this post). If the score is stable and escalations are simply more frequent, this may be a legitimate traffic pattern (a genuinely higher-risk cohort of requests, e.g., during a known incident like a payment processor outage) rather than a system problem. If the score is dropping alongside the escalation spike, the system's judgment about when to escalate may itself be degrading. This points back toward Part 5's schema validation and Part 7's handoff logic as places to check for a recent regression.

Step 4: check recent deployments against the canary process (Part 2). Cross-reference the timestamp of the spike against any recent flow, model version, or schema change. If a change went out in the last few hours without full canary ramp-up, that's the most likely single cause, and rollback is usually faster than root-causing forward.
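The timestamp cross-reference in Step 4 could look something like the sketch below. The deployment record shape (`name`, `deployed_at`, `canary_complete`), the function name, and the six-hour lookback are hypothetical; a real version would read from whatever deployment log the team keeps.

```python
from datetime import datetime, timedelta

def deployments_before_spike(deployments, spike_time, lookback_hours=6):
    """Return deployments shipped within lookback_hours before the spike, newest first."""
    window_start = spike_time - timedelta(hours=lookback_hours)
    recent = [d for d in deployments if window_start <= d["deployed_at"] <= spike_time]
    return sorted(recent, key=lambda d: d["deployed_at"], reverse=True)

# Hypothetical deployment log and alert time
spike = datetime(2025, 1, 15, 2, 0)
deployments = [
    {"name": "refund-flow-v12", "deployed_at": datetime(2025, 1, 14, 23, 30), "canary_complete": False},
    {"name": "schema-v4", "deployed_at": datetime(2025, 1, 13, 10, 0), "canary_complete": True},
]

suspects = deployments_before_spike(deployments, spike)
for d in suspects:
    flag = "full canary" if d["canary_complete"] else "NO full canary ramp-up"
    print(d["name"], d["deployed_at"].isoformat(), flag)
```

Sorting newest first puts the most likely rollback candidate at the top of the list, which matters when the person reading it was asleep ten minutes earlier.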
Python

def incident_triage(alert_context):
    # Checks run in severity order: security signals first, bad deploys last.
    # The check_* functions are defined elsewhere; each takes the alert context
    # and returns a dict that may carry a "severity" key.
    checks = [
        ("authorization_denials", check_authorization_spike),
        ("circuit_breaker_state", check_circuit_breaker_status),
        ("eval_score_trend", check_escalation_appropriateness_trend),
        ("recent_deployments", check_recent_flow_changes),
    ]
    findings = {}
    for name, check_fn in checks:
        findings[name] = check_fn(alert_context)
        # Stop at the first critical finding and act on it
        if findings[name].get("severity") == "critical":
            return {"triage_result": name, "findings": findings, "action": "immediate_rollback_or_escalation"}
    return {"triage_result": "inconclusive", "findings": findings, "action": "manual_investigation"}

Writing this ordering down explicitly (check security signals before assuming it's a quality regression, check for a bad deploy before deep root-causing) is what turns nine posts' worth of individually reasonable safeguards into something an on-call engineer who didn't build the system can actually execute under pressure.

References

- Azure AI Foundry evaluation SDK: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk
- G-Eval and LLM-as-judge evaluation approach: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability
- Provisioned Throughput Units (PTU) for Azure OpenAI: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
- Azure Cost Management and tagging: https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-best-practices

By Jubin Abhishek Soni

CORE

Building Cross-Team SLO Contracts for Performance Accountability

If you have ever built a popular website on a microservice architecture, you have most likely run into this situation: your latency starts creeping up. You check your dashboard and notice that P90 latency has increased by 300ms over the past two weeks. You dig into the individual spans and find that one of your upstream dependencies, owned by another team, has gotten slower. You file a ticket or even page them. They agree to look into it, but they also say that everything looks fine on their end. Meanwhile, your service-level objective (SLO) is still being violated.

This is exactly the accountability gap that latency monitoring alone cannot close. The page owner is responsible for the end-to-end latency of the page but depends on services they do not fully control. The dependency owner, quite fairly, has no formal agreement to maintain a specific latency profile for each consumer. Everyone is technically meeting their own objectives while the customer experience goes down the drain.

For a Tier-0 consumer surface at DoorDash, I ran a latency maintenance, improvement, and optimization plan for multiple years. This page relied on more than twenty backend microservices, each owned by a different team. Beyond all the optimizations, a good SLO contract between our service and the services it depended on is what helped maintain the performance.

What exactly is an SLO contract?

You can think of an SLO contract as a formal written agreement between a consumer service and one of the services it depends on. It identifies one or more endpoints that have a latency budget at a given percentile, how that budget is measured, and what happens when the budget is not met.
It is usually agreed upon by the teams' engineering managers and stored in a Google Doc, GitHub, or any other durable location. Essentially, it turns a vague expectation into a definite commitment between teams. The consumer team gets a reliable figure to plan its own e2e latency budget around, while the dependency team gets a clear understanding of what it needs to defend and the freedom to optimize everything else. In this sense, the SLO is the minimum performance level that one team guarantees to another.

Elements of a contract

Keep it brief. A good contract fits on a single page. Nobody reads lengthy contracts, and contracts that are not read do not change behavior. The main components to include are (but are not limited to):

- the endpoint or RPC method being measured
- the latency target and percentile
- the measurement method and the exact name of the metric
- the traffic conditions under which the contract applies
- the duration of the contract
- the escalation path when the contract is violated

Here's an example:

SLO Contract: Recommendations API -> Promotions Service
Endpoint: POST /v1/promotions/lookup
Target: p95 latency ≤ 800ms
Measurement: server-side histogram, recorded at Promotions Service
  metric: promotions_lookup_duration_seconds, bucket 0.8
Conditions: traffic up to 12,000 RPS, payload size ≤ 4KB
Effective: 2024-Q3 through 2025-Q2 (renewed quarterly)
Escalation: If breached for two consecutive weeks at p95, Promotions on-call posts in #recs-promotions-slo with root cause within 5 business days. Sustained breach (4+ weeks) triggers a joint review with both EMs.
Signed: [Recs EM] [Promotions EM]

The hard part: How to negotiate a contract

Writing the contract is not really the challenge; you can draw on past data and future plans. The difficult part is getting all the teams to agree on the numbers. I have negotiated quite a few of these, and I want to share some points that really worked.

Begin with your e2e budget. If your web page has a 1.5-second p95 latency target, and the user request flow calls four downstream services one after another, then those four services together can take no longer than about 1.2 seconds, leaving time for network, serialization, and your own processing. Do the math before the meeting. If you haven't worked out the breakdown of your own budget, you're probably not ready to negotiate.

Have your data ready and explain the impact in a way everyone can follow. Get the actual latency distribution for the dependency endpoint over the last 30 days. Show p50, p95, and p99. Then explain what the actual penalty is. For instance, if a 300ms delay on your page costs the company $20M in annualized revenue, present that figure and provide evidence. This is a number that both engineering teams and management can agree upon. Understanding each other's position is very important.
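The "do the math before the meeting" step can be sketched in a few lines. The version below uses the article's numbers (a 1.5-second target with about 300ms reserved for network, serialization, and local processing); the function name, the service names, and the equal weights are assumptions for illustration. Percentile latencies do not add exactly, so treating per-service p95 budgets as additive is a deliberate simplification that gives a starting point for negotiation, not a guarantee.

```python
def split_latency_budget(e2e_target_ms, overhead_ms, weights):
    """Split what remains of an end-to-end target across sequential
    dependencies in proportion to the given weights."""
    remaining = e2e_target_ms - overhead_ms
    if remaining <= 0:
        raise ValueError("overhead consumes the entire end-to-end budget")
    total_weight = sum(weights.values())
    return {name: remaining * w / total_weight for name, w in weights.items()}

# Hypothetical downstream services, weighted equally
budgets = split_latency_budget(
    e2e_target_ms=1500,
    overhead_ms=300,
    weights={"promotions": 1, "catalog": 1, "pricing": 1, "eta": 1},
)
for service, budget_ms in budgets.items():
    print(f"{service}: {budget_ms:.0f} ms")
```

Unequal weights are the more realistic case: a service that does heavier work can be given a larger share, as long as the shares still sum to the remaining budget.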
Operationalizing it

Signing a contract and then forgetting about it is not where you want to end up. There are certain things you must do, such as maintaining a dashboard with the right metrics and the right histogram bucket boundaries for those metrics. I would suggest reviewing compliance regularly in a monthly performance review meeting. The dependency team, for its part, can review all of its outbound contracts in its own ops review.

Treat any breach as a real signal. When a contract is breached, the escalation path should actually be triggered. A breach should be taken seriously, and a lightweight postmortem may be justified to root-cause and fix what was broken.

Most contracts also include an expiry clause. At renewal, the two teams examine compliance over the entire period, determine whether the assumptions made at signing are still valid, and then decide whether to renew the contract as it is, tighten it, loosen it, or terminate it. If the dependency is no longer on the critical path that contributes to latency, it may be fine to terminate the contract.

The compounding effect

Writing your first contract is the most challenging part. A brand new contract involves a lot of explaining, and it is where the ground rules get set. The second one is a piece of cake. On the platform I worked on, the practice started with a single contract between the homepage team and one downstream service. Very quickly, this became the standard approach across the organization whenever there was an inter-team latency dependency on a critical path. The payload size tracking tool built to support the contract became a piece of infrastructure used by every team, not just ours.

Even the contract system itself turned out to be the reference model for other departments of the company when they got to the point of formalizing their own performance accountability.

When should you not use this

I wouldn't recommend SLO contracts for every situation. They're overhead, and overhead has to be justified.

- Skip them for small organizations. If the consumer and dependency are owned by the same team or by two teams in the same group with the same manager, a contract adds process without changing incentives. A regular sync meeting is fine.
- Skip them for fast-changing dependencies. If the dependency service is in early development and its API surface is still in flux, wait until the dependency stabilizes.
- Skip them for low-criticality paths. The endpoint has to be on a critical path that matters to a real SLO.

By Ujjwal Gulecha

Performance Testing RAG Applications: Complete Engineering Guide

In this blog post, we will see how to perform a performance test on a retrieval-augmented generation (RAG) application properly, covering both speed and correctness, and how to wire both into a CI/CD pipeline so regressions get caught before they reach production. Performance testing a RAG application requires two separate testing gates: one for speed and one for answer quality. Traditional load testing tools measure response times but cannot detect hallucinations, where a model returns fast but factually incorrect answers grounded in fabricated context rather than retrieved documents. The guide demonstrates using k6 for load testing end-to-end latency and DeepEval for evaluating faithfulness and answer relevancy using an LLM-as-judge approach. Both gates are integrated into a GitHub Actions CI/CD pipeline so regressions in either performance or output quality are caught automatically on every pull request before reaching production. If you've come from a JMeter or k6 background as I have, your first instinct with a RAG endpoint is probably to point a load test at it and check response times. That gets you halfway there. A RAG app can return a fast, confident, completely wrong answer, and a plain load test will never tell you that. You need two testing surfaces, not one: performance and quality. This guide covers both, using a single running example throughout: a documentation assistant that answers "How do I run JMeter in non-GUI mode?" against a small knowledge base. Why RAG Breaks Traditional Load Testing Assumptions A conventional API returns a complete response, and you measure the round trip. A RAG endpoint does two expensive things before it answers: it retrieves context from a vector store or search index, then it streams a generated response token by token. That second part matters a lot. 
A single request can stream hundreds of tokens over several seconds, so "request duration" as a single number hides two very different problems: how long the model took to start answering, and how fast it generated once it started. A system with slow startup but fast generation feels broken to someone typing in a chat UI. A system with fast startup but slow generation is fine for a quick question but painful for a long document summary. Averaging those together tells you nothing useful.

The Two Testing Surfaces: Performance and Quality

I think of RAG testing as two separate gates that happen to run against the same endpoint.

Performance answers: how fast is it, and does it hold up under load? This is k6's job, same as any other API load test, just with LLM-specific metrics layered on.

Quality answers: is the answer actually grounded in what got retrieved, or did the model make something up? This is where DeepEval comes in, scoring faithfulness and relevancy on every response using an LLM as the judge.

Neither gate alone tells the full story. A fast RAG app that hallucinates is worse than a slow one that's accurate, and a perfectly grounded app that takes eight seconds to respond will lose users regardless of correctness.

Metrics That Actually Matter

Performance Metrics

| Metric | What it tells you |
| --- | --- |
| TTFT (Time to First Token) | How long a user stares at a blank screen before anything appears |
| ITL (Inter-Token Latency) | How smoothly tokens stream once generation starts |
| Tokens/sec | Generation speed, matters most for long-form answers |
| p95 / p99 latency | The tail experience, not the average one |

TTFT is the most user-visible number in the whole system, and it's also the metric most classic load testing tools weren't built to isolate, since they were designed for atomic request/response cycles, not streams.
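To make the streaming metrics concrete, here is a minimal sketch of how TTFT, inter-token latency, and tokens/sec fall out of per-token arrival timestamps. The function name and input shape are assumptions for illustration; a real client would record these timestamps while reading the stream.

```python
def streaming_metrics(request_sent_at, token_arrival_times):
    """Derive streaming latency metrics from timestamps in seconds.

    request_sent_at: when the request was sent.
    token_arrival_times: arrival time of each streamed token, in order.
    """
    if not token_arrival_times:
        raise ValueError("no tokens received")
    # TTFT: the wait before anything appears on screen
    ttft = token_arrival_times[0] - request_sent_at
    # ITL: gaps between consecutive tokens once generation has started
    gaps = [b - a for a, b in zip(token_arrival_times, token_arrival_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Tokens/sec over the generation phase only, excluding the TTFT wait
    generation_time = token_arrival_times[-1] - token_arrival_times[0]
    tokens_per_sec = len(gaps) / generation_time if generation_time > 0 else float("nan")
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tokens_per_sec": tokens_per_sec}

# Hypothetical stream: first token after 1.2 s, then one token every 0.1 s
metrics = streaming_metrics(0.0, [1.2, 1.3, 1.4, 1.5, 1.6])
print(metrics)
```

Keeping TTFT out of the tokens/sec calculation is what separates "slow to start" from "slow to generate", the two problems a single request-duration number blends together.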
Quality Metrics

| Metric | What it tells you |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context, or invented |
| Answer relevancy | Does the answer address the actual question, or just sound plausible |
| Context precision | Did retrieval return the right chunks, ranked correctly |
| Context recall | Did retrieval miss anything the answer needed |

These four metrics carry most of the diagnostic weight in RAG evaluation. Faithfulness and answer relevancy live on the generation side; context precision and recall live on the retrieval side. When faithfulness is low but context recall is high, the retriever did its job, and the model ignored it; that's a prompting problem, not a retrieval problem. Worth knowing the difference before you go tuning the wrong component.

Hallucination Detection With DeepEval

I'm using DeepEval here instead of RAGAS mainly because DeepEval treats evaluations as pytest test cases with pass/fail thresholds, which is exactly the shape you need for a CI/CD gate. It also accepts any LLM as the judge model, so it isn't locked to one vendor even though our example app happens to use Gemini.

Here's what a test case looks like against our JMeter doc-assistant example:

Python

import os

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.models import GeminiModel

# Judge model that scores each response
judge_model = GeminiModel(
    model="gemini-3.5-flash",
    api_key=os.getenv("GEMINI_API_KEY"),
)

faithfulness_metric = FaithfulnessMetric(threshold=0.75, model=judge_model)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.8, model=judge_model)

def test_jmeter_non_gui_mode_answer():
    question = "How do I run JMeter in non-GUI mode?"
    # query_rag_app is defined elsewhere in the test suite; it calls the RAG app
    # and returns the answer together with the retrieved chunks.
    result = query_rag_app(question)
    test_case = LLMTestCase(
        input=question,
        actual_output=result["answer"],
        retrieval_context=result["retrieved_chunks"],
    )
    for metric in [faithfulness_metric, answer_relevancy_metric]:
        metric.measure(test_case)
        status = "PASS" if metric.success else "FAIL"
        print(f"[{status}] {metric.__class__.__name__}: {metric.score:.3f}")
    # Fail the test if any metric scored below its threshold
    failed = [m for m in [faithfulness_metric, answer_relevancy_metric] if not m.success]
    if failed:
        names = ", ".join(m.__class__.__name__ for m in failed)
        raise AssertionError(f"Metrics below threshold: {names}")

Run this with pytest, and it either passes or fails like any other test. That's the whole point: it turns a fuzzy "does the AI sound right" question into a binary CI/CD signal.

The test suite includes retry logic to handle transient Gemini API 503 errors, automatically retrying up to 3 times with exponential backoff. DeepEval generates both JUnit XML and HTML reports, making it trivial to wire into any CI system that understands pytest output.

Load Testing With k6 (and Why You Can't Measure TTFT Yet)

Here's where things get frustrating if you came here looking for a clean TTFT measurement story: the k6 SSE extension (xk6-sse) is not compatible with k6 v2. It targets go.k6.io/k6 v1, and until it gets updated, you're stuck choosing between k6 v2's improved architecture or the ability to measure streaming metrics properly.

So the companion repo does the pragmatic thing: it tests the /chat/complete endpoint instead of the /chat streaming endpoint, using k6's built-in http module. No custom binary, no extensions, just standard k6. The tradeoff is you lose true TTFT measurement, because /chat/complete waits for the full response before returning. What you get instead is end-to-end latency, which is still useful: it tells you if the system is slow, just not why it's slow.
Here's what the test looks like:

JavaScript

import http from 'k6/http';
import { Trend } from 'k6/metrics';
import { check } from 'k6';

const totalDuration = new Trend('total_duration_ms', true);
const tokensPerSecond = new Trend('tokens_per_second');
const BASE_URL = __ENV.RAG_APP_URL || 'http://localhost:8080';

export const options = {
  scenarios: {
    rag_chat: {
      executor: 'ramping-vus',
      stages: [
        { duration: '30s', target: 10 },
        { duration: '1m', target: 10 },
        { duration: '30s', target: 0 },
      ],
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<6000'],
    total_duration_ms: ['p(95)<6000'],
  },
};

export default function () {
  const startTime = Date.now();
  const res = http.post(
    `${BASE_URL}/chat/complete`,
    JSON.stringify({ query: 'How do I run JMeter in non-GUI mode?' }),
    {
      headers: { 'Content-Type': 'application/json' },
      timeout: '30s',
    },
  );
  const duration = Date.now() - startTime;

  // Parse once; a non-JSON body (for example an error page) should fail the check, not crash the iteration
  let answer;
  try {
    answer = JSON.parse(res.body).answer;
  } catch (e) {
    answer = undefined;
  }

  check(res, {
    'status 200': (r) => r.status === 200,
    'has answer': () => answer !== undefined,
  });
  totalDuration.add(duration);

  // Rough tokens/sec estimate from word count
  if (typeof answer === 'string' && duration > 0) {
    const words = answer.trim().split(/\s+/).length;
    tokensPerSecond.add((words / duration) * 1000);
  }
}

The test ramps from 0 to 10 virtual users over 30 seconds, holds for a minute, then ramps back down. Thresholds are set at p95 < 6000ms for both http_req_duration and the custom total_duration_ms metric.

When should you switch back to SSE? Watch the xk6-sse repo. Once it adds k6 v2 support, swap the endpoint from /chat/complete to /chat, add the SSE extension to your Dockerfile, and you'll get true TTFT measurement. Until then, this is the most pragmatic path forward: standard k6, no custom builds, just with the caveat that you're measuring end-to-end latency rather than streaming behavior.
The companion repo includes both endpoints in the Express app so you can switch when you're ready:

| Endpoint | Response | Status |
| --- | --- | --- |
| POST /chat | SSE stream | Ready for when xk6-sse supports k6 v2 |
| POST /chat/complete | Full JSON | Used by k6 and DeepEval today |

Wiring Both Gates Into CI/CD

Once both tests run locally, wiring them into GitHub Actions is mostly plumbing: start the app, wait for it to be healthy, run the k6 gate, run the DeepEval gate, both in parallel since they're independent.

YAML

name: RAG CI
on: [pull_request]
jobs:
  performance-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Write app env file
        run: |
          cat > app/.env << EOF
          GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }}
          GEMINI_MODEL=gemini-3.5-flash
          FILE_SEARCH_STORE_NAME=${{ secrets.FILE_SEARCH_STORE_NAME }}
          PORT=8080
          EOF
      - name: Start RAG app
        run: docker compose up -d --build app
      - name: Wait for health
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:8080/health; do sleep 2; done'
      - name: Run k6 load test
        run: docker compose --profile perf run --rm k6
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Write app env file
        run: |
          cat > app/.env << EOF
          GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }}
          GEMINI_MODEL=gemini-3.5-flash
          FILE_SEARCH_STORE_NAME=${{ secrets.FILE_SEARCH_STORE_NAME }}
          PORT=8080
          EOF
      - name: Start RAG app
        run: docker compose up -d --build app
      - name: Wait for health
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:8080/health; do sleep 2; done'
      - name: Run DeepEval tests
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: docker compose --profile quality run --rm deepeval

Both jobs run on every pull request. A PR that slows down response time and a PR that quietly makes the model hallucinate get caught the same way, before either reaches a reviewer's eyeballs, let alone production.
You'll need to add two secrets to your GitHub repo before the workflow will pass:

| Secret | Value |
| --- | --- |
| GEMINI_API_KEY | Your Gemini API key from https://aistudio.google.com/apikey |
| FILE_SEARCH_STORE_NAME | The store name from setup-store.js (format: fileSearchStores/your-store-id) |

Setting SLOs

I'm deliberately not giving you one universal latency number to target. I've seen guidance ranging from sub-second targets for chat-style RAG apps to 3-5 second budgets for more complex document analysis, and the right number for you depends entirely on your retrieval backend, your model, and what your users are actually doing. Run the load test against your own baseline first, then set thresholds off that baseline, not off a number from a blog post (including this one).

The example repo uses p95 < 6000ms as a starting point because that's what the test Gemini File Search RAG app achieves at 10 concurrent users with gemini-3.5-flash. Your mileage will vary dramatically based on:

- Model choice (flash vs pro, size of context window actually used)
- Retrieval backend (vector DB query time, number of chunks retrieved)
- Document size and complexity
- Network latency to your LLM provider

What you should track regardless of the exact number:

- p95 and p99 latency, not just the median. The tail experience is what users complain about.
- Latency at your expected concurrency, not at 1 user. RAG apps often degrade non-linearly under load because of retrieval bottlenecks.
- Faithfulness and answer relevancy trending over time, not just pass/fail on one run. A faithfulness score that's consistently 0.90 dropping to 0.78 is a signal even though both values pass the 0.75 threshold.

Wrap-Up

RAG performance testing is really two disciplines wearing one trench coat: classic load testing with LLM-aware metrics, and LLM-as-judge quality scoring that classic load testing tools were never built to do. Run them both, gate on both, and you'll catch the regressions that a speed-only test walks right past.
The current state of tooling isn't perfect; you can't measure TTFT with k6 v2 without writing your own SSE client, and LLM-as-judge scoring has its own consistency quirks, but it's good enough to catch regressions before production, which is the whole point of a CI/CD gate. Head to the companion GitHub repo for the full working app, k6 script, DeepEval tests, Docker Compose setup, and GitHub Actions workflow you can clone and run locally in under five minutes. Happy testing! Have you run into hallucination regressions that a pure load test missed? I'd like to hear how you caught them; reply on X or open an issue on the companion repo.

By NaveenKumar Namachivayam

CORE

Service Industry Evolution: Beyond 99.9% Uptime With Evolving Technology

For years, service organizations measured operational efficiency through response time. A machine failed, a ticket dropped, a technician arrived on-site, and the diagnosis and repair resolved the issue. Industries dependent on physical assets accepted this framework because they believed that it was not possible to avoid downtime. The benchmark for operational excellence depended on how quickly teams reacted after disruption occurred. That definition of service reliability has changed dramatically. Across industries such as ATM infrastructure, elevator systems, industrial manufacturing, HVAC networks, utilities, and connected buildings, uptime has evolved from a technical KPI into a direct business expectation. A malfunctioning elevator inside a commercial tower immediately affects tenant experience. An unavailable ATM network during a transaction spike escalates into a customer-service issue within minutes. In sectors where Service Level Agreements (SLAs) define accountability, even short-lived disruption can simultaneously create financial penalties, reputational damage, and customer churn. This growing pressure explains why organizations are restructuring service operations around predictive intelligence, telemetry ecosystems, and AI-driven operational visibility. Businesses targeting 99.9% uptime, commonly referred to as “three nines” availability, now operate within extremely narrow tolerance margins. Operationally, that benchmark allows for less than nine hours of annual downtime across distributed infrastructure environments involving connected assets, IoT systems, APIs, cloud platforms, and field-service networks. Connected Assets Are Reshaping Service Delivery The most significant transformation inside the service industry is happening beyond customer-facing applications. Machines themselves are becoming active participants in operational decision-making. 
Modern industrial assets continuously transmit telemetry related to vibration intensity, thermal behavior, airflow fluctuations, voltage variation, load cycles, and component stress. Earlier maintenance environments depended heavily on scheduled inspections and manual servicing intervals. Predictive ecosystems now analyze live operational behavior continuously, allowing organizations to identify abnormal machine patterns before a visible breakdown occurs. Large elevator manufacturers increasingly rely on telemetry-driven systems that can identify brake-pressure instability and motor stress, even before shutdown occurs inside high-footfall commercial environments. Similarly, ATM infrastructure providers now use transaction telemetry and demand analytics to forecast cash replenishment cycles proactively during high-volume periods. According to McKinsey & Company, predictive maintenance typically reduces machine downtime by 30 to 50% and increases machine life by 20 to 40%. IBM has also estimated that such predictive maintenance frameworks can improve labor productivity while helping organizations reduce downtime and improve asset reliability. Why Predictive Maintenance Is Replacing Reactive Service Models Traditional field-service environments created inefficiencies that organizations quietly accepted for years. Once a machine failed, there was a simultaneous trigger effect on multiple disconnected workflows. Service teams logged tickets, identified technicians, diagnosed faults, verified spare-part availability, and scheduled follow-up visits. Very often, engineers reached the site without the required replacement component, forcing additional visits and extending downtime unnecessarily. Predictive service ecosystems reduce that operational friction. Modern AI-enabled maintenance systems increasingly integrate telemetry platforms directly with workforce management tools, inventory systems, and service histories. 
Instead of merely identifying faults, these environments support operational decision-making before engineers physically engage with the asset.

| Operational event | Conventional workflow | Predictive AI-led workflow |
| --- | --- | --- |
| ATM cash depletion | Shortage identified after customer disruption | AI forecasts replenishment needs proactively |
| Elevator motor instability | Technician dispatched after operational failure | Telemetry predicts degradation before shutdown |
| HVAC compressor fluctuation | Complaint-driven escalation | Continuous monitoring detects abnormal pressure patterns |
| Industrial equipment fault | Manual diagnosis during site visit | AI identifies component failure in advance |

Modern industrial-service providers use AI-led technician orchestration systems that evaluate technician expertise, asset familiarity, certification levels, and spare-part availability before dispatch approval occurs. The objective is no longer faster repair cycles. Organizations are now trying to prevent customer-facing disruption before it begins.

Observability Is Replacing Conventional Monitoring

Earlier monitoring systems were designed primarily to identify whether the infrastructure was functioning properly. Modern service ecosystems require deeper operational visibility because enterprises no longer operate in isolated environments.

Most organizations now manage interconnected systems spanning IoT networks, enterprise applications, APIs, operational technology environments, cloud platforms, and legacy infrastructure. In such environments, isolated alerts provide limited value because operational disruption often emerges from cascading dependencies rather than a single infrastructure failure. Observability platforms address this challenge by correlating telemetry, metrics, traces, logs, and behavioral anomalies into unified operational intelligence layers.
Instead of simply reporting that a service has failed, these systems analyze why the disruption occurred, which systems contributed to it, and how the issue may spread across dependent environments. Platforms such as Datadog, New Relic, and Dynatrace have become central to enterprises attempting to maintain high-availability infrastructure environments.

Agentic Observability Is Introducing Autonomous Operations

The latest evolution in observability is moving beyond monitoring toward autonomous operational investigation. Dynatrace’s Davis AI engine, for example, maps infrastructure dependencies continuously across cloud and on-premises ecosystems. Instead of overwhelming operations teams with fragmented alerts, the platform isolates probable root causes and predicts which infrastructure layers may destabilize next.

Several enterprises are now moving toward what technology leaders describe as “agentic observability,” where AI systems autonomously investigate operational anomalies, correlate dependencies, recommend corrective action, and reduce the likelihood of SLA breaches before customers experience visible disruption. External observability platforms such as Site24x7 and UptimeRobot further strengthen operational assurance by continuously validating customer-facing service availability across regions.

According to Gartner, as predictive root-cause analysis matures across enterprise infrastructure ecosystems, enterprises that adopt AI-led operational intelligence frameworks can reduce incident-resolution timelines.

Why Incident Response Speed Has Become a Competitive Differentiator

Even the most advanced predictive ecosystems cannot eliminate every operational incident. What increasingly separates high-performing service organizations from reactive operators is the speed and coordination of their response environments once disruption begins. Modern incident-management platforms are now heavily automated.
Enterprises increasingly use AI-enabled response systems that identify affected services, create incident channels automatically, notify relevant engineers, and coordinate escalation processes in real time.

Several operational capabilities now determine how effectively organizations respond to high-severity incidents in modern uptime environments. These include:

- Faster escalation that reduces Mean Time to Resolution (MTTR) and minimizes SLA impact
- Automated response coordination that prevents communication delays during outages
- Intelligent alert routing that ensures the right teams engage immediately
- Slack-native response environments that improve collaboration across distributed teams
- AI-driven incident workflows that reduce operational confusion during high-severity failures

Platforms such as PagerDuty, Rootly, FireHydrant, and incident.io are helping enterprises streamline incident coordination significantly across distributed operational environments.

Uptime Architecture Is Becoming a Strategic Business Decision

Many enterprises still approach disaster recovery as a secondary IT function rather than a central business-continuity strategy. That approach is becoming increasingly risky in sectors where even brief disruption can affect customer trust and SLA commitments.

Modern uptime environments now depend heavily on resilience architecture designed to absorb disruption without affecting customer operations. Enterprises are therefore investing aggressively in multi-region infrastructure, failover environments, and redundancy frameworks intended to eliminate single points of failure.

Several financial services firms and industrial infrastructure providers now operate active-active environments where workloads are distributed simultaneously across multiple operational regions. If one region experiences instability, the remaining infrastructure absorbs traffic automatically with minimal disruption.
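As a rough illustration of the active-active behavior described above, the sketch below redistributes traffic shares across the regions that still pass a health check. The region names, the `Region` type, and the `rebalance` function are hypothetical; real deployments usually delegate this decision to a global load balancer or DNS-based routing rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str      # hypothetical region label
    weight: float  # traffic share when every region is healthy
    healthy: bool  # result of the most recent health check


def rebalance(regions: list[Region]) -> dict[str, float]:
    """Return traffic shares that sum to 1.0 across healthy regions only."""
    total = sum(r.weight for r in regions if r.healthy)
    if total <= 0:
        raise RuntimeError("no healthy region available to absorb traffic")
    # Unhealthy regions get 0; healthy ones keep their relative proportions.
    return {r.name: (r.weight / total if r.healthy else 0.0) for r in regions}


regions = [
    Region("region-a", 0.5, healthy=True),
    Region("region-b", 0.3, healthy=False),  # simulated instability
    Region("region-c", 0.2, healthy=True),
]
shares = rebalance(regions)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The proportional split is only one possible policy. A production setup would also need to confirm that the surviving regions have enough spare capacity to take the extra load.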
Other organizations rely on active-passive models where secondary standby environments activate rapidly during outages. Large enterprises have also started adopting hybrid multi-cloud strategies involving combinations of AWS, Azure, and Google Cloud to reduce dependency on a single provider.

Recovery-as-Code Is Changing Disaster Recovery Planning

Disaster recovery itself has evolved significantly over the last few years. Earlier recovery frameworks depended heavily on manual restoration processes, isolated backups, and infrastructure rebuilding exercises that often stretched across several hours.

Modern recovery environments increasingly rely on software-driven replication and automated restoration systems. Infrastructure-as-Code frameworks such as Terraform and Pulumi now allow enterprises to recreate infrastructure environments programmatically. Platforms such as AWS Elastic Disaster Recovery and ControlMonkey are helping organizations replicate workloads, restore cloud configurations, and improve recovery consistency during failover scenarios. Enterprises increasingly design systems capable of functioning effectively even while failure conditions occur.

Why Data Availability Has Become as Critical as Infrastructure Availability

As service ecosystems become more dependent on real-time operational intelligence, enterprises are also discovering that uptime extends far beyond infrastructure resilience alone. Data availability now plays a key role in maintaining service continuity.

In asset-intensive industries, operational environments depend heavily on uninterrupted access to telemetry streams, maintenance histories, customer records, compliance data, and software supply chains. A ransomware incident or corrupted recovery environment can affect service operations as severely as infrastructure failure itself.
This explains why organizations are investing heavily in platforms such as Cohesity and Rubrik, which focus on rapid recovery, immutable backup environments, and zero-trust data resilience strategies. Similarly, JFrog has increasingly positioned software supply-chain availability as a critical reliability layer for enterprises managing continuous deployment environments.

Chaos Engineering Is Moving into the Mainstream

For years, organizations assumed failover systems would function correctly during outages simply because backup infrastructure existed architecturally. Recovery environments often failed under real-world pressure because teams had never tested them comprehensively.

Chaos engineering emerged as a direct response to that gap. Platforms such as Gremlin and LitmusChaos deliberately simulate disruption scenarios inside controlled environments. Teams intentionally interrupt APIs, overload infrastructure layers, disable databases, and simulate cloud-region failures to evaluate whether resilience mechanisms function correctly under operational stress.

Organizations operating large-scale digital infrastructure increasingly use controlled-failure testing to understand how systems behave during real outages rather than relying solely on theoretical resilience assumptions.

The Operational Disciplines Separating Mature Reliability Teams from Reactive Service Organizations

Organizations that consistently maintain high uptime rarely depend on infrastructure investment alone. Most high-performing service environments combine technology modernization with disciplined operational governance frameworks designed to reduce preventable disruption.

Error Budgets Are Forcing Teams to Balance Innovation with Stability

Modern Site Reliability Engineering (SRE) environments no longer chase unrealistic zero-downtime goals. Organizations define acceptable downtime thresholds and pause feature deployment if operational instability crosses predefined limits.
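The error-budget policy described above comes down to simple arithmetic. The sketch below assumes a 99.9% availability target and uses hypothetical function names; it computes the downtime the target tolerates over a window and decides whether feature deployments should pause.

```python
def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Total downtime the availability target tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = allowed_downtime_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def should_freeze_deploys(slo: float, window_days: int, downtime_minutes: float) -> bool:
    """Pause feature deployment once the budget is fully consumed."""
    return budget_remaining(slo, window_days, downtime_minutes) <= 0.0


# "Three nines" over a year is roughly 8.76 hours of tolerated downtime.
print(round(allowed_downtime_minutes(0.999, 365) / 60, 2))

# Over a 30-day window the same target allows about 43.2 minutes.
print(should_freeze_deploys(0.999, 30, downtime_minutes=30))  # budget left
print(should_freeze_deploys(0.999, 30, downtime_minutes=50))  # budget overspent
```

Teams often freeze earlier than full exhaustion, for example when only a small fraction of the budget remains; the zero threshold here is just the simplest choice.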
Progressive Deployment Models Are Reducing Large-Scale Service Failures

Many enterprises now use canary deployment strategies that release updates gradually across smaller user environments before full-scale deployment occurs. This allows organizations to isolate instability before broader infrastructure disruption affects customers.

Blameless Post-Mortems Are Improving Long-Term Operational Maturity

Several organizations have shifted away from punitive outage-review cultures because fear of blame delays escalation, which often worsens downtime impact. Blameless review frameworks encourage teams to identify missing safeguards and process weaknesses more transparently.

Change-Freeze Windows Are Becoming Standard Across High-Risk Operations

Industries operating under strict SLA commitments increasingly enforce no-change windows during high-volume transaction periods, financial closings, infrastructure migrations, or critical production cycles.

Incident Command Structures Are Accelerating Crisis Coordination

High-availability environments increasingly rely on predefined incident-response hierarchies involving technical leads, communication owners, escalation managers, and operational coordinators.

Enterprises that consistently maintain high uptime typically treat governance maturity as seriously as infrastructure resilience. Operational discipline often determines whether advanced technology investments actually deliver measurable reliability outcomes.

Technologies Driving Predictive SLA Management

The service industry is moving steadily toward operational environments where organizations can forecast SLA risk before customer disruption occurs. This transition is accelerating because enterprises now recognize that service continuity directly influences revenue stability, retention, and operational trust.
Telemetry Analytics Is Helping Enterprises Detect Early-Stage Operational Instability

Connected infrastructure environments continuously generate operational intelligence related to machine performance, infrastructure stress, transaction behavior, and service degradation patterns.

AI-Led Anomaly Detection Is Improving Failure Prediction Accuracy

Platforms such as Dynatrace, IBM Maximo Application Suite, and C3 AI now combine anomaly detection with machine-learning models capable of forecasting operational degradation across industrial systems.

SLA Risk Scoring Models Are Changing Operational Decision-Making

Solutions such as Sirion and Nobl9 increasingly combine telemetry analytics, infrastructure dependencies, incident history, and contractual thresholds to generate SLA breach probability scores. Predictive environments can now identify rising compliance risks one to two weeks before a potential SLA breach occurs.

Workforce Orchestration Systems Are Improving First-Time Resolution Rates

Modern field-service environments increasingly integrate AI-led dispatch intelligence with technician certification data, inventory systems, and asset history. This allows organizations to assign the most suitable technician with the right replacement components before service disruption expands further.

The broader transition toward predictive SLA intelligence reflects a larger shift across the service industry. Organizations are gradually moving away from response-driven operations toward environments capable of identifying operational instability before customers experience visible disruption.

The Future of Service Operations Will Depend on Prevention

The digital transformation of the service industry extends far beyond automation or cloud migration.
Organizations leading this transition increasingly combine connected telemetry ecosystems, AI-driven observability, predictive asset intelligence, resilient infrastructure architecture, workforce orchestration platforms, and operational governance frameworks into unified service environments designed around prevention rather than response. Historically, service organizations optimized for repair efficiency. The next generation of operational leaders is optimizing for disruption avoidance. Predictive intelligence, connected telemetry, and AI-led service orchestration are steadily becoming foundational requirements for enterprises operating large-scale asset-driven service ecosystems. Over the next few years, the competitive gap between service organizations will no longer depend solely on who resolves incidents faster. It will depend on which enterprises can predict operational instability earlier, coordinate response systems more intelligently, and prevent disruption before customers experience its impact. In industries where uptime increasingly shapes customer trust, contractual performance, and operational continuity simultaneously, prevention is steadily becoming the new benchmark for service excellence.

By Abhishek Sharma

AI Won't Keep You from Hitting the Scalability Wall

Using AI to build integrations? You might just be hitting the scalability wall faster. Discover why faster builds don't solve the long-term cost of ownership.

There's an idea making the rounds in B2B SaaS product and engineering meetings right now. It sounds reasonable. It feels optimistic. And it's leading companies straight into the same trap they've always fallen into, just at an accelerated rate.

The idea is that "We can use AI to build our integrations."

Two years ago, adding in-house dev for an ERP integration to the roadmap meant a three-month research-and-dev cycle. Today, the sentiment is often: "We can knock that out in a weekend." And in the early stages, it's often correct. Modern AI coding agents are remarkably good at generating boilerplate code, interpreting API documentation, and suggesting data mapping logic. AI can help you go from zero to integrated faster than ever before. That's completely true.

But a focus on speed-to-build hides a deeper issue. Every integration you ship is an asset, but it comes with a long-term maintenance commitment. That's right. AI-assisted custom integration builds still hit the same scalability wall that has frustrated B2B SaaS engineering teams for years. In many cases, these builds defer the pain. But since AI can encourage teams to say "yes" to more integration requests, they sometimes amplify it. As a result, your team may hit the wall sooner and harder.

AI, done right, builds integrations faster. It doesn't handle everything else that makes the integration run reliably at scale.

Everything's Great at the Beginning

When you use AI to build a custom integration, you're generally optimizing for the near term. You're cutting down the time it takes to write the initial code, map the first few fields, and get something working. Everything moves quickly, and it feels like a massive win. But the scalability wall doesn't show up now; it comes later and is composed of blocks that AI doesn't touch.
- API tracking – Third-party APIs are part of living systems. Their vendors deprecate endpoints, change rate limits, update authentication requirements, and release breaking changes with varying degrees of notice. Your AI coding agent helped you ship the integration. But it won't proactively monitor the Salesforce or NetSuite changelog for you, and it won't be on call when a "backward-compatible" update breaks something.
- The ownership gap – If an AI-assisted build handles 90% of the logic but hallucinates an edge case in a retry loop, your senior devs are the ones debugging when it fails in production (not the AI). AI accelerates the build, but the accountability remains with the team.
- Infrastructure overhead – Code is only one part of an integration. You still need to build and maintain everything around it: auth, logging, alerting, SOC 2-compliant data handling, and customer-facing configuration UIs. AI doesn't generate that operational layer.
- Customer requirements multiply – The second customer who wants your Salesforce integration doesn't want exactly what the first customer wanted. So you modify it. Then a third customer needs the original version, but with different field mappings. Now you have three versions of the same integration – each slightly different, each with its own maintenance obligation, none of them happy with anything less than individual attention. Multiply that pattern across your catalog, and you understand how teams can end up maintaining twenty-five versions of a single integration, with all the pain that entails.

None of these are new problems. They've existed as long as B2B SaaS teams have been building integrations in-house. AI is simply making it easier to get to these problems faster.

Why AI Feels Like the Answer

When AI coding tools emerged as serious productivity multipliers, it was natural to look at the integration backlog problem and see a solution. If the bottleneck is build speed, and AI makes you faster, the math seems straightforward.
If the bottleneck is build speed, and AI makes you faster, the math seems straightforward.

But the bottleneck was never the build speed. The bottleneck was (and is) the ongoing cost of ownership. It's the time your devs lose every time a third-party API changes. It's the afternoon that disappears when an integration fails, and your customers know before you do. It's the engineering lead explaining, again, that the roadmap has slipped because, well…integrations. It's the growing stack of tech debt that you keep working around.

AI lowers the barrier to entry. True. But it also lowers the barrier to overcommitment.

When it becomes faster and easier to build integrations, teams build more of them. What starts as a productivity boost turns against you. Instead of five integrations, you build fifteen. Instead of "not yet," you say "we can probably do that quickly." Before you know it, you've accepted the maintenance commitment of all those integrations.

The scalability wall exists because the relationship between the number of custom integrations and the dev resources required to maintain them is essentially linear. If it takes one engineer to maintain five integrations, you eventually reach a point where your team is no longer building your core product. Instead, it has morphed into the integration maintenance department.

AI-assisted development shifts the starting line, but it doesn't change the slope. By building faster without a stable, managed foundation, you are simply accelerating your arrival at the point where maintenance debt overwhelms new feature development.

What Happens as You Attempt to Scale

The scalability wall doesn't arrive with a big announcement. In most cases, it grows over time as the following occurs:

- Technical debt accrues under deadline pressure – When teams race to ship, they take shortcuts. Values are hardcoded that should be configurable. Error handling is skipped. Testing is abbreviated.
Code gets tightly coupled in ways that make future changes expensive. The worst part is that this debt doesn't disappear when an integration deploys. Instead, every dev who touches that code later inherits the results of less-than-optimal decisions made in the moment.
- Roadmaps are held hostage – Then you try to maintain a custom integration catalog while building something new. Teams that were supposed to be shipping product features find themselves in firefighting mode, chasing failures, handling escalations, and applying patches to code that was never meant to live this long. Some teams report spending 70% or more of their integration-related engineering time on maintenance, monitoring, and debugging rather than building new value. One organization found half of its R&D team dedicated to maintaining hundreds of integrations. That's not an integration strategy. That's an integration crisis in the offing.
- Zombie integrations proliferate – Some integration requests are legitimate but point to integrations that shouldn't be built and maintained without a deliberate strategy. When AI makes it easy to say yes, teams say yes, and ship integrations that live in production forever, draining engineering bandwidth. The customer who requested it may have churned. The use case may have changed. But the zombie is still there, still running, still using resources.

The Right Question to Ask Before You Build

The question isn't "Can AI help us build this faster?"

The question is: "Should we own the infrastructure required to keep this alive for the next five years?"

Because that's what you're agreeing to. Not just the initial capital investment, but the day-to-day maintenance, dealing with customer edge cases, implementing security patches, monitoring everything, and handling all the tickets. Every custom in-house integration your team ships is a product you now own – with all the ongoing obligations that come with it.
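The roughly linear relationship between integration count and maintenance effort can be turned into a back-of-the-envelope estimate before committing to a build. The sketch below uses the illustrative ratio of one engineer per five integrations; the team size and growth rates are made-up inputs, not measurements, and both function names are hypothetical.

```python
import math
from typing import Optional


def maintainers_needed(integrations: int, per_engineer: int = 5) -> int:
    """Engineers required just to keep the existing integrations running."""
    return math.ceil(integrations / per_engineer)


def years_until_wall(team_size: int, start: int, added_per_year: int,
                     per_engineer: int = 5, horizon: int = 5) -> Optional[int]:
    """First year in which maintenance alone consumes the whole team."""
    for year in range(horizon + 1):
        count = start + added_per_year * year
        if maintainers_needed(count, per_engineer) >= team_size:
            return year
    return None  # the wall is not reached within the horizon


# Same 10-person team, same starting catalog, different build velocity.
print(years_until_wall(team_size=10, start=5, added_per_year=5))   # slower pace
print(years_until_wall(team_size=10, start=5, added_per_year=15))  # faster pace
```

With these assumed inputs, tripling the build rate moves the wall inside the five-year horizon, which is the "same slope, earlier arrival" effect in numeric form.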
If the answer to the second question above is, "We're not sure we can sustain this at scale," then the solution requires a different architectural approach.

Generating Deterministic Results

To understand what a different architecture looks like, let's drill into what AI is – and what an integration platform is.

AI is generative. It creates a solution or provides an answer at a specific moment, shaped by the context it's given. That makes it powerful for accelerating builds. It also makes it inherently variable. And non-deterministic outputs in business-critical data flows often introduce risks that careful, platform-tested infrastructure doesn't have.

An embedded iPaaS is deterministic. It provides a standardized infrastructure designed to handle the full lifecycle of an integration – not just the initial build, but auth rotation, retry logic, customer-specific configs, monitoring, logging, and deployment at scale.

The most successful teams use both. AI for efficiency: writing custom logic, generating complex data mappings, and accelerating new builds. And the integration platform for scalability: handling the operational layer that the AI should not reinvent every time.

By wrapping integration logic in an integration platform, you decouple the build from the maintenance. When a third-party API changes, you don't have to hunt through dozens of AI-generated scripts – you update the relevant component, and it propagates across your ecosystem.

What It Looks Like to Escape the Wall

The teams that get over (or around) the scalability wall don't do it by building faster. They do it by changing what they're building on. An embedded iPaaS handles the infrastructure that breaks at scale, so your team can focus on integration logic rather than plumbing.

- The platform does the heavy lifting – Auth flows, retry logic, webhooks, auto-scaling compute, logging, config wizards, SOC 2-compliant data handling – all of it is provided and maintained by the platform.
Your devs don't build it. They don't maintain it. That alone can reduce the code your team writes by 80% or more compared to in-house builds. What remains is the business logic – the part that actually delivers value to customers.
- Build once, deploy to many – When you productize integrations on a standard platform, you're not building a new integration for every customer. You're deploying a configurable integration that adapts to each customer's credentials, endpoints, and data mapping. One integration serves dozens (or hundreds) of customers. Updates apply across the board. That's a fundamentally different approach than maintaining twenty-five customer-specific variations in parallel.
- Non-engineers can own more of the lifecycle – Deployment, configuration, and first-level support don't need to involve engineering when customers and customer-facing teams have the right tools. Support staff can investigate issues without pulling an engineer from the roadmap. Customers can activate and configure integrations themselves from an embedded marketplace. Engineering stays focused on building, not on the operational overhead.
- Gain visibility across your entire catalog – When all integrations run on a single platform, monitoring and alerting work across all of them at once. You identify issues before customers do. You troubleshoot with full log access. You have a level of visibility into what is happening that's uncommon with custom in-house development.

AI Still Has a Role, But It's Bounded

None of this is an argument against AI. It's an argument for using AI where it's genuinely useful (and an argument against using it to solve problems it wasn't designed to solve).

Used well, AI can accelerate the creation of code running on a scalable foundation, rather than accelerating the accrual of tech debt. When you use AI inside a platform designed for scale, it works in your favor.
When you use AI to build faster on a custom architecture that doesn't scale, you hit the wall sooner, but with more integrations already built.

Build a Sustainable Integration Approach

If you're evaluating how AI fits into your integration approach, here are the essentials:

- Adopt a tiered model – Not every integration deserves the same treatment. Productize high-volume integrations on the platform to make them reusable and maintainable. Build to bespoke requirements where the contract value justifies the build. Empower customers to create their own workflows for the long tail of idiosyncratic requests that no integration catalog can anticipate. And employ in-app agentic functionality as needed to make your workflows the structured, deterministic tools that AI agents can discover and invoke.
- Use AI for development velocity – AI excels at accelerating new builds and helping developers handle complex logic. Let the platform own everything else (auth, retries, logging, alerting, deployment, and the customer configuration experience). Don't ask AI to recreate that operational layer for every new integration.
- Track the metrics that matter after Day 1 – Build speed matters. But so do maintenance hours, average activation time, support ticket volume, and the amount of engineering time freed up for core product work. Those last two numbers are where a sustainable integration strategy shows up in the data.
- Audit regularly – As your catalog grows, so does the population of integrations that may no longer justify their maintenance burden. Retire integrations before they become a drain on the team.

Faster Doesn't Equate to Scalable

The "we can just use AI to build this faster" idea comes from a real place. Integration backlogs, customer pressure, and competitive urgency are all part of it. And AI absolutely speeds up individual builds. In the short term, that matters.

But velocity without a stable foundation means you hit the scalability wall faster.
Velocity doesn't make third-party APIs more stable. It doesn't reduce the maintenance burden as your catalog grows. And it doesn't change the question that every integration request brings your team: "Are we prepared to own this forever?"

By Bru Woodring

From Bash Script to Operational Triage: What Eight Months of Kubernetes Debugging Taught Me

In November 2025, I published a Bash script that analyzed Kubernetes clusters in about 60 seconds. It generated HTML reports and surfaced crash loops, orphaned resources, and other operational issues that were easy to overlook.

The most interesting part wasn't the script; it was what happened after people started running it. Many told me they found problems they hadn't known existed.

Looking back, the Bash script wasn't really solving debugging. It was solving prioritization. I just didn't have the vocabulary for it yet.

Over the next eight months, that script became four different experiments, then a collection of small scanners, and eventually OpsCart Watcher, the open-source operational triage dashboard for Kubernetes shown in this article.

This article is about what the journey taught me, and what I think is still missing from most Kubernetes environments.

[Video: OpsCart Watcher, operational triage for Kubernetes (6 minutes)]

The Problem the Script Revealed

The script did one thing well: it looked at an entire cluster and listed what was broken. Engineers who ran it kept telling me the same thing: "I had no idea this was there."

That response was the important signal. These engineers had Grafana, Prometheus, and kubectl. Visibility was not their problem. The problem was that nothing told them to look at this specific namespace, this specific pod, this specific storage volume before it became an incident.

Consider a pod in CrashLoopBackOff for 19 days with 5,000+ restarts. To a metrics dashboard, that deployment looks healthy: replica count satisfied, a pod exists in Running state between crashes, CPU and memory flat because the container barely lives long enough to consume anything. The dashboard is answering the question it was built to answer (is the cluster meeting its SLOs?), and the answer is yes.

The question nobody built tooling for: what deserves attention right now?
| Layer | What It Answers | Tools |
| --- | --- | --- |
| Metrics | Is the cluster meeting its SLOs? | Prometheus, Grafana, Datadog |
| Per-resource state | What is this specific pod doing? | kubectl, k9s, Lens |
| Operational triage | What deserves attention right now? | Prioritizing operational work across cluster state |

What Triage Looks Like in Practice

[Figure: Overview page showing Incident Score 41/100, KPI bar, Top 5, and War Room panel]

The first time I ran the rebuilt dashboard against a cluster with real failures, the top of the screen didn't show me a CrashLoopBackOff pod. It showed me four CrashLoopBackOff pods spread across three namespaces, collapsed into a single operational problem:

    1. 4 pods crash-looping    CRITICAL
       payments/fraud-detection (1810 restarts)
       → kubectl logs fraud-detection-... -n payments --previous

That collapsing is the entire idea. Instead of inspecting every deployment individually, I was looking at a ranked list of operational problems, each with a severity, a location, and the exact kubectl command to start investigating.

The full output for this environment:

    Incident Score: 41/100 (Degraded)

    Top 5 Things to Fix:
    1. 4 pods crash-looping                 CRITICAL   4 pods
    2. 3 image_pull_backoff issues          CRITICAL   3 items
    3. 1 privileged_container issue         CRITICAL   1 item
    4. 1 namespace missing NetworkPolicy    HIGH       1 ns
    5. 3 orphaned PVCs wasting money        MEDIUM     80 GB

None of these had triggered an alert. All were present and accumulating before the scan.

The Incident Score, a composite 0–100 across reliability, security, and waste, exists for one reason. Engineers fix incidents. Managers remember numbers. "We moved the Incident Score from 41 to 67" is a sentence that sticks. The crash loops and NetworkPolicies are the work behind it.

The Step After Detection

Finding problems was never the hard part. Knowing where to begin was. The most common feedback on the original Bash script was some version of: "I found the problem, but I still didn't know what to do next."
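The collapsing step that turns individual findings into a ranked list can be sketched in a few lines: group raw findings by type, then order the groups by severity and size. This is a minimal illustration, not OpsCart Watcher's actual implementation; the `Finding` type, the severity ordering, the namespaces, and the sample data are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed ordering: a lower rank means more urgent.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass(frozen=True)
class Finding:
    kind: str        # e.g. "crash_loop" (hypothetical label)
    severity: str    # one of the SEVERITY_RANK keys
    namespace: str
    name: str


def triage(findings: list[Finding], top_n: int = 5) -> list[tuple[str, str, int]]:
    """Collapse findings into (kind, severity, count) groups, most urgent first."""
    groups: dict[tuple[str, str], list[Finding]] = defaultdict(list)
    for finding in findings:
        groups[(finding.kind, finding.severity)].append(finding)
    ranked = sorted(
        groups.items(),
        # Severity first, then larger groups, then name for a stable order.
        key=lambda item: (SEVERITY_RANK[item[0][1]], -len(item[1]), item[0][0]),
    )
    return [(kind, sev, len(items)) for (kind, sev), items in ranked[:top_n]]


findings = (
    [Finding("crash_loop", "CRITICAL", "payments", f"pod-{i}") for i in range(4)]
    + [Finding("image_pull_backoff", "CRITICAL", "web", f"pod-{i}") for i in range(3)]
    + [Finding("privileged_container", "CRITICAL", "ops", "agent")]
    + [Finding("missing_network_policy", "HIGH", "web", "web")]
    + [Finding("orphaned_pvc", "MEDIUM", "data", f"pvc-{i}") for i in range(3)]
)
top = triage(findings)
for position, (kind, severity, count) in enumerate(top, start=1):
    print(f"{position}. {count} x {kind:<24} {severity}")
```

Sorting on a tuple keeps the ranking deterministic, so the same cluster state always produces the same list.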
In March, I wrote about finding a container with 24,069 restarts that had been accumulating undetected. Finding it took sixty seconds. The next hour was the actual work: what do I run first? Is this configuration or code? Is it customer-facing? The investigation page is my answer to that hour.

Figure: Investigation page (OpsCart Assessment, Evidence, Recommended Investigation)

One click from any triage finding opens a dedicated investigation view:

Plain Text
    OpsCart Assessment
    This workload has restarted 1810 times over 6 days. The restart rate
    appears stable, suggesting a deterministic configuration or application
    failure rather than an intermittent infrastructure issue. No referenced
    ConfigMaps or Secrets were detected in the pod spec - missing
    configuration is unlikely to be the root cause. Investigation should
    begin with previous container logs. Estimated time: 5–10 minutes.

    Evidence
    [1810 Restarts] [CrashLoopBackOff] [6d] [Deployment/fraud-detection]

    Recommended Investigation
    HIGH CONFIDENCE   Check previous container logs
    MEDIUM            Verify ConfigMaps and Secrets exist
    LOW               Check for OOMKill in events

The assessment is rules-based, with no AI. It reads restart count, failure pattern (stable vs accelerating), and referenced configuration objects, then produces a deterministic, auditable summary. The confidence levels reflect how a senior engineer actually reasons: previous logs are almost always the right first move for a crash loop; OOMKill is worth checking but less likely. This is the part kubectl doesn't give you. Neither does Lens, k9s, or Headlamp.

From "What Is Broken?" to "What Changed?"

The biggest architectural change came when the dashboard gained memory. The first version of the tool answered: "what is broken?" The current version, backed by a small embedded database recording every scan, answers "what changed?" That sounds like a minor distinction. Operationally, it changes everything.
An incident that has existed for three days deserves different attention than one that appeared five minutes ago. A cluster whose Incident Score dropped eight points overnight is telling you something that no single scan can.

Figure: War Room (critical issues with visual differentiation per type)

Every KPI now carries a trend arrow (critical issues up three since the last scan, waste down one), and the Incident Score shows a seven-point sparkline. Each incident is tracked with first-seen and last-seen timestamps and an active/resolved status, so "CrashLoopBackOff: first detected 6 days ago, still active" replaces "CrashLoopBackOff." Operational memory changed the tool from a scanner into something that remembers the history of a cluster.

What This Is Not

The triage pattern does not answer when an issue started at the metrics level, why an application is slow, or whether last Tuesday's deployment caused a regression. Prometheus, APM tooling, and deployment audit logs remain the right tools for those questions. The triage layer is not a replacement for observability. It is the layer that tells you which questions to ask of your observability stack.

The Biggest Lesson

When I started, I thought Kubernetes debugging was about collecting more information. It wasn't. Kubernetes already exposes almost everything an operator needs through its API. The difficult part is deciding what deserves attention first. Over eight months, I found myself spending less time searching for failures and more time ranking them. That is ultimately what OpsCart became: not another dashboard, but a prioritization engine for cluster operations.

Why Open Source

I considered keeping the dashboard private. Instead, I open-sourced it because operational patterns only become useful when they're tested across different clusters. Every environment fails differently, and I wanted the prioritization model to evolve from real-world feedback rather than a single infrastructure.
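The operational memory described earlier (first-seen and last-seen timestamps plus an active/resolved status per incident) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the article says the real tool uses a small embedded database, whereas this sketch keeps everything in a `HashMap`, and the `IncidentMemory` class, its method names, and the incident key format are hypothetical.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the embedded database the article mentions.
final class IncidentMemory {
    static final class Incident {
        final Instant firstSeen;
        Instant lastSeen;
        boolean active = true;

        Incident(Instant seenAt) {
            firstSeen = seenAt;
            lastSeen = seenAt;
        }
    }

    private final Map<String, Incident> incidents = new HashMap<>();

    // Record one scan: keys present are opened or refreshed, keys absent are resolved.
    void recordScan(Instant scanTime, Set<String> observedKeys) {
        for (String key : observedKeys) {
            Incident inc = incidents.get(key);
            if (inc == null || !inc.active) {
                // Assumption: a recurrence after resolution starts a new incident.
                incidents.put(key, new Incident(scanTime));
            } else {
                inc.lastSeen = scanTime;
            }
        }
        for (Map.Entry<String, Incident> e : incidents.entrySet()) {
            if (!observedKeys.contains(e.getKey())) {
                e.getValue().active = false;
            }
        }
    }

    Incident get(String key) {
        return incidents.get(key);
    }

    // Comparing this count between two scans gives the direction of a trend arrow.
    long activeCount() {
        return incidents.values().stream().filter(i -> i.active).count();
    }

    public static void main(String[] args) {
        IncidentMemory memory = new IncidentMemory();
        Instant first = Instant.parse("2025-01-01T00:00:00Z");
        memory.recordScan(first, Set.of("payments/fraud-detection:CrashLoopBackOff"));
        memory.recordScan(first.plusSeconds(3600), Set.of());
        System.out.println("active incidents: " + memory.activeCount());
    }
}
```

The design choice worth noting is that a single scan never needs to know about history: it only reports what it sees now, and the memory layer derives "new", "still active", and "resolved" from the difference between scans.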
The Remaining Gap

The conclusion from my March article is still true: the question worth asking of your environment is not whether these conditions exist (they almost certainly do) but whether your current observability layer would surface them before they become incident preconditions.

Eight months of building has only made that conclusion more specific. The gap is not data. The gap is attention: knowing which five things, out of hundreds of resources, deserve a human's time right now. Eight months ago I thought I was building a better debugging script. I wasn't. I was building something that helps operators decide where to spend the next ten minutes.

About the environment: The scenarios shown in this article (CrashLoopBackOff pods, orphaned PVCs, missing NetworkPolicies, privileged containers) are representative of what OpsCart finds on real production clusters. The environment shown is a dedicated demonstration cluster configured with realistic failure scenarios. No production data was used.

About the tool: OpsCart Watcher is open-source at github.com/opscart/opscart-k8s-watcher. It deploys as a single read-only container:

Shell
    kubectl apply -f https://raw.githubusercontent.com/opscart/opscart-k8s-watcher/main/deploy/dashboard.yaml
    kubectl port-forward -n opscart-system svc/opscart-watcher 8080:80

By Shamsher Khan

CORE

Add Observability to Your React Native Application in 5 Minutes

In modern application development, feature flags are the guardrails that keep experiments controlled and rollbacks safe when conditions shift. If feature flags act as the guardrails, observability provides the visibility: the headlights (traces), mirrors (logs), and dashboard instruments (metrics) that reveal what’s happening in the environment and how well a feature is performing.

Together, feature flags and observability unlock powerful insights by correlating code changes with real-time system behavior. This combination reduces time-to-diagnosis and builds greater confidence when rolling out new features.

In this post, we’ll walk through how to add observability to a React Native application using LaunchDarkly’s observability SDK. To demonstrate the process, we’ll build on the PlusOne app, a simple counter app that includes increment (+1), reset, and error-triggering buttons. This lightweight demo provides a clean foundation to showcase how logs, traces, and errors can flow into LaunchDarkly for monitoring and debugging.

Prerequisites

- LaunchDarkly account. Sign up for a free one here.
- Visual Studio or another code editor of choice.

All code from this tutorial can be found on GitHub.

Setting Up Your Environment

Before running a React Native app, make sure your development environment is set up correctly. You can find the full setup instructions for both Android and iOS here. In this tutorial, we'll be running iOS, but keep in mind that Expo Orbit, the platform we'll be using to run our iOS simulator, requires both Xcode and Android Studio to be installed.

After going through the instructions, you should have the following installed:

- Node.js (preferably via nvm)
- Watchman for file monitoring
- JDK (the Azul Zulu distribution)
- Android Studio. Don’t forget to set your ANDROID_HOME environment variable.
- Xcode for the iOS simulator
- CocoaPods for iOS dependency management
- Expo Orbit for running Expo apps on Android or iOS

If you're using Android, don't forget to add your environment variables to your bash or zsh profile.

Shell
    export ANDROID_HOME=$HOME/Library/Android/sdk
    export PATH=$PATH:$ANDROID_HOME/emulator
    export PATH=$PATH:$ANDROID_HOME/platform-tools

Starting Up the PlusOne App

To get started, let’s clone the repo for the PlusOne app and run npm install to ensure the proper dependencies are present in our node_modules folder.

Clone the repo.

Shell
    git clone https://github.com/arober39/PlusOne

Install dependencies using npm.

Shell
    cd PlusOne
    npm install

We’ll also need to run both the prebuild command to generate the native iOS project files and the expo run command to run the iOS simulator.

Prebuild for iOS.

Shell
    npx expo prebuild

Run expo app.

Shell
    npx expo run:ios

Now we can view the iOS app in the iPhone simulator using npm.

Shell
    # iOS
    npm run ios
    # Android
    npm run android

The app should look something like this:

Feel free to interact with the app to ensure all is working as expected. As you can see in the code, we have three buttons: one that adds one to the displayed number, one to bring the count back to zero, and an intentional Error button to test error monitoring within the LaunchDarkly UI.
JavaScript
    // app/index.tsx
    import { useState } from "react";
    import { StyleSheet, Text, TouchableOpacity, View } from "react-native";

    export default function Index() {
      const [count, setCount] = useState(0);

      const handleReset = () => setCount(0);
      const handleIncrement = () => setCount((prev) => prev + 1);

      // Throws and immediately catches an error so the Error button has something to report.
      const triggerRecordedError = () => {
        try {
          throw new Error("Simulated controlled error from Plus One app");
        } catch (e) {
          alert("You intentionally threw an error");
        }
      };

      return (
        <View style={styles.container}>
          <View style={styles.header}>
            <Text style={styles.headerText}>Plus One</Text>
          </View>
          <View style={styles.counterWrapper}>
            <Text style={styles.counterText}>{count}</Text>
          </View>
          <View style={styles.actionsRow}>
            <ButtonBox label="Reset" onPress={handleReset} />
            <ButtonBox label="+1" onPress={handleIncrement} />
            <ButtonBox label="Error" onPress={triggerRecordedError} />
          </View>
        </View>
      );
    }

    type ButtonBoxProps = {
      label: string;
      onPress: () => void;
    };

    function ButtonBox({ label, onPress }: ButtonBoxProps) {
      return (
        <TouchableOpacity onPress={onPress} style={styles.button} activeOpacity={0.8}>
          <Text style={styles.buttonText}>{label}</Text>
        </TouchableOpacity>
      );
    }

    /* The rest of the application code */

Now that we have verified a working app, we can add observability support by installing the observability React Native SDK.

Install LaunchDarkly SDK dependencies.

Shell
    npm install @launchdarkly/react-native-client-sdk
    npm install @launchdarkly/observability-react-native

Next, you’ll need to initialize the React Native LD client in the app/_layout file. Replace the contents of the layout file with the following code.
JavaScript
    // app/_layout.tsx
    import { Observability } from '@launchdarkly/observability-react-native';
    import {
      AutoEnvAttributes,
      LDOptions,
      LDProvider,
      ReactNativeLDClient,
    } from '@launchdarkly/react-native-client-sdk';
    import { Stack } from 'expo-router';
    import { useEffect, useState } from 'react';

    const options: LDOptions = {
      applicationInfo: {
        id: 'Plus-One',
        name: 'Sample Application',
        version: '1.0.0',
        versionName: 'v1',
      },
      debug: true,
      plugins: [
        new Observability({
          serviceName: 'my-react-native-app',
          serviceVersion: '1.0.0',
        }),
      ],
    };

    const userContext = { kind: 'user', key: 'test-hello' };

    export default function RootLayout() {
      const [client, setClient] = useState<ReactNativeLDClient | null>(null);

      useEffect(() => {
        // Initialize client
        const featureClient = new ReactNativeLDClient(
          'mob-abc123',
          AutoEnvAttributes.Enabled,
          options,
        );
        featureClient.identify(userContext).catch((e: any) => console.log(e));
        setClient(featureClient);

        // Cleanup function that runs when component unmounts
        return () => {
          featureClient.close();
        };
      }, []);

      if (!client) {
        return null;
      }

      return (
        <LDProvider client={client}>
          <Stack />
        </LDProvider>
      );
    }

This code does four things:

- Imports the Observability SDK as well as a few LD libraries to add options and attributes to the LD client.
- Initializes the SDK and plugin options.
- Defines the user context.
- Initializes the client.

Now that you have defined your LD React Native client, you can implement different observability methods within your application logic. We can do this by importing the LDObserve library in the app/index.tsx file, where the button handlers live.

JavaScript
    import { LDObserve } from '@launchdarkly/observability-react-native';

Then, add the recordError() method within the triggerRecordedError function inside the app/index.tsx file. This will allow for error messages to be sent back to the LD UI.
JavaScript
    const triggerRecordedError = () => {
      try {
        throw new Error("Simulated controlled error from Plus One app");
      } catch (e) {
        // Send the caught error to LaunchDarkly with a custom attribute.
        LDObserve.recordError(e as Error, { feature: "test-button" });
        alert("You intentionally threw an error");
      }
    };

Before you can receive data in the LD UI, you’ll need to add your mobile key to the React Native LD client. You can find the key by logging in to the LD UI.

Once logged in, click the settings button at the bottom left. Navigate to the Projects page and click Create to create a new project. Define the new project and click Create Project. Then, define the environment where you would like your data to be sent.

Now, grab the mobile key by pressing the three dots for the environment and selecting the mobile key, which will copy the key to your clipboard. Then, add it to the app/_layout file.

JavaScript
    const featureClient = new ReactNativeLDClient(
      'mob-abc123', // replace with your mobile key
      AutoEnvAttributes.Enabled,
      options,
    );

Finally, you can generate data by interacting with your app in the iOS app simulator. Feel free to restart the app to ensure data is displaying in real time.

Shell
    npx expo run:ios

Once you navigate back to the LD UI, you should be able to see the logs, traces, and errors under the Monitor section.

[Screenshots: Logs, Traces, Errors]

Conclusion

In just a few minutes, we’ve taken the PlusOne React Native app from a simple counter to a fully observable application connected to LaunchDarkly. By setting up the SDK, initializing observability plugins, and recording errors, we now have a live feedback loop where application behavior is visible in the LaunchDarkly UI. This makes it far easier to diagnose issues, validate feature flag rollouts, and ensure smooth user experiences.

Next Steps

Looking ahead, there are many ways to expand on what we’ve built by including features like recording custom metrics and session replay, which provide even deeper insights into app behavior.
By integrating observability at the foundation of your React Native projects, you equip your team with the clarity needed to debug faster, ship features more confidently, and deliver reliable experiences to your users. You can also read this article to learn more about observability and guarded releases.

By Alexis Roberson

Why AI-Generated Code Is Making Regression Testing More Important, Not Less

There is a widespread assumption circulating in engineering teams right now that goes something like this: if AI can write code faster, it probably makes testing less of a bottleneck too. The logic seems reasonable on the surface. Faster code, faster tests, faster everything. This assumption is wrong, and teams that act on it are going to find out the hard way. AI-generated code does not reduce the need for regression testing. It amplifies it. And the teams that understand this early will have a significant quality advantage over those that do not. The Fundamental Misunderstanding When developers use AI coding assistants to generate functions, services, or entire modules, they are not producing code that has been verified against the real behavior of their system. They are producing code that is syntactically correct and structurally plausible, written by a model that has no knowledge of how their specific application actually runs in production. This is a critically important distinction. A human developer who has worked on a codebase for months carries implicit knowledge about which edge cases matter, which downstream services are flaky, and which data patterns appear in production that were never anticipated in the original requirements. An AI model has none of this context. It produces code that looks right and often is right for the happy path, but it has no way of knowing what the code needs to handle in the real world. The result is a class of defects that regression testing is uniquely positioned to catch: behaviors that work in isolation but break in the context of the full system. The Velocity Trap Here is where teams get into trouble. AI coding tools are genuinely fast. Developers using them can produce working code at a rate that was not possible before, and the productivity gains are real. But velocity without verification is just a faster path to production failures. The pattern plays out predictably. 
A team adopts AI coding assistance, development speed increases, the engineering leadership is happy, and everyone agrees to keep moving fast. What nobody adjusts is the regression testing strategy. The test suite that was sized for the previous pace of development is now covering a larger surface area of code, generated at higher volume, by a process that has no awareness of production context. Coverage gaps compound quietly. Nobody sees them until something breaks in production in a way that takes two days to trace back to a function that an AI wrote last sprint and nobody fully read. What AI-Generated Code Actually Gets Wrong The failures that emerge from inadequate regression coverage of AI-generated code tend to cluster in specific areas. Integration points are the most common failure zone. AI generates code based on interfaces and contracts. It looks at API signatures, function definitions, and data schemas. What it cannot see is how those contracts actually behave when real traffic flows through them. Consider a realistic scenario: an AI-generated service calls a downstream payment processor using the documented API specification. The code is technically correct. But the payment processor returns a slightly different response shape when a transaction is declined due to insufficient funds versus when it is declined due to a card expiry. The specification documents neither distinction. The AI has no way to know they exist. A regression suite built from real production traffic would catch this within the first test run. A regression suite built from the same specification the AI used to write the code will not catch it until a customer sees a wrong error message in production. Mock drift compounds the problem. When tests for AI-generated code are written using mocked dependencies, those mocks represent what the developer or AI thought the dependency would do. Over time, the real dependency changes and the mocks do not. 
Tests keep passing, the real behavior keeps drifting, and the regression suite provides false confidence rather than real coverage. AI-generated code optimizes for the stated requirement. It handles the case described in the prompt competently. It does not handle the cases that were not in the prompt: the empty array that should return a specific error, the timestamp that crosses a timezone boundary, the concurrent request that triggers a race condition. These are edge cases that only emerge from real usage patterns, and they are precisely what a regression suite built from real traffic catches where tests written from requirements do not. The Regression Testing Response Understanding these failure modes points directly to what needs to change in regression testing strategy when AI-generated code becomes part of the development process. Test generation needs to be grounded in real behavior, not assumed behavior. The traditional model of writing tests based on requirements becomes increasingly insufficient when the code being tested was generated by a model that had access only to those same requirements. The regression suite ends up testing exactly what the AI thought the code should do. Tests need to be grounded in what the system actually does when real requests flow through it. Integration test coverage becomes more important than unit test coverage. AI-generated code can usually pass unit tests because it generates syntactically correct implementations of isolated functions. The failures emerge at integration points. Regression testing that focuses on the integration layer, verifying that services interact correctly under realistic conditions, catches the class of failures that AI-generated code is most likely to introduce. Regression coverage should update continuously rather than incrementally. The pace of development with AI assistance creates a situation where code is being added to the codebase faster than manual test authoring can keep up. 
If the regression suite is maintained manually, it will always be behind. Coverage needs to grow with the codebase automatically, derived from real usage rather than added by developers who are already stretched by higher output demands. Production behavior should feed back into test validation. Closing the loop between how the system behaves in production and what the regression suite is testing is one of the most important shifts a team can make. When tests are derived from actual production traffic rather than written specifications, the mock drift problem largely disappears because the tests reflect what services actually do, not what developers assumed they would do. The Counter-Intuitive Conclusion There is a temptation to see AI-generated code and automated testing as solving the same problem from different angles. If AI can generate both the code and the tests, the reasoning goes, maybe the coverage problem solves itself. It does not. An AI that generates code and then generates tests for that code is essentially testing its own assumptions about how the code should behave. It will consistently produce tests that pass against the code it wrote, and those tests will systematically miss the gap between what the AI thought the code should do and what the system actually needs to do under production conditions. The gap between AI intent and production reality is exactly where regression testing has always been most valuable. AI-generated code makes that gap wider, not narrower, because the code is being written by something with no production experience at all. The teams that treat AI coding assistance as a reason to invest less in regression testing will eventually face production incidents that trace directly to this decision. 
The teams that treat it as a reason to invest more, particularly in coverage grounded in real system behavior rather than written specifications, will find that AI assistance genuinely accelerates development without accumulating the hidden quality debt that comes with uncovered integration failures. The Bottom Line Regression testing was never just a safety net. It is the mechanism by which a team validates that their understanding of the system matches how the system actually behaves. When AI is generating the code, that validation matters more than ever, because the code is now written by something that has never seen your system run. Invest accordingly.
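The mock drift problem and the payment-processor example discussed above can be made concrete with a small Java sketch. It compares the structure of a mocked response against a response recorded from real traffic and reports fields the mock does not know about. Everything here is an assumption for illustration: the `MockDriftCheck` class, the payload field names, and the idea of representing responses as nested maps are not taken from any specific tool.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: flags structural drift between a mocked response and a
// response recorded from real traffic. Names and payloads are hypothetical.
final class MockDriftCheck {
    // Flatten a nested map into "path:Type" entries, e.g. "error.code:String".
    static Set<String> shape(Map<String, ?> payload) {
        Set<String> out = new TreeSet<>();
        collect("", payload, out);
        return out;
    }

    private static void collect(String prefix, Map<String, ?> node, Set<String> out) {
        for (Map.Entry<String, ?> e : node.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map<?, ?> child) {
                @SuppressWarnings("unchecked")
                Map<String, ?> typed = (Map<String, ?>) child;
                collect(path, typed, out);
            } else {
                String type = value == null ? "null" : value.getClass().getSimpleName();
                out.add(path + ":" + type);
            }
        }
    }

    // Entries present in the recorded response but absent from the mock.
    static Set<String> missingFromMock(Map<String, ?> mock, Map<String, ?> recorded) {
        Set<String> diff = new TreeSet<>(shape(recorded));
        diff.removeAll(shape(mock));
        return diff;
    }

    public static void main(String[] args) {
        // The mock was written from the specification: one generic decline shape.
        Map<String, Object> mock = Map.of(
                "status", "declined",
                "error", Map.of("code", "DECLINED"));
        // The recorded response carries a field the specification never mentioned.
        Map<String, Object> recorded = Map.of(
                "status", "declined",
                "error", Map.of("code", "DECLINED", "reason", "insufficient_funds"));
        System.out.println("Drift: " + missingFromMock(mock, recorded));
    }
}
```

A check like this does not replace traffic-derived tests, but running it whenever new recordings arrive is one inexpensive way to notice that a mock has stopped describing the dependency it stands in for.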

By Sancharini Panda

A Low-Latency Routing Pattern for Multiple Small Language Models

A multi-SLM platform creates value only when specialization does not introduce a new latency tier. Small language models are inexpensive enough to dedicate to focused work such as extraction, code handling, safety filtering, or short-form reasoning, but that advantage disappears if model selection itself becomes expensive. Research on LLM routing shows that query difficulty varies enough for model choice to materially affect efficiency and quality, and modern serving stacks expose enough control over routing, batching, and cache locality to turn that insight into an operational design rather than an academic one. In practice, the routing layer has to behave like a tiny data-plane decision engine, not like another inference hop.

Why Multiple SLMs Need Routing

A single small model rarely gives the best latency-quality trade-off for every prompt type. Short structured requests, such as JSON extraction and classification, differ sharply from code repair, and both differ again from prompts that need broader reasoning. RouteLLM describes routing as assigning simpler queries to weaker models and reserving stronger models for harder cases, while FrugalGPT reports that a learned cascade can preserve strong-model quality with very large cost reductions. Although those papers evaluate broader LLM portfolios, the underlying lesson transfers cleanly to a fleet of small specialized models: heterogeneity in request shape makes heterogeneity in model choice economically and operationally rational.

That conclusion rules out a router that behaves like another generative model call. RouteLLM explicitly treats effective routing as a pre-decision that minimizes cost and latency relative to broader multi-model execution, which means the dominant path should remain inside in-memory feature extraction and lookup.
Prompt length, requested output shape, language, code markers, safety category, session identity, and prior cache affinity are all signals that can be computed before any model is invoked. A practical design target is to keep that first decision under a millisecond, so its cost remains far below prefill and decode work. The moment the main path depends on an additional model inference, the latency budget starts competing with the very SLM call it is supposed to optimize.

Keeping the Decision Path Short

The cleanest design is a two-stage router. The first stage is deterministic and resolves obvious cases immediately. A short request demanding strict JSON can go to an extraction model. A prompt containing fenced code, compiler errors, or repository paths can go to a code model. A safety-sensitive request can be pinned to a policy model. Only when simple predicates fail to produce a confident mapping should the second stage run, and that second stage should be a lightweight complexity scorer rather than another generator.

Ray Serve’s request-routing API is built around this kind of custom replica selection, and its FIFO mixin is specifically intended for algorithms that can route requests as soon as they arrive without waiting for content-heavy processing. That is the right shape for an ultra-low-latency router: deterministic fast path first, optional scorer second.

A routing metadata object makes that design practical because it compresses request interpretation into cheap primitives:

Java
    record RoutingContext(
        int tokenCount,
        boolean codeRequest,
        boolean structuredOutput,
        String language,
        boolean repeatedPrefix,
        double complexityScore
    ) {}

This record is deliberately plain. Its fields are primitives plus one short string, which are cheap to serialize, cheap to log, and easy to replay during debugging.
That choice aligns with PyTorch and vLLM production notes on disaggregated serving, where complex metadata objects in scheduler paths increased serialization cost and hurt inter-token behavior, and it fits the general shape of request routers that repeatedly rank candidate replicas under load. The complexityScore field should therefore come from a compact classifier or calibrated heuristic trained offline on task outcomes, escalation rates, or preference labels, not from a runtime SLM call. The router’s intelligence belongs in the thresholds and features, not in an extra generation step.

The routing function should then read like admission control rather than orchestration:

Java
    ModelTarget route(RoutingContext ctx) {
        if (ctx.structuredOutput() && ctx.tokenCount() < 800) return ModelTarget.EXTRACTION_SLM;
        if (ctx.codeRequest()) return ModelTarget.CODE_SLM;
        if (ctx.complexityScore() > 0.72) return ModelTarget.REASONING_SLM;
        if (ctx.repeatedPrefix()) return ModelTarget.GENERAL_SLM_CACHE_HOT;
        return ModelTarget.GENERAL_SLM;
    }

The important detail is ordering. The cheapest predicates run first, the optional scorer appears only after clear task signals have been checked, and cache affinity refines the generic path instead of overriding obvious specialization. That mirrors how high-performance request routers rank candidates and then filter out replicas that are already saturated. Thresholds should be calibrated from observed latency and task-success data, but the architectural rule is stable: most traffic should leave the router with a decision produced entirely from fields already in memory.

Making Selection Cache-Aware

Cache-aware selection is where routing often starts to produce visible latency gains.
vLLM’s automatic prefix caching reuses KV cache from earlier queries when a new request shares the same prefix, allowing shared prompt computation to be skipped, and its design notes describe prefix caching as close to a free lunch because it avoids redundant work without changing outputs. SGLang reaches a similar result with RadixAttention, which keeps reusable KV state in a radix tree, adds LRU eviction, and applies cache-aware scheduling to improve hit rate while introducing only negligible overhead when no cache hit occurs. That combination matters because a fast model on a warm prefix can easily outperform a nominally better model on a cold path. Routing without cache awareness, therefore, leaves substantial latency savings on the table.

That is why a field such as repeatedPrefix, promptFamilyId, or session hash belongs in the routing context. Ray Serve exposes locality-aware and multiplex-aware helpers so that requests can prefer nearby replicas or replicas that already hold the relevant model, and Meta’s PyTorch and vLLM production write-up reports that sticky routing of the same session to the same prefill host significantly boosts prefix-cache hit rate, reaching 40% to 50% hit rate in the described deployment. The practical lesson is broader than that specific architecture. Similar prompt families should be steered toward the same warm replicas whenever possible, even if a purely load-balanced policy would have spread them evenly. Equal distribution is not the same thing as minimal latency once KV reuse becomes available.

Keeping the System Fast in Production

Once the routing logic is correct, the queueing policy and replica shape become the next sources of latency. Triton documents that dynamic batching combines requests to maximize throughput and allows bounded queue delay, while concurrent model execution and instance groups allow multiple copies of the same model to run in parallel on selected devices.
That argues for selective rather than universal batching. Short extraction or moderation SLMs often benefit from aggressive batching because their service time is small and predictable, while interactive reasoning models need tighter queue-delay bounds to prevent batching from inflating p95 latency. Replica placement matters as well. Heavy or frequently chosen models deserve more parallel instances, and cold-start penalties should be reduced through explicit warmup, since Triton notes that model warmup can prevent the slow initial inferences seen before a model is fully initialized.

Backpressure and observability complete the design. Ray Serve supports bounded queues and load shedding through max_queued_requests, and its autoscaling guidance ties lower ongoing-request targets to tighter latency objectives. Ray Serve LLM also exposes request latency, throughput, TTFT, and TPOT, while Triton exposes Prometheus metrics for GPU and request behavior. Those signals should be segmented by routed model, decision path, cache-hit class, and warm versus cold replica so that routing regressions become visible before they surface as user-facing tail latency. Without route-level telemetry, an apparently accurate router can quietly push traffic onto cold replicas, oversized queues, or cache-miss-heavy paths. In a low-latency SLM system, observability is not just for debugging. It is the only reliable way to keep routing policy aligned with actual serving behavior.

Conclusion

An ultra-low-latency routing layer for multiple SLMs is best treated as a serving primitive rather than as a separate intelligence feature. The strongest design keeps most requests on a deterministic first stage, invokes a lightweight complexity scorer only for ambiguous prompts, represents route state with compact metadata, and treats prefix locality as a first-class selection signal.
Around that core, warm replicas, selective batching, bounded queues, and route-level observability determine whether specialization actually improves latency or merely rearranges it. When routing is cheaper than a single token step and cache locality is preserved instead of ignored, a multi-SLM system stops looking like a collection of models and starts behaving like a disciplined low-latency inference fabric.
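The cache-aware, sticky selection discussed in the article (steer a prompt family toward the same warm replica, but skip replicas that are already saturated) can be sketched as follows. This is a minimal illustration, not Ray Serve's or any serving stack's API: the `Replica` record, the `StickySelector` class, the `maxQueued` threshold, and the use of simple modulo hashing are all assumptions.

```java
import java.util.List;

// Hypothetical selector for illustration; not part of any serving framework.
final class StickySelector {
    // Prefer the replica that a prompt family hashes to (likely warm prefix cache),
    // but fall back to the least-loaded replica when the preferred one is saturated.
    static Replica select(String promptFamilyId, List<Replica> replicas, int maxQueued) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("no replicas available");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int idx = Math.floorMod(promptFamilyId.hashCode(), replicas.size());
        Replica preferred = replicas.get(idx);
        if (preferred.queuedRequests() < maxQueued) {
            return preferred;
        }
        Replica least = preferred;
        for (Replica r : replicas) {
            if (r.queuedRequests() < least.queuedRequests()) {
                least = r;
            }
        }
        return least;
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("general-0", 1),
                new Replica("general-1", 3),
                new Replica("general-2", 0));
        System.out.println(select("billing-faq", replicas, 8).id());
    }
}

// Hypothetical replica descriptor; field names are illustrative assumptions.
record Replica(String id, int queuedRequests) {}
```

Modulo hashing reshuffles most prompt families whenever the replica count changes, which cools every cache at once; a consistent-hashing ring is a common refinement when replicas scale up and down frequently.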

By Akhil Madineni

The Latest Performance Topics