DZone
Performance

Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.

Latest Premium Content
Trend Report
Observability and Performance
Refcard #290
Getting Started With Log Management
Refcard #385
Observability Maturity Model

DZone's Featured Performance Resources

Compliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability


By Vikas Agarwal
(Note: A list of links for all articles in this series can be found at the conclusion of this article.)

The Scalability Wall

In previous posts of this COMPASS series, we demonstrated how OSCAL enables compliance-as-code from Catalogs through Component Definitions to System Security Plans (Part 3), how Compliance Policy Administration Centers bridge compliance to policy enforcement (Parts 4–7), and how these patterns scale to complex environments (Part 9). Yet organizations still hit a fundamental bottleneck: the relentless proliferation of regulatory frameworks.

Consider a financial services firm operating globally. They must simultaneously satisfy DORA in the EU, PCI DSS for payment processing, SOC 2 for SaaS customers, ISO 27001 for international contracts, NIST 800-53 for federal clients, and state-specific privacy regulations. Each framework brings hundreds of controls requiring documentation, implementation, testing, and evidence collection.

The traditional approach treats each as an independent program with separate teams, spreadsheets, and evidence repositories. When a new regulation emerges, such as NIST AI RMF or ISO 42001, the organization stands up yet another parallel program. This doesn’t scale. Operational overhead grows quadratically while security posture improves marginally.

The OSCAL v1.2.1 Mapping Model, introduced in March 2026, provides the architectural solution. By enabling systematic, machine-readable mappings between control frameworks, OSCAL transforms multi-framework compliance from an O(N²) problem to O(N). The result: evidence reuse, automated gap analysis, and genuinely scalable continuous compliance.

Why Spreadsheet Mappings Failed

The compliance industry has always understood that frameworks overlap. Multi-factor authentication appears in NIST 800-53 as IA-2, PCI DSS as Requirement 8.3, ISO 27001 as A.9.4.2, and SOC 2 as CC6.1. Implementing MFA once should count toward all four frameworks.
The challenge has been formalizing these relationships for compliance teams, auditors, and automation tooling. The predominant approach, manual crosswalk spreadsheets from consulting firms, suffers from fundamental flaws. They lack semantic precision. Does “maps to” mean controls are equivalent, or that one subsumes the other, or merely that they’re related? They exist as static documents disconnected from actual OSCAL artifacts. When frameworks update (PCI DSS v3.2.1 to v4.0, NIST 800-53 Rev 4 to Rev 5), spreadsheets become instantly stale. Most critically, they’re human artifacts, not machine-readable structures that automation can reason over.

OSCAL Mapping (see Figure 1) addresses each limitation. Mappings use formal set theory relationships:

  • equal: identical requirements
  • subset: A’s requirements fully contained in B
  • superset: A broader than B
  • intersects: partial overlap requiring delta analysis
  • not-applicable: no relationship, important for gap identification

Mappings are OSCAL documents following the same schema and validation rules as other artifacts. They are version-controlled alongside catalogs and SSPs, compose through standard reference mechanisms, and integrate into GitOps workflows. Mappings are bidirectional and composable, supporting Framework A to B, B to A, and transitive chains like EU AI Act to ISO 42001 to NIST 800-53.

When mappings are machine-readable, tooling automatically identifies which controls in a new framework are satisfied by existing implementations, which require deltas, and which are entirely new obligations. When mappings are version-controlled alongside frameworks, updates trigger automated validation rather than silent staleness.

Figure 1: Open Security Controls Assessment Language Models, with red highlight on the newly released Mapping Model

4 Architectural Patterns

OSCAL community experience across government agencies, enterprise IT, and emerging regulatory domains has converged on four patterns.
Pattern 1: Version-to-Version Mapping

Version-to-version mapping addresses regulation evolution within the same framework. When PCI DSS transitions from v3.2.1 to v4.0 or NIST 800-53 updates from Revision 4 to Revision 5, organizations face expensive manual analysis determining which existing implementations remain compliant and which need updates. Version mappings create explicit relationships between control versions using the same semantic types:

  • equal: the fundamental obligations remain unchanged despite text clarifications, so existing implementations automatically satisfy the new version with no work.
  • subset: the new version narrows the scope, so implementations may now exceed requirements.
  • superset: the new version adds requirements, and the mapping identifies exactly which delta implementations are needed.
  • intersects: a control splits or merges between versions, so the mapping documents the partial overlap and flags affected implementations for review.

NIST published official mappings from 800-53 Rev 4 to Rev 5, enabling organizations to query the OSCAL Mapping document and determine precisely which controls need reassessment. This automated impact analysis, traditionally requiring weeks of manual document comparison, now completes in hours and produces actionable gap reports showing new, modified, and unchanged requirements.

Pattern 2: Direct Framework Mapping

Direct mapping applies when an organization has established compliance with one framework and must demonstrate compliance with a second. A company with mature NIST 800-53 compliance that needs PCI DSS certification creates an OSCAL mapping artifact documenting relationships between NIST and PCI controls using semantic relationship types.
For controls with equal or superset relationships, where NIST fully satisfies PCI, the mapping provides machine-readable evidence that existing implementations already meet new requirements. For intersects relationships, where NIST partially addresses PCI, the mapping identifies exactly which delta requirements need attention. For PCI requirements with no NIST mapping, analysis immediately surfaces genuinely new obligations.

This pattern typically yields 40-60% coverage through equal or superset relationships. The remaining 40-60% splits between partial coverage requiring delta implementations and no coverage requiring new implementations. Implementation time drops from 12-18 months for greenfield programs to 4-6 months for delta implementation.

Pattern 3: Enterprise Baseline Mapping

Enterprise baseline mapping addresses organizations maintaining proprietary internal control frameworks that must map multiple external regulations to that baseline. IBM’s IT Security Standard (ITSS), Google’s Control Framework, and similar enterprise-specific sets represent distillations of institutional practices refined over decades.

External regulations map to the baseline, not to technical implementations. Technical implementations documented as Component Definitions map to baseline controls. Assessment results aggregate to the baseline. When auditors request framework-specific evidence, the organization computes it through the two-hop relationship: external framework to baseline to implementation evidence.

Adding a new framework requires only mapping it to the baseline, so the total number of mappings grows as O(N). Critically, it doesn’t require touching Component Definitions describing technical implementations or evidence collection infrastructure. The organization escapes O(N²) complexity, where each new framework forces a review of all existing mappings. The baseline provides stability and abstraction that direct external-framework-to-implementation mappings cannot offer.
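The two-hop lookup described above (external framework to baseline to implementation evidence) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the OSCAL Mapping schema or the Trestle API: the `Mapping` class, the `evidence_for` function, the control IDs such as `BASE-AUTH-01`, and the evidence names are all hypothetical, and the `relationship` field is assumed to describe the baseline control relative to the external control.

```python
from dataclasses import dataclass

# Assumption: only these relationships mean the baseline fully covers the external control.
FULL_COVERAGE = {"equal", "superset"}


@dataclass(frozen=True)
class Mapping:
    source: str        # external framework control (hypothetical ID)
    target: str        # enterprise baseline control (hypothetical ID)
    relationship: str  # baseline relative to external: equal, subset, superset, intersects


def evidence_for(control, framework_to_baseline, baseline_evidence):
    """Hop 1: external control -> baseline controls. Hop 2: baseline -> evidence IDs."""
    found = []
    for m in framework_to_baseline:
        if m.source == control and m.relationship in FULL_COVERAGE:
            found.extend(baseline_evidence.get(m.target, []))
    return sorted(set(found))


# Evidence is recorded once, against baseline controls only.
baseline_evidence = {
    "BASE-AUTH-01": ["mfa-policy-check"],
    "BASE-CFG-04": ["pod-security-scan"],
}

# Adding a framework means adding one mapping list; baseline_evidence is untouched.
pci_to_baseline = [
    Mapping("PCI-8.3", "BASE-AUTH-01", "equal"),
    Mapping("PCI-2.2", "BASE-CFG-04", "intersects"),
]

print(evidence_for("PCI-8.3", pci_to_baseline, baseline_evidence))
print(evidence_for("PCI-2.2", pci_to_baseline, baseline_evidence))
```

In this sketch the `intersects` control returns no reusable evidence, which is one conservative choice: partial overlap is treated as needing delta review rather than counting automatically.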
Pattern 4: Harmonized Framework Construction

The harmonized framework approach addresses organizations subject to multiple regulations simultaneously, where maintaining separate programs creates unacceptable overhead. The organization constructs a single unified catalog representing the superset of all applicable requirements. All Component Definitions implement controls from this harmonized catalog. Framework-specific compliance reports are computed by mapping the harmonized catalog back to each source regulation.

Construction begins with a foundational framework providing broad coverage, typically NIST 800-53. The organization maps the second framework to NIST, identifying requirements already covered, those requiring catalog extensions for deltas, and those representing entirely new obligations. The process repeats for each additional framework, mapping to the growing harmonized catalog.

Component Definitions implement harmonized controls once, automatically satisfying multiple frameworks simultaneously through mapping relationships. A single implementation for multi-factor authentication satisfies the harmonized control, which, through mappings, simultaneously satisfies NIST IA-2, PCI 8.3, ISO A.9.4.2, and any other mapped requirement.

Real-World Impact

IBM’s internal IT compliance manages compliance across more than 40 frameworks globally. The shift to OSCAL-based compliance with formal mappings uses ITSS as the enterprise baseline (Pattern 3). When the EU’s Digital Operational Resilience Act introduced new requirements, automated gap analysis through DORA-to-ITSS mapping identified that 65% of DORA requirements have equal or superset relationships to existing ITSS controls; existing implementations fully satisfy these with no new work. Another 25% have intersects relationships requiring delta implementations. The remaining 10% represent genuinely new obligations. This gap analysis, which required months of manual review before OSCAL, now completes in hours.
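A gap analysis of the kind described above amounts to bucketing each new requirement by its mapping relationship. The sketch below is illustrative only: `gap_report`, `percentages`, and the requirement IDs are hypothetical names, the input is a plain dictionary rather than an OSCAL document, and treating `subset` as a delta is an assumption made here, not something the article specifies.

```python
def gap_report(requirement_ids, relationships):
    """Bucket requirements by how existing baseline controls cover them.

    relationships maps requirement ID -> relationship name; a missing entry
    (or "not-applicable") is treated as a new obligation.
    """
    buckets = {"covered": [], "delta": [], "new": []}
    for req in requirement_ids:
        rel = relationships.get(req)
        if rel in ("equal", "superset"):
            buckets["covered"].append(req)   # existing implementation is enough
        elif rel in ("intersects", "subset"):
            buckets["delta"].append(req)     # assumption: partial coverage needs delta work
        else:
            buckets["new"].append(req)       # nothing in the baseline maps to it
    return buckets


def percentages(buckets):
    """Share of requirements in each bucket, as percentages."""
    total = sum(len(ids) for ids in buckets.values())
    return {name: 100 * len(ids) / total for name, ids in buckets.items()} if total else {}


# Hypothetical requirement IDs for illustration.
requirements = ["R1", "R2", "R3", "R4"]
relationships = {"R1": "equal", "R2": "superset", "R3": "intersects"}

report = gap_report(requirements, relationships)
print(report)
print(percentages(report))
```

Run against a real mapping, the three bucket sizes correspond to the covered, delta, and new shares quoted for the DORA-to-ITSS analysis.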
IBM reduced average time-to-compliance from 12-18 months to 4-6 months, achieved 70% reduction in duplicate documentation, and cut ongoing assessment effort by 3x through evidence reuse.

Integration With Continuous Compliance

OSCAL mappings integrate seamlessly into the compliance-to-policy workflows detailed in earlier parts of this series. Without mappings, separate Component Definitions would document pod security policies against NIST controls and again against PCI controls, duplicating validation logic.

With mappings, a harmonized catalog contains a control for pod security policies with documented mappings to both NIST 800-53 SC-7 and PCI DSS Requirement 2.2. A single Component Definition implements this harmonized control by referencing OPA policies validating pod security. When OPA validation runs and collects evidence, that evidence is recorded against the harmonized control. Assessment result computation applies the mappings: evidence satisfying the harmonized control automatically counts toward both NIST SC-7 and PCI 2.2.

Validation logic is written once, maintained once, executed once, and produces evidence once, but that single evidence artifact satisfies multiple regulatory obligations simultaneously through the mapping layer. The mapping layer enables unified compliance posture dashboards showing which technical implementations satisfy which framework requirements through which mapping relationships.

The Path Forward

A critical challenge is mapping quality and consensus. The relationship between controls can be interpreted differently by different domain experts. The OSCAL Foundation addresses this through collaborative mapping development in public repositories. Mappings are authored in Git using Trestle-based workflows. Pull requests propose mappings or modifications. Domain experts from multiple organizations review and discuss each mapping, documenting disagreements and consensus rationale.
Approved mappings merge into canonical repositories with clear provenance and confidence scores. NIST has published official mappings between NIST 800-53 Revision 5 and NIST Cybersecurity Framework 2.0. FedRAMP is developing mappings from FedRAMP baselines to broader NIST 800-53 controls. The OSCAL Foundation community is creating mappings between major frameworks like ISO 27001, PCI DSS, SOC 2, and NIST 800-53. These mappings are version-controlled with clear change tracking, update when frameworks evolve, and accumulate community feedback, improving accuracy.

The next frontier involves extending mappings to emerging regulatory domains. The AI safety landscape (NIST AI RMF, EU AI Act, ISO 42001) represents a new compliance frontier where mapping infrastructure can prevent fragmentation. By establishing OSCAL mappings between AI regulations from the outset, the community can enable organizations to implement AI safety controls once and automatically satisfy multiple frameworks through mapping relationships.

The next article in this series will demonstrate how these layers combine to address AI safety and governance, showing how Component Definitions for AI technical stack components (KServe, LangChain, MLflow) map to AI-specific regulations (NIST AI RMF, EU AI Act, ISO 42001) through OSCAL mapping mechanisms, enabling continuous compliance at scale for AI systems.

References

OSCAL Mapping Resources:
  • OSCAL v1.2.1 Mapping
  • OSCAL Mapping Best Practices White Paper: Community review at LF AI & Data Security & Compliance WG

OSCAL Foundation Community Presentations:
  • NIST OSCAL Workshop (July 2025): “Collaboratively Maturing the OSCAL Control Mapping Model”
  • SunStone OSCAL PlugFest (May 2025): “OSCAL Mapping Scenarios and Process”

Tools:
  • Compliance-Trestle
  • NIST OSCAL

The authors welcome collaboration through the OSCAL Foundation and LF AI & Data Security & Compliance Working Group.
Below are the links to other articles in this series:

  • Compliance Automated Standard Solution (COMPASS), Part 1: Personas and Roles
  • Compliance Automated Standard Solution (COMPASS), Part 2: Trestle SDK
  • Compliance Automated Standard Solution (COMPASS), Part 3: Artifacts and Personas
  • Compliance Automated Standard Solution (COMPASS), Part 4: Topologies of Compliance Policy Administration Centers
  • Compliance Automated Standard Solution (COMPASS), Part 5: A Lack of Network Boundaries Invites a Lack of Compliance
  • Compliance Automated Standard Solution (COMPASS), Part 6: Compliance to Policy for Multiple Kubernetes Clusters
  • Compliance Automated Standard Solution (COMPASS), Part 7: Compliance-to-Policy for IT Operation Policies Using Auditree
  • Compliance Automated Standard Solution (COMPASS), Part 8: Agentic AI Policy as Code for Compliance Automation With Prompt Declaration Language
  • Compliance Automated Standard Solution (COMPASS), Part 9: Taking OSCAL-Compass to Industry Complexity Level
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines


By Vivek Venkatesan
The Pipeline Did Not Fail Cleanly

Most pipeline failures don't look like "the job failed." Consider a common scenario. A Glue job reads overnight event files, applies business rules, and writes to an Iceberg curated table. The job runs at its scheduled time and errors out partway through. The control table shows SUCCESS for the previous batch and FAILED for the current one, which is what you'd expect. The problem is what happened between those two states: the job wrote nine of the day's twelve partitions to the staging table before failing. A downstream report ran on its own schedule, picked up the partial data, and the discrepancy didn't surface until a downstream consumer noticed records were missing.

By the time someone looks at the failure, the question is no longer "Why did the job fail?" It's "Is it safe to rerun, and what's already corrupted downstream?"

That's where debugging gets messy. The evidence is spread across CloudWatch logs, Glue run metadata, the source S3 path, record counts, data quality results, target table state, and Iceberg snapshots. An experienced engineer can connect those signals, but it takes time, and a less experienced engineer often misses one. In a busy production environment that delay leads to blind reruns, duplicate records, overwritten partitions, or worse.

The frustrating part is that the evidence existed. The pipeline just had no structured way to explain itself. That's the gap a triage layer can fill. Not by fixing the pipeline. Not by changing schemas. Not by restarting jobs. By observing the evidence already produced, classifying the failure, explaining what likely happened, and recommending what to do next.

What Agentic Observability Means

The word "agentic" gets misused a lot right now, especially in data engineering. It's worth being precise. An agentic observability layer is not an LLM with permission to control production.
It's a controlled workflow that collects pipeline evidence, builds incident context, classifies the failure against known categories, and produces a structured recommendation. The loop is observe, classify, explain, recommend, and that's where it stops. Everything past "recommend" stays with engineers, deterministic rules, or approval workflows.

The difference from normal alerting is the depth of the output. A normal alert says "Glue job daily_customer_interactions failed." An agentic observability layer should produce something closer to: "The job failed because the input contains a new column not present in the curated schema. The staging write started before the failure, so a blind retry will create duplicate records. Quarantine the batch, review the schema contract, and rerun with the same batch_id after validation."

That difference is what saves time during an incident. The goal isn't replacing engineers. It's reducing the manual triage work needed before someone can make a real decision.

Reference Architecture

This does not need to start as a new platform. The triage layer can sit beside existing Glue pipelines and consume signals that already exist.

Figure 1. Agentic observability flow for AWS Glue pipelines. Pipeline evidence is collected, converted into structured context, analyzed by an LLM triage layer, and returned as a structured incident output.

The component that matters most here is the incident context builder. The LLM should never receive a raw dump of ten thousand log lines. That produces noisy, low-confidence output and burns tokens. The collector should pull a curated set of signals: Glue job name and run ID, status and duration, batch ID, source path, target table, the last fifty error log lines, data quality results, record counts, attempt count, recent deployment version, table snapshot or commit ID, and control table status. That's enough context to analyze the failure without guessing from disconnected log lines.
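As one small piece of that collector, the "last fifty error log lines" signal could come from a filter like the one below. This is a minimal sketch under assumptions: `select_error_lines` and the keyword pattern are hypothetical, the input is a plain list of log strings already fetched from wherever the logs live, and real Glue or Spark logs may need a different pattern.

```python
import re

# Assumption: these markers are a reasonable first cut at "error-like" lines.
ERROR_PATTERN = re.compile(r"ERROR|FATAL|Exception|Traceback")


def select_error_lines(log_lines, limit=50):
    """Keep only error-like lines and return the most recent `limit` of them."""
    if limit <= 0:
        return []
    matches = [line for line in log_lines if ERROR_PATTERN.search(line)]
    return matches[-limit:]


# Hypothetical log lines for illustration.
logs = [
    "INFO Starting job daily_customer_interactions",
    "ERROR Failed to cast column device_type",
    "INFO Writing partition 9 of 12",
    "org.apache.spark.sql.AnalysisException: cannot resolve column",
]
for line in select_error_lines(logs):
    print(line)
```

Filtering before truncating matters here: taking the last fifty raw lines first could discard the original error if the job logged a long shutdown sequence afterward.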
Where This Fits

Before going further, one thing worth being honest about: this pattern depends on the platform already having its house in order. The agent can only work with the observability that the platform already has. It is not a substitute for basic pipeline hygiene. It works when the platform tracks batch IDs, clear source paths, data quality results, structured logs, table commits, deployment versions, and ownership mapping. Without those signals, the agent has very little to reason over.

If a pipeline doesn't track batch IDs, the agent can't reliably tell whether a run is a retry or a new batch. If quality results aren't stored, it can't reason about input validity. If table commits aren't tracked, it can't tell whether the failure happened before or after a write.

LLMs don't create observability. They summarize and reason over the observability that already exists. The teams that get the most out of this pattern are the ones with disciplined data engineering underneath.

Failure Categories

Manual debugging takes time, partly because every failure looks unique at first glance. Most don't stay unique once you classify them. A small fixed set of categories makes the output easier to review, compare, and route.
| Failure category | Common signals | Recommended action |
|---|---|---|
| Schema drift | New column, missing column, cast failure, contract mismatch | Quarantine the batch and review the schema contract |
| Data skew | Long-running tasks, shuffle spill, uneven partitions | Repartition or isolate skewed keys |
| Small file pressure | High file count, slow planning, frequent commits | Compact affected partitions |
| Source delay | Missing input path, low record count, late file arrival | Wait, retry later, or mark the batch delayed |
| Code regression | Recent deployment plus transformation error | Roll back or compare with the previous run |
| Permission issue | Access denied, catalog failure, IAM or Lake Formation error | Fix access policy before retrying |
| Partial write risk | Failure after write started | Check staging and control tables before rerun |
| Unknown | Weak or conflicting evidence | Escalate to an engineer with summarized context |

The category list isn't only documentation. It's part of the system contract. The agent picks from this list rather than inventing categories on each run, which makes downstream routing tractable. Schema drift can go to the data contract owner. Permission issues route to the platform team. Source delays go to the ingestion owner. Partial write risk triggers a manual review workflow rather than auto-retry. This is what makes the system more useful than a chatbot that summarizes logs.

Structured Incident Output

The output should also be structured. Free-form summaries help humans skim, but they're hard to store, compare, or evaluate over time. JSON works better because it can be written to an incident table and consumed by Slack, Teams, Jira, or ServiceNow without parsing prose.
JSON

    {
      "pipeline_name": "daily_customer_interactions",
      "job_run_id": "jr_2026_05_02_001",
      "status": "FAILED",
      "failure_category": "SCHEMA_DRIFT",
      "likely_root_cause": "Input file contains a new column named device_type that is not defined in the curated table schema.",
      "affected_source_path": "s3://raw/events/date=2026-05-02/",
      "affected_table": "curated.customer_interactions",
      "safe_to_retry": false,
      "recommended_action": "Quarantine the batch, update the schema contract, and rerun with the same batch_id after validation.",
      "confidence": 0.87
    }

A structured output gives engineers a quick summary, and it gives downstream tools something reliable to use. If safe_to_retry is false, the orchestrator blocks automatic retry. If failure_category is PERMISSION_ERROR, the issue routes to the platform queue. If confidence is low, the system asks for human review. If the same failure category recurs across runs, dashboards can track it over time.

One important framing point: the LLM is not the system of record. The control table, logs, table metadata, and quality checks remain the source of truth. The agent is a reasoning layer that produces structured evidence on top of that.

Implementation Sketch

A simple implementation starts with assembling the incident context. The example below is intentionally simplified. In production, the LLM call should use structured outputs or schema-validated responses rather than free-form text parsing.
Python

    import json


    def build_incident_context(job_run, control_record, dq_results, recent_logs):
        """Assemble a compact, curated context from signals the pipeline already emits."""
        completed_on = job_run.get("CompletedOn")
        return {
            "job_name": job_run["JobName"],
            "job_run_id": job_run["Id"],
            "status": job_run["JobRunState"],
            "started_on": str(job_run["StartedOn"]),
            # Keep None for a run that never completed instead of the string "None".
            "completed_on": str(completed_on) if completed_on else None,
            "batch_id": control_record.get("batch_id"),
            "source_path": control_record.get("source_path"),
            "target_table": control_record.get("target_table"),
            "attempt_count": control_record.get("attempt_count"),
            "control_status": control_record.get("status"),
            "data_quality_results": dq_results,
            # Cap log volume: only the last fifty lines reach the model.
            "recent_error_logs": recent_logs[-50:],
        }

The classifier receives a fixed category list and explicit rules about what it shouldn't recommend.

Python

    def classify_failure(llm_client, incident_context):
        """Ask the model to classify the failure; llm_client.invoke is a placeholder client call."""
        # Serialize the context as JSON rather than relying on the Python dict repr.
        context_json = json.dumps(incident_context, indent=2, default=str)
        prompt = f"""
        You are analyzing a failed data pipeline run.

        Classify the failure into one of these categories:
        SCHEMA_DRIFT, DATA_SKEW, SOURCE_DELAY, PERMISSION_ERROR,
        CODE_REGRESSION, PARTIAL_WRITE_RISK, SMALL_FILE_PRESSURE, UNKNOWN.

        Return only valid JSON with:
        failure_category, likely_root_cause, safe_to_retry,
        recommended_action, confidence.

        Rules:
        - Do not recommend a retry if there is partial write risk.
        - Do not recommend schema changes without human review.
        - Do not recommend permission changes without platform approval.
        - Use UNKNOWN when evidence is weak or conflicting.

        Incident context:
        {context_json}
        """
        return llm_client.invoke(prompt)

In a real implementation, this prompt should be paired with a strict response schema (failure_category as an enum, likely_root_cause as a string, safe_to_retry as a boolean, recommended_action as a string, confidence as a float between 0 and 1), and the system should reject any output that doesn't match. In production, structured outputs are the better choice when the API supports them. The free-form prompt above is illustrative.
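The "reject any output that doesn't match" step can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: `parse_triage_response` is a hypothetical helper, the check that no extra fields are present is an added assumption, and a production system might prefer a schema library or the LLM provider's structured-output feature.

```python
import json

# The fixed category list from the classifier prompt.
CATEGORIES = {
    "SCHEMA_DRIFT", "DATA_SKEW", "SOURCE_DELAY", "PERMISSION_ERROR",
    "CODE_REGRESSION", "PARTIAL_WRITE_RISK", "SMALL_FILE_PRESSURE", "UNKNOWN",
}
EXPECTED_FIELDS = {
    "failure_category", "likely_root_cause", "safe_to_retry",
    "recommended_action", "confidence",
}


def parse_triage_response(raw_text):
    """Parse the classifier's reply and raise ValueError on any schema mismatch."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if set(data) != EXPECTED_FIELDS:
        # Assumption: extra fields are rejected as well as missing ones.
        raise ValueError(f"field mismatch: {sorted(set(data) ^ EXPECTED_FIELDS)}")

    category = data["failure_category"]
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError("failure_category is not in the fixed category list")
    for key in ("likely_root_cause", "recommended_action"):
        if not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string")
    if not isinstance(data["safe_to_retry"], bool):
        raise ValueError("safe_to_retry must be a boolean")

    confidence = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return data
```

A caller would wrap this in a `try`/`except ValueError` and treat a rejected response the same way as an UNKNOWN classification: escalate to a human rather than retrying the job.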
The result gets stored, not acted on:

Python

    from datetime import datetime, timezone
    from decimal import Decimal


    def store_incident_summary(summary, incident_table):
        """Persist the triage summary; nothing here triggers a retry or a fix."""
        incident_table.put_item(
            Item={
                "pipeline_name": summary["pipeline_name"],
                "job_run_id": summary["job_run_id"],
                "failure_category": summary["failure_category"],
                "safe_to_retry": summary["safe_to_retry"],
                "recommended_action": summary["recommended_action"],
                # Assumption: if incident_table is a boto3 DynamoDB Table,
                # Python floats are rejected, so store confidence as Decimal.
                "confidence": Decimal(str(summary["confidence"])),
                # UTC timestamp replaces the undefined current_timestamp() helper.
                "created_at": datetime.now(timezone.utc).isoformat(),
            }
        )

The agent writes an explanation. Other systems decide what to do with it.

What the Agent Should Never Decide

This boundary is the most important design choice in the whole pattern, and it's worth being explicit about. An observability agent helps engineers understand a failure. It does not control production data systems. Even at high confidence, certain actions stay out of scope:

  • Changing table schemas
  • Granting IAM or Lake Formation permissions
  • Deleting data
  • Marking a partially written batch as successful
  • Overriding data quality failures
  • Promoting quarantined data
  • Rewriting production tables
  • Triggering cross-pipeline backfills
  • Compacting or expiring table snapshots without approval

These actions move from observability into production control, and that line should stay clear. In regulated or business-critical environments, the safest design lets the agent produce structured evidence and recommendations while deterministic rules, approval workflows, or engineers decide whether anything actually executes.

An agent saying "this looks like schema drift, the batch is not safe to retry" is useful. The same agent updating the curated table schema on its own is not. It's a future incident waiting to happen. Same with permissions: the agent flagging an IAM issue is useful; the agent granting itself access is a security violation.

The trade-off here is real. Letting the agent take action would reduce the mean time to recovery.
But the cost of a confident wrong action (silently corrupted data, an unauthorized permission grant, a dropped partition) is much higher than the cost of a few extra minutes of human review. In a regulated data environment, that trade-off is usually easy to justify.

This matters as teams move toward self-healing pipelines. Before a pipeline can safely fix itself, it has to first explain itself reliably, at scale, with measurable accuracy. That bar isn't met yet in most production environments.

Evaluating the Triage Layer

A triage layer should be evaluated like any other production component. "The summary looks good" is not an evaluation.

To check whether the pattern behaves reasonably, a small synthetic evaluation can be assembled across common Glue failure modes. Each scenario includes a short set of log lines, control-table state, data quality results, and table metadata, and the agent is scored on two things: whether it picks the correct failure category, and whether the safe_to_retry decision is appropriate.

This is a starter evaluation, not a benchmark. Ten synthetic scenarios are enough to sanity-check the design. A real production rollout needs hundreds of labeled historical incidents, edge cases, and human-reviewed outcomes. Anything less should be treated as an early prototype, not production validation.
| Scenario | Expected category | Agent category | Safe-to-retry decision |
|---|---|---|---|
| Missing source path | SOURCE_DELAY | SOURCE_DELAY | Correct |
| New column in input | SCHEMA_DRIFT | SCHEMA_DRIFT | Correct |
| Access denied on catalog table | PERMISSION_ERROR | PERMISSION_ERROR | Correct |
| Shuffle spill and one long task | DATA_SKEW | DATA_SKEW | Correct |
| Failure after staging write | PARTIAL_WRITE_RISK | PARTIAL_WRITE_RISK | Correct |
| Too many small files | SMALL_FILE_PRESSURE | SMALL_FILE_PRESSURE | Correct |
| Recent code deployment plus null pointer | CODE_REGRESSION | CODE_REGRESSION | Correct |
| Low record count, no hard error | SOURCE_DELAY | UNKNOWN | Conservative escalation |
| Cast failure due to bad input value | SCHEMA_DRIFT | SCHEMA_DRIFT | Wrong, recommended retry |
| Conflicting log signals | UNKNOWN | UNKNOWN | Correct escalation |

In a small evaluation like this one, a well-designed classifier should pick the expected category in most scenarios and, more importantly, get the safe-to-retry decision right in nearly all of them. The illustrative results above show eight correct retry decisions, one conservative escalation (the agent returns UNKNOWN rather than guessing), and one wrong call.

That wrong call is the most instructive. On the cast failure, the agent classifies the issue correctly as schema drift but recommends cleanup-and-retry instead of quarantine-and-contract-review. A wrong root cause is inconvenient. A wrong retry recommendation can corrupt data. Safe-retry precision should be weighted higher than classification accuracy when evaluating this kind of system, and that weighting should be reflected in the prompt rules and in the validation rubric.
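The scoring described above can be made concrete with a short script over the ten illustrative rows. This is a sketch under assumptions: the `score` function, the 2:1 weighting, and the "retry-safe rate" (which counts a conservative escalation as safe) are choices made here for illustration, and the rate is a simplification of true safe-retry precision, which would need to know which scenarios actually received a retry recommendation.

```python
# (expected_category, agent_category, retry_outcome) copied from the table above.
RESULTS = [
    ("SOURCE_DELAY", "SOURCE_DELAY", "correct"),
    ("SCHEMA_DRIFT", "SCHEMA_DRIFT", "correct"),
    ("PERMISSION_ERROR", "PERMISSION_ERROR", "correct"),
    ("DATA_SKEW", "DATA_SKEW", "correct"),
    ("PARTIAL_WRITE_RISK", "PARTIAL_WRITE_RISK", "correct"),
    ("SMALL_FILE_PRESSURE", "SMALL_FILE_PRESSURE", "correct"),
    ("CODE_REGRESSION", "CODE_REGRESSION", "correct"),
    ("SOURCE_DELAY", "UNKNOWN", "conservative"),
    ("SCHEMA_DRIFT", "SCHEMA_DRIFT", "wrong"),
    ("UNKNOWN", "UNKNOWN", "correct"),
]


def score(results, retry_weight=2.0):
    """Compute category accuracy, retry-safe rate, and a weighted combination."""
    total = len(results)
    category_hits = sum(1 for expected, agent, _ in results if expected == agent)
    unsafe_calls = sum(1 for _, _, outcome in results if outcome == "wrong")
    accuracy = category_hits / total
    retry_safe_rate = (total - unsafe_calls) / total
    # Assumption: retry safety counts retry_weight times as much as category accuracy.
    weighted = (accuracy + retry_weight * retry_safe_rate) / (1 + retry_weight)
    return {
        "classification_accuracy": accuracy,
        "retry_safe_rate": retry_safe_rate,
        "weighted_score": weighted,
    }


for name, value in score(RESULTS).items():
    print(f"{name}: {value:.2f}")
```

Note that the one category miss and the one unsafe retry fall on different scenarios, which is why tracking the two numbers separately is more informative than a single accuracy figure.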
The metrics worth tracking in production:

| Metric | Why it matters |
|---|---|
| Classification accuracy | Whether the agent identifies the right failure type |
| Safe-retry precision | Whether retry recommendations are actually safe |
| False confidence rate | Confident-but-wrong recommendations |
| Mean triage time | Reduction in manual debugging time |
| Human override rate | How often engineers reject the recommendation |
| Cost per incident | LLM and log-processing cost per failed run |

False confidence rate deserves attention. A low-confidence wrong answer is manageable because engineers know to scrutinize it. A high-confidence wrong answer is dangerous because teams stop scrutinizing. Confidence belongs in the output, but it should never be treated as truth. It's one signal among several in the routing decision.

Closing

Glue job failures aren't hard because the logs are long. They're hard because the evidence is scattered across logs, run metadata, data quality results, control tables, and table commits, and an engineer has to assemble it before deciding what to do next. An agentic observability layer turns that scattered evidence into a structured incident summary.

The safest version of this pattern is controlled triage, not autonomous repair: observe, classify, explain, recommend, and stop there. Deterministic rules, approval workflows, and engineers decide what happens next. Before pipelines can fix themselves, they need to explain themselves. That's the work worth doing first.
Data Contracts as the "Circuit Breaker" for Model Reliability
By SRIRAMPRABHU RAJENDRAN
Every Cache Miss Is a Tiny Tax on Your Performance
By Jayapragash Dakshnamurthy
Implementing Observability in Distributed Systems Using OpenTelemetry
By Mugunth Chandran
Chaos Engineering Has a Blind Spot. Agentic AI Lives in It.

Your chaos experiments passed. Your RAG pipeline is lying to you anyway.

I've watched this play out more times than I'd like to admit. A team runs a thorough chaos suite, including pod failures, network partitions, and database failovers. Everything recovers cleanly. Dashboards stay green. The team ships with confidence. Three weeks later, a support ticket surfaces. Then ten more. The AI is producing answers that are fluent, confident, and factually wrong. No alert fired. No SLO breached. The infrastructure never blinked.

This isn't a monitoring gap you close with a better dashboard. It's a category error in how we've defined resilience for AI systems, and until you see that distinction clearly, every chaos experiment you run is measuring the wrong thing.

The Assumption That's Been Quietly Wrong

For fifteen years, chaos engineering has operated on one core premise: the system's meaningful state is its operational state. Is it up? Does it recover? Can it handle a node failure at 2 AM? For systems built around databases, queues, and network hops, these are exactly the right questions. The entire discipline of Chaos Monkey, Gremlin, LitmusChaos, and AWS FIS was built to answer them.

Agentic AI systems break this premise at the foundation. They're not distributed systems in the traditional sense. They're reasoning systems. And reasoning systems have two states you need to care about simultaneously:

| State dimension | Traditional distributed system | Agentic AI system |
| --- | --- | --- |
| What "healthy" means | Service is up, latency within SLA | Outputs remain grounded in source truth |
| How failure manifests | 5xx errors, timeouts, crashes | Silent drift, confident wrong answers |
| Time to detect | Seconds to minutes | Days to weeks, if ever |
| Failure unit | Request or service | Behavior over time |
| Circuit breaker analogy | Trips on error rate | No native equivalent |
| What chaos tests | Infrastructure recovery | ✗ Cannot test behavioral integrity |

That last row is the entire problem.
As Marc Bishop, Director of Business Growth at Wytlabs, put it after his team's retrieval embeddings drifted silently under catalog updates: "Resilience for AI means validating behavior under stress, not merely surviving it."

I hold U.S. Patent 12242370B2 for intent-based chaos engineering, a framework that treats intent preservation, not just infrastructure recovery, as the core testable property of a resilient system. When I developed that framework, the failure mode I was targeting was a multi-domain infrastructure losing semantic coherence under adversarial conditions. I didn't fully anticipate how precisely that same problem would show up in production LLM pipelines and how fast.

What's Actually Breaking: Five Failure Modes Nobody Has Named Yet

You can't test for something you haven't named. The existing chaos engineering literature has no vocabulary for AI behavioral failure. Here's a working taxonomy from production accounts across 25+ engineering teams:

1. Retrieval Drift

The vector retrieval layer silently shifts toward faster, lower-precision matches after a failure event. Outputs remain structurally valid but are grounded in the wrong documents.

Rafael Sarim Oezdemir, Head of Growth at EZContacts, ran chaos injection on their RAG-based customer support chatbot. His infrastructure numbers post-chaos looked perfect: 99.99% uptime, clean latency recovery, green across the board. Three days later, the chatbot was answering return policy questions incorrectly in 7% of cases. Root cause: "Our chatbot started answering return policy questions incorrectly. We diagnosed the root cause as a subtle shift in retrieval precision; our pipeline was favoring quicker, less precise vector matches post-chaos. Infrastructure recovered. The behavior of the model didn't."

No existing chaos tool measures retrieval precision. That's the gap.

2. Context Amnesia

Each individual component in a multi-agent pipeline appears healthy, but the end-to-end reasoning chain becomes incoherent across hops.

Luis Haberlin at CallSetter AI watched this unfold in a voice agent for an insurance brokerage: "The infrastructure was bulletproof... but often into production, agents started hearing 'I already told the robot about my home and auto' from confused callers." The agent correctly retrieved policy details early in a conversation, then lost context at the 90-second mark and restarted the needs assessment from scratch. Nothing crashed. The reasoning rotted at the handoff boundary.

Jacob Kalvo, CEO of Live Proxies, hit the same wall in a market analytics pipeline: "While each summary was technically provided on schedule, there were small errors beginning to creep into the output, specific market signals being under-represented, inconsistencies developing in the logic chain, and some outputs making confident assertions regarding incorrect or misleading information." Every infrastructure check passed. The reasoning chain had silently decohered.

3. Confidence-Accuracy Decoupling

The model produces high-confidence, well-formatted outputs even as accuracy degrades. The system sounds more certain as it becomes less reliable.

Jayanand Sagar, COO at Hyperbola Network, saw this after a partial node recovery rebuilt the retrieval index from a stale snapshot. Output quality deteriorated over 11 days, undetected: "The model never complained. The closer the degraded output was to the original, the more convincingly it generated confident-sounding responses based on outdated context."

Confidence scores are not accuracy proxies. A model grounded in a degraded context will confidently state incorrect information. No infrastructure metric tells you this is happening.

4. Intent Drift

Outputs gradually decohere from the original business intent without any single triggering event. Behavior changes incrementally, across dozens of interactions, with no failure timestamp to anchor an investigation.

Tyler Denk, CEO of beehiiv, described a system that passed every load and failure scenario correctly in testing, then shifted over longer production cycles: "The structure of responses remained intact, but subtle inconsistencies in reasoning and formatting started appearing across different workflows. Without a defined behavioral baseline, it became impossible to determine when the system had actually started drifting."

5. Epistemic Failure

The model's picture of the world becomes stale or wrong, but all reasoning over that picture continues to function correctly. The system is reasoning well, about incorrect premises.

Nicolas, founder of Reddinbox, runs a production AI pipeline classifying Reddit posts in real time across thousands of threads daily. "A few months back, everything looked fine. No downtime, no errors, latency normal. But output quality had quietly decayed." Reddit's content distribution had shifted, flooded with AI-generated posts that were structurally coherent but semantically hollow, and his classifier kept returning high-confidence scores on them.

His diagnosis is the sharpest framing I've seen for why infrastructure chaos is blind to this failure class: "No chaos experiment would have caught that because the failure wasn't infrastructure, it was epistemic. We had zero observability on input distribution drift. We were watching the system, not what the system was consuming."

Why Agentic Pipelines Make Every One of These Worse

A single degraded LLM component is a tractable problem. A multi-agent pipeline turns it into something that actively resists detection.

In a traditional microservice, a degraded component returns an error, trips a circuit breaker, and gets isolated. In a multi-agent pipeline, a degraded reasoning component returns a confident output that propagates forward, amplifying the failure rather than surfacing it.

Dario Ferrari, co-founder of OpenClawVPS, watched this play out firsthand when a client's RAG-based customer support system passed all infrastructure tests but then silently shifted retrieval behavior after a network partition: "AI infrastructure that survives every test but provides incorrect answers is still resilient but fails its job badly."

The blast radius of an undetected reasoning failure grows with every agent hop. By the time users notice, it has compounded through multiple layers of stored state.

The Missing Layer: Behavioral Assertions

Brandy Hastings, SEO Strategist at SmartSites, described the realization her team came to after AI-assisted workflows passed every infrastructure check but degraded in production: "We realized our testing didn't account for output quality over time. We were validating uptime, not alignment."

That gap between uptime and alignment is where every one of the five failure modes above lives. Most teams have three layers of observability, and only two of them are working:

Layer 2 is where all the interesting failures live, and it's completely absent from most production stacks. Building it requires three things your current chaos practice almost certainly lacks:

  • Behavioral contracts – not "returns a 200 response" but "returns a response with retrieval precision above threshold X when operating on a degraded index." These are the AI equivalent of SLOs, except the metric is semantic rather than operational.
  • Intent-preserving chaos experiments – injecting failures at the data layer, retrieval layer, and reasoning layer, not just infrastructure. Each experiment needs an exit criterion that includes behavioral scoring against a fixed ground-truth set, not just recovery metrics.
  • Post-chaos behavioral scoring – sampling outputs after every chaos run and scoring them against a baseline.

Jayanand Sagar put a concrete benchmark on the minimum viable version: "An exponential run of chaos should pass behavioral standards to be within 3 to 5 percent of baseline scores of at least 50 sampled outputs before a system is declared stable."

Jake Waldrop, Co-Founder of Recademics, a regulated outdoor safety certification platform, independently arrived at this same framing: "Semantic monitoring fills the gap between AI health and user safety by verifying what the AI is saying. My most significant change was to run adversarial prompts on standard stress tests to understand whether the model logic would collapse. Chaos engineering will have a colossal safety advantage when behavioral checks are integrated into any company operating within highly regulated industries."

Oksana Fando, CDO at Truck1.eu, reached the same conclusion after equipment descriptions on their European vehicle marketplace gradually became less accurate following a data source degradation, a failure invisible to every standard metric: "We began testing the system's intent, checking whether business logic remains correct even with partial data loss."

Testing system intent. That's exactly the property my patent formalizes. The fact that teams in healthcare, fintech, edtech, and European e-commerce are all independently converging on this is no coincidence. It's a structural gap making itself known.

A Behavioral Observer You Can Drop In This Week

The pattern is a sampling observer sitting in your serving layer. Replace _score() with RAGAS faithfulness, embedding cosine similarity, or an LLM-as-judge evaluator, depending on your quality rubric. The heuristic below is a working default: groundedness (how much of the response is anchored in retrieved docs) minus a penalty for hedging language that signals confidence erosion.
```python
import random


class BehavioralObserver:
    def __init__(self, sample_rate=0.05, drift_threshold=0.15, baseline_size=50):
        self.sample_rate = sample_rate
        self.drift_threshold = drift_threshold
        self.baseline_size = baseline_size
        self.scores = []
        self.baseline = None

    def observe(self, prompt, response, context):
        # Score only a random sample of requests to keep overhead low.
        if random.random() > self.sample_rate:
            return
        score = self._score(response, context)
        if self.baseline is None:
            # Phase 1: build baseline, then lock it once enough samples exist
            self.scores.append(score)
            if len(self.scores) >= self.baseline_size:
                self.baseline = sum(self.scores) / len(self.scores)
            return
        # Phase 2: detect drift against the locked baseline
        drift = self.baseline - score
        if drift > self.drift_threshold:
            print(f"[DRIFT ALERT] score={score:.3f} baseline={self.baseline:.3f} drift={drift:.3f}")
            # pagerduty.trigger(...) or datadog.metric("ai.behavioral.drift", drift)

    def _score(self, response, context):
        # Groundedness: share of response terms that appear in the retrieved docs.
        doc_words = set(" ".join(context.get("retrieved_docs", [])).lower().split())
        terms = response.lower().split()
        groundedness = len([t for t in terms if t in doc_words]) / max(len(terms), 1)
        # Subtract 0.05 for each hedging phrase present in the response.
        hedges = ["i think", "not sure", "might be", "possibly"]
        return max(groundedness - sum(0.05 for h in hedges if h in response.lower()), 0.0)


# Drop in:
observer = BehavioralObserver()


def serve(prompt, context):
    response = your_llm_call(prompt, context)
    observer.observe(prompt, response, context)
    return response
```

Two things worth knowing. The 5% sample rate catches degradation without adding latency; at high traffic, even 1% gives you a statistically robust signal. The baseline lock after 50 samples is deliberate: running behavioral chaos against an unlocked baseline is like running load tests before you've measured normal traffic.

5 Behavioral Chaos Experiments to Run After Your Next Infrastructure Suite

These aren't replacements for your existing chaos experiments. They're additive: run them after your infrastructure suite, with behavioral scoring as the exit criterion rather than uptime recovery.
| Experiment | What you inject | What it tests | Exit criterion |
| --- | --- | --- | --- |
| Stale embedding injection | Replace embeddings with a 14-day-old snapshot | Retrieval precision under stale index | Score within 5% of baseline across 50 sampled prompts |
| Partial index degradation | Remove 30% of documents from the vector store | Graceful degradation in retrieval recall | Hallucination rate stays flat vs. baseline |
| Context window truncation | Truncate retrieved context to 40% of normal | Reasoning quality under a constrained context | Groundedness score stays above threshold |
| Agent handoff latency injection | Add 800ms delay between agent hops | Multi-agent coherence under degraded comms | End-to-end intent preserved across all hops |
| Memory poisoning simulation | Inject one factually wrong document into the retrieval store | RAG faithfulness under adversarial data | The system identifies or flags the conflicting document |

Define the exit criterion before you inject the failure. That's the same discipline your infrastructure chaos practice demands for SLO-based rollback conditions; it applies here too.

What the Field Is Actually Saying

Vitaly Yago, CEO of PhotoGov, described the shift his team made after hitting this wall in production: "We began implementing chaos for behavior, not just for infrastructure. Instead of testing whether the system will recover, we test whether the quality of decisions is maintained under noise, data changes, and successive updates."

John Russo, VP of Healthcare Technology Solutions at OSP Labs, came to the same realization after behavioral degradation appeared in a clinical AI workflow that had passed every infrastructure check: "It is no longer just about systems staying up, it is about systems staying correct under stress."

Two engineers, two completely different industries, same conclusion.

The field has moved on from the question of whether AI systems survive failure. The question it's now wrestling with, without a good answer yet at scale, is whether they reason correctly after failure.
The chaos engineering discipline has fifteen years of hard-won tooling for testing the first question. It has almost nothing for the second. That's not a criticism of the existing tools. It's a signal that the discipline needs to grow a second layer. The practitioners whose experiences shaped this article are already building it in production, because the failures forced them to.

The only question for your team is whether you discover your agentic system's behavioral limits through a chaos experiment you designed, or a production incident you didn't see coming.

The Short Version: Three Things to Add Before Your Next Chaos Run

1. Lock a behavioral baseline first. Sample 50–100 representative inputs and store expected outputs before injecting any failure. Your chaos experiments now have a behavioral exit criterion, not just infrastructure recovery metrics.
2. Make retrieval precision a first-class signal. The most common failure vector across the teams I spoke with was RAG degradation invisible to standard monitoring. Retrieval precision scoring belongs alongside latency and error rate on your dashboards.
3. Log reasoning chains, not just outputs. For multi-agent pipelines, log the reasoning path each agent used to produce its output. When that structure changes without a deployment event triggering it, that's your behavioral alert, the equivalent of a latency spike, but for the quality of reasoning.
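A behavioral exit criterion such as "score within 5% of baseline across 50 sampled prompts" can be expressed as a small gate that runs after each chaos experiment. The sketch below is one possible reading, not a reference implementation: the function name is hypothetical, and it assumes "within 5%" means the mean post-chaos score may drop by at most 5% relative to the mean baseline score.

```python
from statistics import mean


def passes_behavioral_exit(baseline_scores, post_chaos_scores,
                           tolerance=0.05, min_samples=50):
    """Return True when post-chaos behavior stays within tolerance of baseline.

    Assumption: scores are higher-is-better floats from whatever scorer
    is in use (groundedness, faithfulness, LLM-as-judge, ...).
    """
    if len(post_chaos_scores) < min_samples:
        return False  # too few samples to declare the system stable
    baseline = mean(baseline_scores)
    if baseline <= 0:
        return False  # a zero baseline cannot anchor a relative comparison
    relative_drop = (baseline - mean(post_chaos_scores)) / baseline
    return relative_drop <= tolerance


# Illustrative numbers only.
baseline = [0.80] * 50
print(passes_behavioral_exit(baseline, [0.78] * 50))  # small drop
print(passes_behavioral_exit(baseline, [0.70] * 50))  # large drop
print(passes_behavioral_exit(baseline, [0.80] * 10))  # too few samples
```

Treating an undersized sample as a failure keeps the gate conservative: an experiment that cannot produce enough scored outputs does not get to pass by default.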

By Sayali Patil
Feature Flag Debt: Performance Impact in Enterprise Applications

Feature flags have become standard practice in enterprise applications, enabling teams to release code into production environments without exposing new features to users. As teams leverage feature flags to increase delivery velocity, technical debt accumulates. Left unchecked, this debt will slowly and silently impact application performance, maintainability, and developer productivity.

What Is Feature Flag Debt?

Feature flag debt occurs when feature flags are left in the codebase after they’ve served their purpose. The most common symptoms of feature flag debt include:

  • Dead code
  • Context switching for developers

Feature flag debt can go unnoticed because it typically doesn’t cause broken features. As a result, developers are often reluctant to clean up flags so they can focus on developing new features.

Impact on Performance

Feature flag debt can have serious consequences for application performance. In front-end applications, this is often overlooked. Once a feature flag has been introduced into a codebase, it incurs a long-term cost every time the application is loaded in the browser.

  • Larger JS bundles: Each feature flag adds logic to the application. When feature flags are not cleaned up, the associated code is typically not removed from the final bundled app. This means more code for users to download and more memory used on the client.
  • Reduced execution speed in client-side rendering: The browser must download, parse, and evaluate the entire bundle, even if certain code paths are never executed. This leads to slower parsing, longer load times, and slower interaction time.

Impact on Developer Productivity

Feature flag debt also negatively impacts developer productivity. Imagine having to read through an if/else statement that checks a feature flag that will never be true. Developers frequently encounter this scenario when working with feature flags. New engineers, in particular, often struggle to know which feature flags are safe to ignore.
Should they be commenting out this code? What if they need it later?

Why Aren’t Feature Flags Cleaned Up?

It should be standard practice to remove feature flags from the codebase once they’re no longer needed. However, they often become a long-term liability for the application for several reasons:

  • Nobody takes responsibility for cleaning up flags.
  • People are afraid to remove code.
  • There are no tools to help automate the process.
  • There’s always something more pressing to work on.

We often don’t see a defined feature flag lifecycle, which leads to indefinite accumulation.

Example of Feature Flag Debt

Here is how a feature typically looks when wrapped in a feature flag:

```javascript
const isAIAgentsFeatureFlagEnabled = isFeatureEnabled('ai-agents');

if (isAIAgentsFeatureFlagEnabled) {
  // lines of code
  // Code to run when the feature flag is enabled
} else {
  // lines of code
  // Code to run when the feature flag is disabled
}
```

When first implemented, this doesn’t look too bad. When this feature is rolled out to production, there’s still the safety net of keeping the original functionality should something go wrong. However, after the feature flag is turned on for everyone and the feature reaches general availability (GA), there is no reason to keep both pathways in the application.

The application still ships both pieces of code in the bundle, but only one will ever execute at runtime. The else block now represents dead code that will not get executed, but still takes up space in the bundle and adds to code complexity.

Manage and Eliminate Feature Flag Debt

Organizations need to take measures to prevent feature flag debt from slowing down their applications. Defining a feature flag lifecycle is a great place to start. By enforcing that each feature flag has a description, owner, status, and expiration date, the team can ensure flags aren’t left to become debt.
Treat feature flags as temporary and not part of the application's core architecture. When the feature is in GA, remove the flag and delete any code paths that are no longer needed. This results in a cleaner, more maintainable, and performant codebase.

```json
[
  {
    "feature_flag_name": "ai-agents",
    "description": "Feature flag that will allow AI agents to assist users with workflows and provide suggestions",
    "owner": "architecture crew",
    "status": "GA",
    "expiration_date": "2026-12-31"
  },
  {
    "feature_flag_name": "smart-checkout",
    "description": "Feature flag that will allow smart checkout features, including dynamic pricing, custom offers",
    "owner": "architecture crew",
    "status": "Dev",
    "expiration_date": "2026-12-31"
  },
  {
    "feature_flag_name": "ai-agents-eval",
    "description": "Feature flag to allow the evaluation framework to execute tests against AI agents to determine how accurate they are",
    "owner": "agent evaluation crew",
    "status": "QA",
    "expiration_date": "2026-10-12"
  },
  {
    "feature_flag_name": "experiment-recommendation-v2",
    "description": "Feature flag for experimenting v2 recommendation version",
    "owner": "agent evaluation crew",
    "status": "GA",
    "expiration_date": "2026-12-31"
  }
]
```

Having the feature flags stored in a format similar to the above can help identify who to contact to clean up old flags.

Performance Gains From Cleanup

Removing unused feature flags reduces bundle size and eliminates unnecessary code execution, resulting in faster load times, improved rendering performance, and a cleaner codebase.

Conclusion

For most enterprise applications, feature flags aren’t the problem; it’s forgetting to take them down. As the application grows over time, old feature flags accumulate, which will silently bloat the bundle size, degrade performance, and clutter the code.
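A flag registry in the JSON shape used in this article lends itself to an automated cleanup report. The sketch below is illustrative only: the function name is hypothetical, and the rule it applies (a flag is due for cleanup once its status is "GA" or its expiration date has passed) is an assumption drawn from the lifecycle advice above, not an established tool's behavior.

```python
import json
from datetime import date


def flags_due_for_cleanup(registry_json, today):
    """Return (flag name, owner) pairs for flags that should be removed.

    Assumed rule: a flag is due when its status is "GA" or when its
    ISO-formatted expiration_date is earlier than `today`.
    """
    due = []
    for flag in json.loads(registry_json):
        expired = date.fromisoformat(flag["expiration_date"]) < today
        if flag["status"] == "GA" or expired:
            due.append((flag["feature_flag_name"], flag["owner"]))
    return due


# Three entries taken from the registry example, descriptions omitted.
registry = json.dumps([
    {"feature_flag_name": "ai-agents", "owner": "architecture crew",
     "status": "GA", "expiration_date": "2026-12-31"},
    {"feature_flag_name": "smart-checkout", "owner": "architecture crew",
     "status": "Dev", "expiration_date": "2026-12-31"},
    {"feature_flag_name": "ai-agents-eval", "owner": "agent evaluation crew",
     "status": "QA", "expiration_date": "2026-10-12"},
])

for name, owner in flags_due_for_cleanup(registry, today=date(2026, 11, 1)):
    print(f"{name}: contact {owner}")
```

Run on a schedule or in CI, a report like this gives each stale flag a named owner, which addresses the "nobody takes responsibility" problem directly.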

By Poornakumar Rasiraju
When Perfect Data Breaks: The Journey from Data Quality to Data Observability

The Day Everything Looked Fine, Until It Wasn’t

The dashboards were green. Every test passed. And yet, by morning, the company’s revenue had mysteriously dropped by roughly $1 million. The data team huddled together, blinking at their screens.

  • Schema checks? It looked good.
  • Nulls? Checks passed, and everything appeared to be in order.
  • Completeness? It looked good.

Nothing looked wrong, except that something was causing the business to bleed. What they didn’t know yet was that an innocent iOS app update had quietly scrambled the order of user events. To the system, customers were suddenly purchasing before browsing. The models didn’t break in code; they broke in meaning. The team discovered a crucial lesson: even flawless data systems can mislead without true observability.

Why “Good Data” Isn’t Good Enough Anymore

There was a time when data quality was the gold standard and a measure of success. Passing DQ checks meant your dataset was protected. If your dataset was clean, complete, and validated, your insights were gold. But that was back when pipelines were simple, ETL jobs ran once a night, and life was predictable.

Back then, most data was read by people, not systems. Analysts looked at dashboards after the fact, asked questions when numbers felt off, and applied judgment before anyone made a real decision. If a table landed late or a metric looked strange, someone usually noticed, often before it caused real damage. Data quality checks were designed for this world: static, batch-oriented, and tolerant of human interpretation. But as technology changed, so did expectations.

Today’s world is different. This shift matters most for data engineers, analytics engineers, and platform teams responsible for the reliability of downstream dashboards, APIs, and machine learning systems. Modern cloud-native companies run thousands of interdependent batch and streaming pipelines, constantly feeding dashboards, APIs, and machine learning systems.
A single column rename, a delayed partition, or an unnoticed schema tweak can quietly throw everything off course. Traditional data quality is like checking your car’s oil once a month. Data observability is like installing a dashboard that alerts you in real time when the engine is overheating.

The Shift: From Data Quality to Data Observability

Data quality answers the question: “Is this dataset correct right now?” Data observability asks something deeper: “Is my data behaving as it should?”

| Aspect | Data Quality | Data Observability |
| --- | --- | --- |
| Focus | Data-at-rest | Data-in-motion |
| Checks | Accuracy, completeness, validity | Freshness, volume, distribution, schema, lineage |
| When | Point-in-time | Continuous |
| Goal | Ensure correctness | Ensure reliability |
| View | Local | End-to-end |

The Five Pillars of Data Observability

  • Freshness: Is data arriving on time relative to SLAs?
  • Volume: Are record counts within expected ranges?
  • Distribution: Have key statistics (e.g., averages, percentiles) drifted unexpectedly?
  • Schema: Did upstream fields change without notice?
  • Lineage: What depends on what, and who owns it?

Together, these pillars act as an early-warning system for your data ecosystem, sensing changes before they cause downstream impact.

The Story Behind the $1M Drop

Our e-commerce company’s recommendation engine accounted for 40% of revenue. After a routine app update, click-throughs fell by 15%, conversions by 22%, and revenue tumbled. And yet, all quality checks still passed.

| Check | Status | Missed Insight |
| --- | --- | --- |
| Schema | ✅ | Timestamps changed meaning |
| Nulls | ✅ | Events arrived out of sequence |
| Ranges | ✅ | Valid values, wrong order |

Data quality confirmed the structure. It missed the story.

Event order sounds like a minor detail, but for recommendation models, it’s foundational. Browsing before purchasing means something very different than purchasing before browsing. When that sequence flipped, nothing crashed; the model simply learned the wrong story about customers.
Since the data remained complete, valid, and schema-compliant, every traditional check passed, even as the model’s understanding of user behavior quietly unraveled.

The Hidden Issue

The iOS app began batching events. They arrived six hours late and out of order.

| Before (Healthy) | After (Broken) |
| --- | --- |
| View → Add to Cart → Purchase | Purchase → View → Add to Cart |

The model interpreted chaos as logic, and that’s when recommendations became noise.

How Observability Would Have Saved the Day

Within two hours, an observability system would have screamed:

  • Freshness Alert: Event lag jumped from 5 mins to 360 mins
  • Distribution Alert: 78% of events out of sequence
  • Lineage Alert: iOS v1.3.0 deployed, impacting 47 tables and degrading 12 ML models

| Approach | Detection | Root Cause | Resolution Time |
| --- | --- | --- | --- |
| Data Quality | Missed | Undetected | 3 days |
| Data Observability | Caught early | iOS v1.3.0 deployment | 6 hours |

Observability didn’t just find the broken data; it connected the dots to the moment things went wrong. The real win wasn’t just catching the issue faster. It was knowing exactly what changed, when it changed, and how far the damage spread. That made it possible to roll back quickly and explain what happened without guesswork. Without observability, teams debate symptoms. With it, they start acting on causes.

Building Observability Step by Step

So how does a modern data team move from reactive firefighting to proactive confidence?

1. Define Data Contracts

Every dataset has a clear, versioned schema (YAML, Avro, Protobuf). Contracts live in code and are validated automatically before pipelines run and before new data is added to the dataset. Data contracts are often the first thing teams skip. They feel slow, bureaucratic, and unnecessary, right up until a breaking change slips through and every downstream table starts lying.

2. Add Freshness & Volume Monitors

Track how long data takes to arrive and whether counts fall outside norms. Each row's updated-at timestamp should fall within the defined SLO.
Define SLOs such as “99% of partitions land within 10 minutes.” Without explicit SLAs, delays are only discovered after dashboards update or don’t. By then, decisions have already been made on stale data.

3. Strengthen Tests

Layer dbt checks for `not_null` and `unique` with drift tests, e.g., “average session_length stays within 10% of baseline,” or “count of new orders placed stays within 10% of the baseline.” Basic checks are good at catching broken tables, but they don’t tell you when data starts behaving differently. Drift tests exist for the uncomfortable cases where everything looks valid but isn’t.

4. Emit Lineage

Integrate OpenLineage with Airflow or dbt to visualize dependencies and trace impact instantly. Without lineage, every alert triggers a manual investigation. With it, teams can immediately see blast radius and ownership.

5. Centralize Visibility

Bring all signals into one pane of glass. When freshness lives in one tool, lineage in another, and alerts in Slack, every incident turns into a scavenger hunt. Pulling those signals together is what turns alerts into answers. Now, when an alert fires, you know what broke, where, and who’s responsible.

A Familiar Pattern

If this story sounds familiar, it’s because it’s happening everywhere.

  • Teams at Netflix have described recommendation quality degrading after upstream data schemas changed without downstream safeguards.
  • Uber has publicly discussed timezone-related bugs that impacted time-based systems, including pricing and incentives.
  • Airbnb has shared incidents where aggressive deduplication and data-cleaning logic removed valid records.
  • Stripe has written extensively about how tiny currency-rounding errors can quietly compound into material financial discrepancies at scale.

Different problems, same root cause: great data quality, no visibility.

Let’s Distill the Lesson: Quality Validates. Observability Protects.

Data quality ensures your data is correct.
Data observability ensures your system stays trustworthy. In today’s interconnected world, where every pipeline is a domino, observability isn’t a luxury; it’s a seatbelt. So the next time your dashboard shows that comforting little green badge labeled “Fresh & Verified,” remember: behind that glow lies a safety net of observability quietly keeping your business upright.
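The freshness, sequence, and drift checks described in the steps above are small enough to sketch directly. The code below is a minimal illustration, not a production monitor: the event fields (`user_id`, `occurred_at`, `arrived_at`), the function names, and the sample events are assumptions made for this example.

```python
from datetime import datetime


def freshness_lag_minutes(events):
    """Largest gap, in minutes, between when an event happened and when it arrived."""
    return max(
        ((e["arrived_at"] - e["occurred_at"]).total_seconds() / 60 for e in events),
        default=0.0,
    )


def out_of_sequence_ratio(events):
    """Share of events that arrived after a later event from the same user."""
    latest_seen = {}  # user_id -> latest occurred_at among events already arrived
    out_of_order = 0
    for e in sorted(events, key=lambda e: e["arrived_at"]):
        previous = latest_seen.get(e["user_id"])
        if previous is not None and e["occurred_at"] < previous:
            out_of_order += 1
        else:
            latest_seen[e["user_id"]] = e["occurred_at"]
    return out_of_order / len(events) if events else 0.0


def within_baseline(current, baseline, tolerance=0.10):
    """Drift test: is the current statistic within tolerance of its baseline?"""
    return abs(current - baseline) <= tolerance * abs(baseline)


# Hypothetical batch: events happen View -> Add to Cart -> Purchase,
# but arrive about six hours late with the purchase first.
day = datetime(2025, 1, 1)
events = [
    {"user_id": "u1", "type": "purchase",
     "occurred_at": day.replace(hour=10, minute=5), "arrived_at": day.replace(hour=16, minute=5)},
    {"user_id": "u1", "type": "view",
     "occurred_at": day.replace(hour=10, minute=0), "arrived_at": day.replace(hour=16, minute=6)},
    {"user_id": "u1", "type": "add_to_cart",
     "occurred_at": day.replace(hour=10, minute=2), "arrived_at": day.replace(hour=16, minute=7)},
]

print(freshness_lag_minutes(events))
print(out_of_sequence_ratio(events))
print(within_baseline(current=115, baseline=100))
```

Every event in this batch would pass schema, null, and range checks, yet both the lag and the sequence ratio move sharply, which is the kind of signal the data quality checks in the story could not produce.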

By Divyakumar Savla
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

```shell
sudo ingero trace --node gpu-node-01
```

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended.

For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML
    fleet:
      nodes:
        - gpu-node-01:8080
        - gpu-node-02:8080
        - gpu-node-03:8080
        - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell
    $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
        "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"

    node             source cnt   avg_us
    ---------------- ------ ----- ------
    gpu-node-01      4      11009 5.2
    gpu-node-01      3      847   18400   # ← 9x higher than peers
    gpu-node-02      4      10892 5.1
    gpu-node-02      3      412   2100
    gpu-node-03      4      10847 5.3
    gpu-node-03      3      398   1900
    gpu-node-04      4      10901 5.0
    gpu-node-04      3      421   2200

    8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
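To make the fan-out pattern and the "9x higher than peers" reading concrete, here is a minimal Python sketch. It is not Ingero's implementation (the fleet client is written in Go and talks HTTPS): `query_node`, `fan_out`, `stragglers`, the in-memory `FAKE_RESULTS` table, and the 3x flagging factor are all hypothetical names and values chosen for illustration. Only the row values are copied from the output above.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Hypothetical stand-in for each node's query response: (source, cnt, avg_us) rows.
# The numbers are copied from the example output above.
FAKE_RESULTS = {
    "gpu-node-01": [(4, 11009, 5.2), (3, 847, 18400)],
    "gpu-node-02": [(4, 10892, 5.1), (3, 412, 2100)],
    "gpu-node-03": [(4, 10847, 5.3), (3, 398, 1900)],
    "gpu-node-04": [(4, 10901, 5.0), (3, 421, 2200)],
}


def query_node(node):
    """Pretend to run the query on one node (a real client would make an HTTPS call)."""
    return FAKE_RESULTS[node]


def fan_out(nodes):
    """Query every node in parallel and prepend the node name to each returned row."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = list(pool.map(query_node, nodes))  # map() preserves input order
    return [(node, *row) for node, rows in zip(nodes, per_node) for row in rows]


def stragglers(rows, source, factor=3.0):
    """Flag nodes whose avg_us for `source` exceeds `factor` times the median of their peers."""
    by_node = {node: avg for node, src, _cnt, avg in rows if src == source}
    flagged = {}
    for node, avg in by_node.items():
        peers = [value for other, value in by_node.items() if other != node]
        ratio = avg / median(peers)
        if ratio > factor:
            flagged[node] = round(ratio, 1)
    return flagged


rows = fan_out(sorted(FAKE_RESULTS))
print(len(rows), "rows from", len(FAKE_RESULTS), "node(s)")
print(stragglers(rows, source=3))  # {'gpu-node-01': 8.8}
```

Comparing each node against the median of its peers, rather than the fleet mean, keeps one slow node from dragging the baseline up and hiding itself. The 8.8x ratio is the figure the annotation above rounds to 9x.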
One more command to see the causal chains:

Shell
    $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

    FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

    [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
      Root cause: 847 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

    [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
      Root cause: 855 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O, with checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints: there are real reasons the network path isn’t always available.

For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell
    # 1. Collect traces from each node
    scp gpu-node-01:~/.ingero/ingero.db node-01.db
    scp gpu-node-02:~/.ingero/ingero.db node-02.db

    # 2. Merge and analyze
    ingero merge node-01.db node-02.db -o cluster.db
    ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
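As a rough illustration of why node-namespaced IDs and hash-keyed stacks make the merge safe, here is a small Python sketch. The data layout (a dict of events and a dict of stacks per node), the hash strings, and the function name `merge_nodes` are assumptions made for this example; the real ingero merge operates on SQLite files whose schema is not shown in this article.

```python
def merge_nodes(node_dbs):
    """Union per-node data: event IDs must never collide, stacks are stored once per hash."""
    merged_events, merged_stacks = {}, {}
    for db in node_dbs:
        for event_id, stack_hash in db["events"].items():
            if event_id in merged_events:
                # Cannot happen when every ID carries its node prefix.
                raise ValueError(f"duplicate event id: {event_id}")
            merged_events[event_id] = stack_hash
        for stack_hash, frames in db["stacks"].items():
            # Identical stacks from different nodes share one entry.
            merged_stacks.setdefault(stack_hash, frames)
    return merged_events, merged_stacks


# Hypothetical per-node contents; both nodes happen to use the local counter 4821.
node_01 = {
    "events": {"gpu-node-01:4821": "a1f3", "gpu-node-01:4822": "9c0e"},
    "stacks": {"a1f3": ["cuLaunchKernel", "train_step"],
               "9c0e": ["cuMemAlloc", "train_step"]},
}
node_02 = {
    "events": {"gpu-node-02:4821": "a1f3"},
    "stacks": {"a1f3": ["cuLaunchKernel", "train_step"]},
}

events, stacks = merge_nodes([node_01, node_02])
print(len(events), "events,", len(stacks), "unique stacks")
```

Without the node prefix, the two events numbered 4821 would overwrite each other. With it, the merge is a plain union.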
Why We Built It This Way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb: the well-trodden path. We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure: the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble, and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query (3 samples per node, median filter). On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds no extra latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate.
The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-Driven Fleet Investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

Python
    query_fleet(action="chains", since="5m")

    Fleet Chains: 2 chain(s)
    [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
    [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O

That’s the complete response. An AI assistant gets this back from one tool call, with no SSH access to each node and no manual SQL.

The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where This Stands

v0.9.1 is the initial step in cluster-level tracing, not the destination.

What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export: these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection, with more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.

We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.
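For readers curious about the NTP-style offset estimation mentioned earlier (3 samples per node, median filter), here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not the fleet client's code: the function name `estimate_offset`, the tuple layout, and the sample timestamps are made up, and the sketch assumes the remote timestamp is taken halfway through each round trip.

```python
from statistics import median


def estimate_offset(samples):
    """Estimate remote-minus-local clock offset from (t_send, t_remote, t_recv) samples.

    t_send and t_recv are read from the local clock, t_remote from the remote clock.
    Each sample assumes the remote read happened at the midpoint of the round trip;
    the median discards a sample distorted by a slow or asymmetric round trip.
    """
    offsets = []
    for t_send, t_remote, t_recv in samples:
        midpoint = (t_send + t_recv) / 2
        offsets.append(t_remote - midpoint)
    return median(offsets)


# Three made-up samples in milliseconds; the third has a distorted remote reading.
samples_ms = [
    (1000, 5003, 1002),
    (2000, 6004, 2004),
    (3000, 7020, 3002),
]
print(estimate_offset(samples_ms))  # 4002.0
```

Because CLOCK_MONOTONIC counts from each machine's own boot, the absolute offset between two nodes can be large. What matters for correlation is that it is measured and then subtracted consistently.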
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

  • sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db: individual node traces from a 3-node cluster
  • sample-cluster.db: all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading

  • GPU incident response in 60 seconds with eBPF: single-node investigation workflow that the fleet feature extends
  • 11-second time to first token on a healthy vLLM server: kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
  • GPU showing 97% utilization while training runs 3x slower: why nvidia-smi metrics alone miss the real story

By Ingero Team
A Scalable Framework for Enterprise Salesforce Optimization: Turning Outcomes Into an Operating System

Large Salesforce programs often ship features without moving the metrics that matter. This article presents a five‑layer operating model (intake, process/data contracts, configuration‑first delivery, risk‑aligned releases, and telemetry‑driven adoption) that helps software delivery teams and product leaders consistently achieve double‑digit improvements in cycle time and operational efficiency in regulated, multi‑cloud environments.

Who This Article Is For

  • Product Owners / Technical Program Leads who own value realization for Salesforce initiatives.
  • Architects / Platform Owners driving org hygiene, multi‑cloud consistency, and integration stability.
  • BA/QA Leads responsible for acceptance criteria, test design, and change traceability.

Why Big Salesforce Programs Underperform (and How to Fix Them)

Most programs stall for business reasons, not technical ones:

  • Ad‑hoc intake → Priorities are shaped by volume and urgency rather than measurable value.
  • Process drift → Local variants multiply; reporting becomes unreliable.
  • Output‑centric governance → Teams celebrate story points, not cycle time, first‑time‑right, or adoption.
  • One‑and‑done enablement → Users are told, not enabled; behavior doesn’t change, value doesn’t land.

These patterns appeared, despite industry differences, on programs I supported at a federal home loan bank, a major academic medical center, a Fortune‑ranked healthcare distributor, and a global manufacturer/partner ecosystem. The antidote is an outcomes‑first operating system that is simple to run, easy to audit, and fast to scale.

The Five‑Layer Operating Model (Business/Functional Edition)

  1) Unified Intake With Measurable Outcomes
  2) Process Blueprint + Data Contract
  3) Configuration‑First Delivery (with narrow, justified exceptions)
  4) Risk‑Aligned Release & Change Governance
  5) Adoption, Telemetry, and Monthly Value Reviews

Think of these as five standing conversations led by product and process owners. You’ll iterate across all five in parallel.
1) Unified Intake With Measurable Outcomes

What changes: Replace scattered requests with a single backlog (run by Product/PMO) where every item carries a baseline and a target metric (e.g., “Reduce opportunity creation time from 3:00 to 0:20 for frontline sellers”).

Why it works: Scope trade‑offs become rational when tied to a metric leadership cares about. This discipline preceded a ~90% reduction in opportunity creation time in a banking program because the team optimized toward a number, not a feature list.

Deliverables: Intake template (baseline, target, personas, dependencies); quarterly objective slate with two outcome KPIs.

2) Process Blueprint + Data Contract

What changes: Before configuration, business owners align on the future process and data contract: required fields, allowed values, ownership, lineage, and service‑level expectations across systems.

Why it works: Deterministic process and data decisions prevent local variants that destroy reporting and controls. At a major healthcare provider, this clarity contributed to 20–30% improvements in execution efficiency by eliminating rework and stabilizing hand‑offs.

Deliverables: One‑page process map, data dictionary for key objects, RACI for data ownership, event boundaries (who creates/updates what, when).

3) Configuration‑First Delivery (With Narrow, Justified Exceptions)

What changes: Default to configuration patterns (record types, dynamic forms, orchestration, assignment rules) and reuse shared building blocks. Escalate to customization only when a regulatory, performance, or logic boundary requires it, and only when tied to an approved outcome.

Why it works: Config‑first keeps the org maintainable, enables faster iteration, and reduces total cost of ownership. On experience programs, this discipline enabled a 40% increase in partner engagement and a 70% reduction in manual entry, because teams could release smaller improvements frequently and keep them consistent.
Deliverables: Configuration‑first charter; exception log with business justification; reuse catalog (what already exists that we can extend).

4) Risk‑Aligned Release & Change Governance

What changes: Move to predictable release trains (e.g., every two weeks) with UAT scripts tied to the outcome metrics defined at intake. In regulated contexts, incorporate change advisory inputs and rollback plans. Separate feature deployment from enablement (e.g., role‑based activation, staged access).

Why it works: Predictability reduces fire drills and protects operations. In financial services, hardening releases and integration touchpoints with core platforms allowed operations to realize a ~30% efficiency improvement due to fewer errors and rework.

Deliverables: Release calendar, outcome‑mapped UAT pack, change checklist, enablement toggle plan.

5) Adoption, Telemetry, and Monthly Value Reviews

What changes: Treat adoption and measurement as part of the work. Provide role‑specific enablement (micro‑videos, checklists, guided tours). Stand up dashboards that track the two objectives selected each quarter (e.g., cycle time, first‑time‑right, utilization by persona). Hold a monthly value review to compare baseline vs. actual and re‑prioritize.

Why it works: When value is visible, stakeholders align quickly, and teams get cover to simplify instead of endlessly bolting on. This cadence supported 25% faster delivery on subsequent releases, because the backlog reflected telemetry, not anecdotes.

Deliverables: Adoption plan by persona, live dashboard spec, monthly value review agenda.

Conclusion

Enterprise Salesforce delivery thrives when software development is governed by measurable outcomes, deterministic processes and data, configuration‑first design, predictable releases, and telemetry‑led adoption. This five‑layer operating model turns ambiguous demand into testable change and compounds improvements across quarters.
Start with one journey, set two KPIs, and let the evidence guide your next sprint.

By Pulkit Singhal
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch

A DynamoDB throttle alarm fires at 2 am. You confirm the spike in CloudWatch, then check ElastiCache in a second dashboard, then Redshift in a third. Cache hit rate dropped, which hammered DynamoDB, which stalled the zero-ETL export. Three services, three dashboards, one cascade you can only trace by hand.

This guide maps the specific metrics, alarm thresholds, and configuration steps for each service, and then addresses the observability delta that CloudWatch leaves unresolved: cross-service correlation, root-cause traceability, and the capacity-planning intelligence that prevents cascades in the first place.

What CloudWatch Gives You Across DynamoDB, ElastiCache, and Redshift

Prerequisites: The CLI examples and alarm configurations in this guide assume AWS CLI v2, an IAM principal with cloudwatch:GetMetricData, cloudwatch:PutMetricAlarm, and dynamodb:UpdateContributorInsights permissions, and active DynamoDB tables, ElastiCache clusters, or Redshift clusters in your account.

CloudWatch publishes metrics for all three services under service-specific namespaces. Per the AWS CloudWatch documentation, metric retention runs in three tiers: 1-minute data points retained for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days.

Namespace | Category | Key Metrics
--- | --- | ---
AWS/DynamoDB | Capacity | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
AWS/DynamoDB | Latency | SuccessfulRequestLatency (p50, p99)
AWS/DynamoDB | Health | SystemErrors
AWS/ElastiCache | Efficiency | CacheHitRate, Evictions
AWS/ElastiCache | Memory | DatabaseMemoryUsagePercentage
AWS/ElastiCache | Connections | CurrConnections, ReplicationLag
AWS/Redshift | Performance | QueryDuration, QueryQueueTime
AWS/Redshift | Workload | WLMQueueLength (per queue)
AWS/Redshift | Resources | CPUUtilization, ReadIOPS, WriteIOPS

For most post-incident investigations, you’ll hit the granularity boundary within two weeks.
A throttle spike that lasted 4 minutes on day 17 shows up as a single 5-minute average data point, frequently indistinguishable from normal traffic variation. The per-custom-metric cost also compounds at scale: an account running 40 DynamoDB tables, 6 ElastiCache clusters, and 3 Redshift clusters with per-resource custom alarms can accumulate hundreds of CloudWatch metrics across namespaces, each costing $0.30/month to store and $0.10/alarm/month to evaluate.

Each namespace provides enough signal to diagnose its own service, but CloudWatch publishes no native cross-service correlation mechanism. A ThrottledRequests spike in AWS/DynamoDB and a CacheHitRate collapse in AWS/ElastiCache at the same timestamp are both visible, but connecting them as cause and effect requires a human to match timestamps across dashboards.

DynamoDB: Throttling Detection, Partition Health, and Capacity Mode Decisions

DynamoDB throttling is rarely a single-metric problem. A throttle alarm tells you capacity was exceeded, but not whether the cause is a hot partition, an undersized provisioned table, or a traffic pattern that outgrew your capacity mode. The subsections below work through that diagnostic sequence: the metrics that surface the symptom, the tooling that pinpoints the partition, and the capacity decision that prevents recurrence.

Core Metrics and Alarm Thresholds

The DynamoDB CloudWatch metric namespace publishes table-level aggregates.
For provisioned-capacity tables, these five metrics drive operational decisions:

Metric | Unit | Recommended Alarm Threshold | Notes
--- | --- | --- | ---
ThrottledRequests | Count | > 0 (provisioned mode) | Any throttling on a provisioned table means capacity is misconfigured or a hot partition is concentrating load
SuccessfulRequestLatency p99 | Milliseconds | > 10ms (read-heavy workloads); > 20ms (mixed) | p99 > 10ms on reads is a practitioner-recommended leading indicator of partition pressure before throttles appear
ConsumedReadCapacityUnits | Count/second | > 80% of provisioned RCUs | Signals you’re approaching throttle territory
ConsumedWriteCapacityUnits | Count/second | > 80% of provisioned WCUs | Same logic for write-heavy workloads
SystemErrors | Count | > 0 | Indicates DynamoDB service-side failures, distinct from capacity limits

Practitioner-recommended starting points. Tune to your workload characteristics.

ThrottledRequests at table level confirms that throttling happened, but tells you nothing about which partition caused it. On a table with millions of items, a single access pattern (a user ID acting as a partition key hot spot, for instance) can drive 95% of throttles while aggregate consumed capacity looks healthy. DynamoDB Contributor Insights resolves this.

Contributor Insights for Hot Partition Detection

DynamoDB Contributor Insights surfaces the top-N most-accessed partition keys and sort keys in real time. It identifies the specific items driving throttling or high latency that pure CloudWatch metric aggregation can’t surface. Enabling it on a production table with significant traffic incurs cost (priced per request evaluated), but during a throttle incident, Contributor Insights gives you the specific key value generating excess load rather than an aggregate curve.
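To make the "top-N contributor" idea concrete, here is a toy Python sketch of the underlying arithmetic. It does not call AWS and is not how Contributor Insights is implemented. The function name `top_contributors` and the synthetic list of throttled partition keys are assumptions made for illustration.

```python
from collections import Counter


def top_contributors(keys, n=3):
    """Return (key, count, share_of_total) for the n most frequent keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count, count / total) for key, count in counts.most_common(n)]


# Synthetic sample: 100 throttled requests, 95 of them against a single partition key.
throttled_keys = ["user#42"] * 95 + ["user#7"] * 3 + ["user#9"] * 2

for key, count, share in top_contributors(throttled_keys):
    print(f"{key}: {count} throttles ({share:.0%})")
```

A table-level ThrottledRequests count of 100 looks the same whether the load is spread evenly or concentrated like this. The per-key breakdown is what separates a capacity problem from a key-design problem.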
Enable it from the DynamoDB console under the table’s “Monitor” tab, or via CLI (requires AWS CLI v2+):

Plain Text
    aws dynamodb update-contributor-insights \
        --table-name YOUR_TABLE_NAME \
        --contributor-insights-action ENABLE

Once active, CloudWatch Logs Insights receives partition-level data within minutes. Query the top-10 most-accessed partition keys over the past hour to confirm whether a hot key is generating the throttle alarm:

Plain Text
    filter @message like /ContributorInsights/
    | stats count(*) as accessCount by partitionKey
    | sort accessCount desc
    | limit 10

Capacity Mode Decision Logic

The decision between provisioned and on-demand capacity modes depends on traffic predictability. Use a 7-day ConsumedCapacityUnits trend as your input signal:

  • If consumed capacity stays below 80% of provisioned capacity and follows a consistent daily pattern, stay on provisioned. Set auto-scaling target utilization at 70% of provisioned capacity to leave headroom for traffic spikes before throttling begins.
  • If consumed capacity regularly exceeds 80% of provisioned, or if usage patterns show irregular spikes with no predictable shape, on-demand mode eliminates throttling risk at a higher per-request cost.

Teams running the DynamoDB zero-ETL integration with Redshift (GA October 2024) face a different monitoring angle from streaming replication. The integration operates via periodic incremental exports every 15 to 30 minutes, so source table latency doesn’t affect export timing. The primary constraint on analytics data freshness is export completion status, visible in the Redshift console under the integration view. Export failures are the leading indicator of stale analytics data.

ElastiCache: Cache Efficiency, Memory Pressure, and the Valkey 8.0 Observability Upgrade

When cache hit rate drops, the blast radius extends beyond ElastiCache.
Every cache miss becomes a direct read against your origin datastore, and if that origin is a DynamoDB table already running near provisioned capacity, you get the throttle cascade from the introduction. The metrics below separate cache-level symptoms from the memory and replication signals that predict them, followed by the observability improvements Valkey 8.0 brings.

Redis and Valkey Metrics

Per the ElastiCache CloudWatch documentation, the metrics that drive operational decisions for Redis and Valkey deployments are:

Metric | Target | Alert Threshold | Action
--- | --- | --- | ---
CacheHitRate | >= 0.95 | < 0.90 | Investigate at < 0.90; below 0.80 indicates a significant access pattern change or deployment that altered cache key patterns
Evictions | ~0 (steady state) | > 100/min sustained | Sustained evictions mean maxmemory-policy is evicting live data under memory pressure
DatabaseMemoryUsagePercentage | < 70% | Alert at > 75%; scale-out at > 85% | Alert at 75% gives runway to analyze dataset growth; above 85% triggers automatic evictions under most policies
ReplicationLag | < 100ms | > 500ms | Replica lag at this level affects read scaling reliability
CurrConnections | Workload-specific | > 80% of max allowed | Persistent near-limit connections indicate a connection pool misconfiguration or application-side leak

Practitioner-recommended starting points based on operational experience.

Memcached deployments within ElastiCache expose a different metric set through the same AWS/ElastiCache namespace: get_hits and get_misses (from which you derive hit rate), evictions, and bytes_used vs. limit_maxbytes. Valkey and Redis are cluster-based architectures with native replication, while Memcached is a horizontally partitioned cache with no native replication. Applying Redis/Valkey thresholds to Memcached deployments produces misleading alarms.

Valkey 8.0 Observability Additions

The open-source Valkey 8.0 release shipped from the Linux Foundation on September 16, 2024.
Amazon ElastiCache 8.0 for Valkey launched on November 21, 2024, bringing four observability primitives that prior Redis OSS metrics on ElastiCache didn’t expose.

Per-slot metrics let you identify which hash slots carry disproportionate traffic across a cluster. Before Valkey 8.0, CloudWatch surfaced per-node and per-cluster aggregates only. A slot-level throughput imbalance (common after a key pattern change in the application layer) was invisible until it produced node-level CPU or memory pressure. With per-slot metrics, you detect the asymmetry before it cascades to node-level saturation.

Per-client event loop latency tracks how long each client connection waits in the event loop queue. This directly diagnoses client-specific throughput asymmetries. If one application service has a misconfigured connection pool producing tail latency that appears as a CacheHitRate degradation from another service’s perspective, per-client event loop latency identifies the offending client specifically rather than surfacing a cluster-level aggregate that implicates everything.

Rehash memory tracking quantifies the temporary memory overhead during cluster rescaling. When you add nodes to an ElastiCache Valkey cluster, the rehashing process requires holding two copies of some hash-slot data in memory simultaneously. Before this metric, a DatabaseMemoryUsagePercentage spike during a scale-out event was ambiguous. With rehash memory tracking, you can confirm the spike is transient rehash overhead and dismiss the alarm as expected behavior rather than a capacity problem.

Traffic breakdowns split read, write, and key expiry operations at the slot and node level. This replaces the single-dimensional throughput view that prior ElastiCache Redis metrics provided and enables you to identify whether a throughput increase is driven by reads, writes, or expiry churn without writing custom instrumentation.

Valkey 8.1, released April 2, 2025, adds further observability improvements.
Verify ElastiCache 8.1 availability in your region at the time of deployment, as managed service version availability can trail the open-source release by several weeks.

Redshift: Query Performance, WLM Configuration, and Enhanced Monitoring

Redshift performance problems tend to look identical from the outside: queries slow down. Whether the cause is CPU saturation, WLM slot exhaustion, or a bad query plan requires different metrics and different responses. The thresholds below separate those conditions, followed by the Enhanced Query Monitoring tooling that replaced the manual system-table workflow for root-cause diagnosis.

Key CloudWatch Metrics and WLM Thresholds

Metric | Recommended Threshold | Action
--- | --- | ---
CPUUtilization | Alert at > 80% | Investigate active query plans if sustained; evaluate concurrency scaling if combined with queue depth
WLMQueueLength (per queue) | Alert at > 3; escalate at > 5 sustained for 60 seconds | WLMQueueLength > 5 sustained over 60 seconds combined with CPUUtilization > 85% is a practitioner-recommended trigger for enabling a Redshift concurrency scaling cluster
QueryQueueTime | > 30 seconds | Queries waiting over 30 seconds indicate WLM queue saturation or slot misconfiguration
QueryDuration | 2x the 7-day p95 baseline for that WLM queue | Baseline drift detection for workload-specific thresholds
ReadIOPS | Cluster baseline | Sharp ReadIOPS spikes without a corresponding query load increase can indicate full-table scans or missing sort key filters

The WLMQueueLength threshold requires context to interpret correctly. A WLMQueueLength of 5 on a queue allocated 5 concurrency slots means every slot is occupied and the queue is at capacity. Combined with CPUUtilization above 85%, adding concurrency scaling capacity is the right response. WLMQueueLength of 5 with CPUUtilization at 40% points to a slot allocation problem: queries are queuing behind slot limits rather than behind compute saturation, and the fix is WLM reconfiguration, not additional nodes.
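The CPU-versus-queue reading above can be written down as a small decision function. This is a sketch of the article's heuristic, not an AWS API. The function name `diagnose_redshift`, the returned labels, and the choice to treat 85% CPU as the single dividing line between the two saturated-queue cases are assumptions; the text only gives the 85% and 40% examples.

```python
def diagnose_redshift(wlm_queue_length, cpu_utilization,
                      queue_alert=3, queue_saturated=5,
                      cpu_alert=80.0, cpu_saturated=85.0):
    """Map a (WLMQueueLength, CPUUtilization) reading to a suggested next step."""
    if wlm_queue_length >= queue_saturated:
        if cpu_utilization > cpu_saturated:
            # Queue is full and compute is saturated: add capacity.
            return "enable_concurrency_scaling"
        # Queue is full but compute has headroom: slots are the bottleneck.
        return "reconfigure_wlm_slots"
    if wlm_queue_length > queue_alert:
        return "alert_queue_depth"
    if cpu_utilization > cpu_alert:
        return "investigate_query_plans"
    return "ok"


print(diagnose_redshift(5, 90))  # enable_concurrency_scaling
print(diagnose_redshift(5, 40))  # reconfigure_wlm_slots
```

Keeping the thresholds as keyword arguments makes it easy to tune them per WLM queue, which the article recommends doing once you have observed your own workload.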
Historically, diagnosing slow Redshift queries required direct access to system tables. A typical workflow queried STL_QUERY for execution times, joined to SVL_QUERY_METRICS for resource usage per execution step, and cross-referenced SVL_QUERY_SUMMARY for operator-level plan details. This three-step workflow required SQL client access, familiarity with the Redshift internal catalog schema, and significant manual correlation work.

Redshift Enhanced Query Monitoring

Redshift Enhanced Query Monitoring went GA on January 29, 2025, available for both Serverless and provisioned deployments. It surfaces query bottlenecks, execution plan anomalies, and resource contention at the query level through the Redshift console, removing the need for SQL-level diagnostic work against system tables. When WLMQueueLength spikes, you can go directly to a ranked list of the queries causing saturation, see their execution plan highlights, and identify whether the bottleneck is a sort key miss, a cross-join, or a network shuffle between nodes, all without writing a single STL_QUERY lookup.

Redshift troubleshooting previously required a senior engineer with DBA-level knowledge of the system catalog. This change shifts basic performance diagnosis to any SRE comfortable with the console.

AI-Driven Scaling and Its Monitoring Implications

AWS previewed Redshift Serverless AI-driven scaling at re:Invent 2023, and it went GA in October 2024. Verify current GA status in the AWS documentation for your region before production adoption, as the preview-to-GA timeline varies by feature and region. AI-driven scaling automates RPU (Redshift Processing Unit) allocation by observing query patterns over time and adjusting base and max RPU settings to balance cost against performance.

WLM queue priority, query monitoring rule configuration, and workload classification for mixed BI and ETL environments require manual configuration even on Serverless clusters running AI-driven scaling.
A Redshift Serverless cluster with AI-driven scaling still requires you to define how ETL jobs and ad hoc analyst queries share resources, and which queue takes priority when both arrive simultaneously. Those decisions drive WLMQueueLength behavior regardless of how accurately the scaler provisions RPUs.

Capacity Planning: Using Monitoring Data to Drive Scaling and Cost Decisions

The cross-service capacity heuristic worth building into your runbooks: a simultaneous DynamoDB p99 latency increase combined with ElastiCache CacheHitRate dropping below 0.90 can indicate several different conditions. Potential causes include a fan-out query change at the application layer, a cache node failure, a network event between services, or a deployment that altered cache key patterns. This symptom combination warrants application-layer investigation to confirm the root cause before deciding which service to scale. Scaling either service without confirming the shared trigger wastes capacity and can mask the actual issue.

DynamoDB

Build a 7-day ConsumedCapacityUnits average as your baseline, then set auto-scaling target utilization at 70% of provisioned capacity. Auto-scaling starts adding capacity once consumption rises above that target, and the remaining 30% of provisioned capacity is the buffer that absorbs a spike while scaling catches up: traffic can grow by roughly 43% (100/70) from its steady-state level before consumed capacity reaches 100% of provisioned and throttling begins.

When evaluating reserved capacity, AWS Cost Explorer surfaces DynamoDB reserved capacity recommendations with projected savings. At a 3-year term commitment, reserved capacity can save up to 77% versus provisioned capacity hourly rates. Reserved capacity makes financial sense for tables that have run in provisioned mode for at least 90 days with predictable consumption patterns. For tables with volatile or seasonal traffic, on-demand mode avoids the risk of underutilization that makes reserved capacity economically counterproductive.

ElastiCache

Trend DatabaseMemoryUsagePercentage over a 72-hour window.
If it trends upward at a rate disconnected from traffic growth (the cache dataset is growing while the request rate stays flat), that signals cache dataset expansion rather than increased load. The operational response is node scaling before you cross the 75% alert threshold, as memory pressure at that level narrows your runway to eviction-level problems.

For ElastiCache Serverless using Valkey, monitor ElastiCacheProcessingUnits (ECPUs) as the scaling proxy. ECPU consumption scales with operation complexity and data volume, making it the primary cost and capacity signal for Serverless deployments where node count decisions don’t apply.

Redshift

Correlate CPUUtilization with QueryQueueTime over a 1-week window. The CPU-vs-queue diagnostic from the Redshift metrics section applies here as your scaling decision input: high CPU points to node scaling, while high queue time with moderate CPU points to WLM slot reconfiguration.

Where CloudWatch’s Coverage Falls Short

The per-service metrics and tooling above give you solid visibility within each namespace. The gaps show up when you need to work across them: correlating alarms from different services, connecting logs to metrics, and suppressing the noise when a single event triggers alerts everywhere at once.

No Native Cross-Service Correlation

You can build a CloudWatch dashboard that co-locates DynamoDB ThrottledRequests, ElastiCache Evictions, and Redshift WLMQueueLength on a shared timeline, but it’s manual widget assembly with no causal linking between the graphs. The assembly is also fragile: every new table, cluster, or queue requires manual dashboard updates to keep the view current.

Log-to-Metric Correlation Is Manual

Connecting a slow Redshift query logged in STL_QUERY to a spike in DynamoDB SuccessfulRequestLatency at the same timestamp requires opening CloudWatch Logs Insights for Redshift audit logs, querying by timestamp range, then manually comparing results against the DynamoDB metric timeline.
The Enhanced Query Monitoring GA from January 2025 reduces this friction for Redshift-internal diagnosis, but the cross-service correlation step remains a human task.

Cross-Account Visibility

CloudWatch Database Insights added cross-account and cross-region support for database fleet monitoring on November 21, 2025. Verify the current scope of service coverage at the time of your deployment, as the announcement references database fleet monitoring broadly, and the specific inclusion of ElastiCache and Redshift alongside RDS and Aurora should be confirmed against current documentation.

Alert Fatigue Across Three Namespaces

Each service generates its own alarm stream with no dependency-aware suppression between services. When a single network event causes DynamoDB latency to rise, ElastiCache hit rate to drop, and Redshift WLM queue depth to increase, CloudWatch fires alarms across three separate notification channels simultaneously. The on-call engineer receives three alerts for a single root cause event, with no automated path from any alarm to the triggering condition.

ManageEngine OpManager Nexus addresses these gaps directly: it auto-discovers DynamoDB tables, ElastiCache clusters, and Redshift clusters within your AWS account, builds correlated dashboards that connect metrics across all three services on a shared timeline without manual widget assembly, and applies dependency-aware alarm suppression that treats downstream symptoms of a single event as a grouped incident. For teams running two or more of these managed database services, the operational delta between nine isolated CloudWatch alarms and a correlated, root-cause-linked view determines where monitoring hours get spent or recovered.

Your Monitoring Baseline: Nine Alarms and a Unified View

The minimum viable monitoring baseline for all three services is nine CloudWatch alarms routed to a single SNS topic. These are practitioner-recommended starting points.
Tune each threshold to your observed workload behavior.

DynamoDB Alarms

| Alarm Name | Metric | Threshold | Evaluation Period |
|---|---|---|---|
| DynamoDB-Throttles | ThrottledRequests | > 0 | 1 minute |
| DynamoDB-LatencyP99 | SuccessfulRequestLatency (p99) | > 20 ms | 5 minutes |
| DynamoDB-RCUHigh | ConsumedReadCapacityUnits | > 80% of provisioned | 5 minutes |

Metric definitions: DynamoDB CloudWatch metrics reference.

ElastiCache Alarms

| Alarm Name | Metric | Threshold | Evaluation Period |
|---|---|---|---|
| Cache-HitRateLow | CacheHitRate | < 0.90 | 5 minutes |
| Cache-EvictionsHigh | Evictions | > 100 per minute | 1 minute |
| Cache-MemoryHigh | DatabaseMemoryUsagePercentage | > 75% | 5 minutes |

Metric definitions: ElastiCache CloudWatch metrics reference.

Redshift Alarms

| Alarm Name | Metric | Threshold | Evaluation Period |
|---|---|---|---|
| Redshift-CPUHigh | CPUUtilization | > 80% | 5 minutes |
| Redshift-QueueDepth | WLMQueueLength | > 3 | 5 minutes |
| Redshift-QueueWait | QueryQueueTime | > 30 seconds | 5 minutes |

Metric definitions: Redshift CloudWatch metrics reference.

Route all nine alarms to a single SNS topic. Tag each alarm with a Service tag (values: DynamoDB, ElastiCache, Redshift) so your incident management tooling can filter and group by service. This configuration puts all three alarm streams in one place and makes it detectable when multiple service alarms fire within a short time window, which is the observable signature of a cross-service cascade.

Run these nine alarms for a week or two. You’ll see the pattern: multiple alarms firing within the same minute window for what turns out to be a single root cause, with no automated way to connect them. That delta is what a correlated observability layer closes. ManageEngine OpManager Nexus provides that layer for AWS database services, with auto-discovery, cross-service dashboards, and dependency-aware alarm suppression out of the box.

What’s your current setup for correlating alarms across managed AWS services? If you’re running DynamoDB, ElastiCache, or Redshift and have found thresholds or approaches that work well for your team, share them in the comments.
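The cascade signature described above (alarms from several services firing within a short window) can be checked with a few lines of code once all nine alarms feed one topic. The following is a minimal sketch, assuming the alarm notifications have already been parsed into simple records. `AlarmEvent`, `find_cascades`, the 60-second window, and the sample timestamps are illustrative assumptions, not part of CloudWatch or SNS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AlarmEvent:
    service: str       # DynamoDB, ElastiCache, or Redshift
    alarm_name: str
    fired_at: datetime


def find_cascades(events, window=timedelta(seconds=60), min_services=2):
    """Group alarms into bursts and keep bursts that span several services."""
    cascades = []
    burst = []
    for event in sorted(events, key=lambda e: e.fired_at):
        # Close the current burst once an alarm falls outside its window.
        if burst and event.fired_at - burst[0].fired_at > window:
            if len({e.service for e in burst}) >= min_services:
                cascades.append(burst)
            burst = []
        burst.append(event)
    if len({e.service for e in burst}) >= min_services:
        cascades.append(burst)
    return cascades


# Hypothetical alarm history: three services within 45 seconds, then a lone alarm.
t0 = datetime(2025, 1, 1, 12, 0, 0)
events = [
    AlarmEvent("DynamoDB", "DynamoDB-LatencyP99", t0),
    AlarmEvent("ElastiCache", "Cache-HitRateLow", t0 + timedelta(seconds=20)),
    AlarmEvent("Redshift", "Redshift-QueueDepth", t0 + timedelta(seconds=45)),
    AlarmEvent("DynamoDB", "DynamoDB-Throttles", t0 + timedelta(minutes=30)),
]

cascades = find_cascades(events)
for burst in cascades:
    print([e.alarm_name for e in burst])
```

Grouping by the first alarm in each burst keeps the logic simple. It only flags that alarms coincided and says nothing about which service triggered the others.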

By Damaso Sanoja
Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing

In this blog post, we will see the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA's AIPerf tool, tells you the truth about your LLM deployment. If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.

What Is Throughput?

Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window? Depending on the context, throughput is expressed as:

• Requests per second (req/s) – most common in API and web performance testing
• Transactions per second (TPS) – common in database and payment system testing
• Megabytes per second (MB/s) – common in file transfer and network testing
• Tokens per second – specific to LLM inference workloads

In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at. Throughput tells you volume. It does not tell you the quality.

The Problem With Throughput Alone

Here is a scenario that should feel familiar. You run a load test. Throughput looks great, 100 req/s. No errors. You ship. Real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.

What Happened?

The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions. This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions.
But in LLM performance testing, the problem is deeper.

The Dosa Stall Analogy

Imagine a busy dosa stall in Coimbatore during the morning rush. The stall owner proudly says, "We served 100 dosas this hour." That is throughput. 100 dosas per hour. But here is the real picture:

• 28 dosas were served cold because the tawa was overcrowded
• 15 dosas arrived 20 minutes after the order because the batter queue was too long
• 5 dosas were undercooked

Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour. The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised. Now imagine this stall is your LLM API, and each dosa is an inference request. The "hot and crispy within 5 minutes" rule is your SLO.

What Is Goodput?

Goodput is the number of requests per second that completed and met all your defined SLO constraints. This definition comes directly from NVIDIA's AIPerf tool (the successor to GenAI-Perf), which is widely used for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:

Shell
    aiperf profile \
      --model "llama-3.1-70b" \
      --url http://inference-server:8000 \
      --goodput-ttft 500 \
      --goodput-itl 100

This tells the tool: only count a request toward goodput if:

• Time to First Token (TTFT) was under 500ms, AND
• Inter-Token Latency (ITL) was under 100ms

A request that completes but violates either constraint does not count. It is a failed request from the user's perspective, even if the HTTP status code was 200.

How Goodput Works in LLM Performance Testing

LLM inference has two latency metrics that users feel directly:

Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.
Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.

Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first: requests sit waiting to be processed. ITL can follow if GPU compute is saturated.

Throughput stays stable through all of this. The server is still completing requests. It is just that the user experience is becoming progressively worse.

Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.

As I showed in an earlier post, 99% of Requests Failed and My Dashboard Showed Green, you can have a request throughput of 0.91 req/s that looks reasonable, while goodput sits at 0.01 req/s, meaning about 99% of requests were silently breaching the SLO.

The Formula

Goodput is straightforward once you have your SLO thresholds defined:

Plain Text
    Goodput (req/s) = Requests that met ALL SLO constraints / Total measurement time (seconds)

For an LLM workload with TTFT and ITL SLOs:

Plain Text
    A request counts toward goodput if:
        TTFT < ttft_slo_ms AND ITL < itl_slo_ms

Notice that it uses AND, not OR. Both conditions must be satisfied. A request with excellent ITL but a TTFT of 3 seconds still fails. The user waited 3 seconds before seeing anything, which is a broken experience, regardless of how smooth the streaming was after that.
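To make the AND rule concrete, here is a minimal Python sketch that derives TTFT and ITL from token arrival timestamps and then applies both thresholds. It assumes that ITL is taken as the mean gap between consecutive tokens and that timestamps are in milliseconds measured from the moment the request was sent. The function names `ttft_and_itl_ms` and `meets_slo` are illustrative and not part of AIPerf, which may compute these metrics differently.

```python
def ttft_and_itl_ms(token_times_ms):
    """Return (TTFT, mean ITL) in milliseconds.

    token_times_ms: arrival time of each streamed token, in milliseconds
    since the request was sent (assumed non-empty and increasing).
    """
    ttft = token_times_ms[0]
    gaps = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl


def meets_slo(ttft_ms, itl_ms, ttft_slo_ms=500, itl_slo_ms=100):
    """A request counts toward goodput only if BOTH constraints hold."""
    return ttft_ms < ttft_slo_ms and itl_ms < itl_slo_ms


# Smooth streaming (50 ms between tokens) but a 3-second wait for the first token.
slow_start = ttft_and_itl_ms([3000, 3050, 3100, 3150])
# The same streaming cadence with a 300 ms first token.
fast_start = ttft_and_itl_ms([300, 350, 400, 450])

print(meets_slo(*slow_start))  # False: TTFT breaches the 500 ms SLO
print(meets_slo(*fast_start))  # True: both constraints are met
```

Both requests stream at the same pace, yet only the second one would be counted toward goodput.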
Pseudocode: Calculating Goodput

Here is a simplified sketch, written as Python, showing how goodput is computed behind the scenes:

Python
    import time

    TTFT_SLO_MS = 500  # milliseconds
    ITL_SLO_MS = 100   # milliseconds


    def run_benchmark(prompts, send_llm_request):
        """Send each prompt in turn and report throughput, goodput, and SLO compliance.

        send_llm_request stands in for your client call. It must return an object
        with time_to_first_token_ms and inter_token_latency_ms attributes.
        """
        total_requests = 0
        compliant_requests = 0
        measurement_start = time.monotonic()

        for prompt in prompts:
            result = send_llm_request(prompt)
            total_requests += 1
            # A request counts toward goodput only if it meets ALL SLO constraints
            if (result.time_to_first_token_ms < TTFT_SLO_MS
                    and result.inter_token_latency_ms < ITL_SLO_MS):
                compliant_requests += 1

        duration_seconds = time.monotonic() - measurement_start
        throughput = total_requests / duration_seconds
        goodput = compliant_requests / duration_seconds
        compliance_pct = compliant_requests / total_requests * 100

        print(f"Request Throughput (req/s): {throughput:.2f}")
        print(f"Goodput (req/s): {goodput:.2f}")
        print(f"SLO Compliance Rate (%): {compliance_pct:.1f}")

When your system is healthy and under low load, throughput and goodput will be very close. As concurrency increases and the system starts to struggle, you will see goodput diverge downward from throughput. That divergence is your early warning signal.

Throughput vs Goodput: Side-by-Side

| Dimension | Throughput | Goodput |
|---|---|---|
| What it measures | All completed requests per second | Completed requests per second that met SLO |
| SLO-aware | No | Yes |
| Fails silently on latency degradation | Yes | No |
| Typical units | req/s, TPS, MB/s, tokens/s | req/s |
| Tool example | JMeter, k6, wrk | NVIDIA AIPerf |
| Use case | Capacity planning, raw volume | User experience validation, production readiness |
| Can look good while users suffer | Yes | No |

When Should You Use Each Metric?
Use throughput when:

• You are doing capacity planning and need to understand raw system limits
• You are comparing infrastructure configurations (e.g., 2 GPU vs 4 GPU) at the same load level
• You are generating a baseline before adding SLO constraints

Use goodput when:

• You are validating the production readiness of an LLM endpoint
• You want to know whether users are actually being served well, not just served
• You are running a concurrency sweep to find the point where your SLO breaks
• You are integrating LLM performance checks into your CI/CD pipeline

A healthy practice is to report both numbers together. If goodput and throughput are close, your system is healthy. If they diverge significantly, you have a quality problem that raw throughput is hiding.

Key Takeaway

Throughput answers: Can the system handle the volume?

Goodput answers: Is the system actually serving users well at that volume?

In traditional performance testing, latency SLOs were enforced through assertions and percentile checks. In LLM performance testing, goodput formalizes this into a single metric that is directly comparable to throughput. NVIDIA's AIPerf makes this measurable out of the box with the --goodput-ttft and --goodput-itl flags.

Next time you look at a load test result, ask yourself: Do I know the goodput number? If the answer is no, you only have half the picture.

Happy testing!
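The concurrency sweep mentioned above, finding the point where the SLO breaks, can be sketched by comparing goodput to throughput at each load level. This is a minimal illustration under stated assumptions: the `SweepPoint` type, the 95% compliance floor, and the sample numbers are made up for the example, and in practice each point would come from a separate benchmark run.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    concurrency: int   # simultaneous requests during this run
    throughput: float  # all completed requests per second
    goodput: float     # SLO-compliant requests per second


def compliance(point):
    """Fraction of completed requests that met the SLO."""
    return point.goodput / point.throughput if point.throughput > 0 else 0.0


def max_healthy_concurrency(points, min_compliance=0.95):
    """Highest concurrency reached before compliance first falls below the floor."""
    healthy = None
    for point in sorted(points, key=lambda p: p.concurrency):
        if compliance(point) < min_compliance:
            break
        healthy = point.concurrency
    return healthy


# Made-up sweep results, for illustration only.
sweep = [
    SweepPoint(concurrency=1, throughput=0.50, goodput=0.50),
    SweepPoint(concurrency=4, throughput=0.85, goodput=0.84),
    SweepPoint(concurrency=8, throughput=0.90, goodput=0.70),
    SweepPoint(concurrency=16, throughput=0.91, goodput=0.01),
]

limit = max_healthy_concurrency(sweep)
print(limit)  # 4
```

In a CI/CD pipeline, the same function could fail the build when `limit` drops below the concurrency the endpoint is expected to serve.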

By NaveenKumar Namachivayam, DZone Core
Optimizing High-Volume REST APIs Using Redis Caching and Spring Boot (With Load Testing Code)

High-volume REST APIs can easily become bottlenecked by database access, leading to high latency and poor throughput. Even after optimizing SQL queries and adding indexes, a database call might take hundreds of milliseconds, still far slower than a competitor’s 50 ms response that leverages caching. In-memory caching offers orders of magnitude faster data access. Traditional databases measure response times in milliseconds, while Redis operations complete in microseconds. By storing frequently accessed data in memory, APIs can handle dramatically more requests per second with much lower latency. As an example, one test showed that using Redis cut an expensive request’s response time from over 10 seconds down to under 1 second.

Setting Up Redis Caching in Spring Boot

Before diving into patterns, let’s ensure the basic setup is in place. We assume you have a local Redis server running. In your Spring Boot project, include the necessary dependencies for caching and Redis integration. For example, add the following to your Maven pom.xml:

XML
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-cache</artifactId>
        <version>3.1.5</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
        <version>3.1.5</version>
    </dependency>

These bring in Spring’s generic caching support and the Redis connector. Next, enable caching in your application by annotating a configuration or main class with @EnableCaching. Spring Boot will auto-configure a RedisCacheManager if it finds Redis on the classpath. You can then define cache settings via configuration. For example, you might set a default time to live for cache entries in application.properties or via a RedisCacheConfiguration bean.
A simple property-based configuration for a local Redis could be:

Properties files
    spring.cache.type=redis
    spring.data.redis.host=localhost
    spring.data.redis.port=6379
    # 600000 ms = 10 minutes TTL
    spring.cache.redis.time-to-live=600000

Now we have a basic cache setup. Let’s explore caching patterns and how to implement them in Spring Boot.

Write-Through and Write-Behind Caching

Caching isn’t just for reads; we also need a strategy for writes. Write-through and write-behind are patterns to handle data modifications in a cached system:

Write-Through

On every data write, the application synchronously writes to the database and the cache. This ensures the cache is always up-to-date with the latest data. In practice, a write-through approach might perform the database operation, then immediately update the Redis cache with the new value. Spring’s caching abstraction can support this via annotations like @CachePut or by combining a normal save method with a manual cache update. For example, in a product service, we might do:

Java
    @CachePut(value = "products", key = "#product.id")
    public Product updateProduct(Product product) {
        // Save to DB first
        Product saved = repo.save(product);
        return saved; // Spring will put this return value into "products::[id]" cache
    }

This method will update the database and also put the new product data into the cache under the given key. The next read for that product can be served from cache immediately, with no stale data. If we delete an item, we can use @CacheEvict to remove it from the cache at the same time as removing it from the DB, preventing ghost entries.

Write-Behind (Write-Back)

In this less common strategy, the application writes to the cache first and defers the database write till later. The idea is to batch or coalesce many writes to reduce DB pressure.

Avoiding Cache Stampede (Thundering Herd)

When caching for high-volume traffic, cache stampedes are a serious concern.
A stampede occurs when a cache entry expires or is missing, and many concurrent requests attempt to fetch the same data from the database at once. In a high QPS system, this can overwhelm the database and essentially negate the benefit of caching. We need strategies to prevent dozens or hundreds of threads from piling onto the DB when a popular item’s cache entry is invalidated.

One common solution is to use locking or synchronization around cache misses. The idea is to ensure only one thread does the expensive database fetch and populates the cache, while the others wait or get served a stale value. In a single-instance application, you might synchronize on a Java lock per key. In a distributed environment, you’ll want a distributed lock. Redis itself can be used to implement this. For our Spring Boot application, we could integrate Redisson and use it in the service method. For instance:

Java
    RLock lock = redissonClient.getLock("lock:product:" + productId);
    // Wait up to 5s to acquire, auto-release after 10s (tryLock can throw InterruptedException)
    boolean acquired = lock.tryLock(5, 10, TimeUnit.SECONDS);
    if (acquired) {
        try {
            // Double-check cache after acquiring lock
            Product cached = redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) {
                return cached;
            }
            // Cache still empty, fetch from DB and update cache.
            // Assuming repo is a Spring Data repository, findById returns an Optional;
            // orElseThrow avoids writing a null value into the cache.
            Product dbData = repo.findById(productId).orElseThrow();
            redisTemplate.opsForValue().set(cacheKey, dbData, Duration.ofMinutes(10));
            return dbData;
        } finally {
            // The 10s lease may already have expired, so unlock only if we still hold the lock
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    } else {
        // Could not acquire lock (timed out) – fallback to a stale cache or return an error
        ...
    }

In the above pseudocode, multiple threads hitting a missing cache key will attempt to tryLock. One will succeed and do the DB query, while others wait up to 5 seconds. Once the first thread populates the cache and releases the lock, the others will find the data in the cache and avoid hitting the DB. This approach effectively serializes the cache miss for a given key, preventing a herd of concurrent DB calls.
It’s a bit heavy, so you might not use it for every key; typically, you'll use it for very hot items or expensive queries that you know could trigger stampedes. Simpler techniques can also mitigate stampedes, like cache early recomputation or using slightly randomized TTLs so not everything expires at the same time.

Load Testing the Impact of Caching With JMeter

After implementing Redis caching, it’s critical to verify the performance improvements under realistic load. Apache JMeter is a popular tool for simulating concurrent users and measuring response times and throughput of your API. We can use JMeter to compare the API’s behavior with and without cache and ensure that our caching does indeed handle high volume as expected.

For example, suppose we want to test an endpoint /products/{id} which we’ve optimized with caching. We can create a JMeter test plan with a Thread Group of, say, 100 threads and loop them to send requests for various product IDs. JMeter will report metrics like average response time, throughput, error rate, etc. In a baseline test, you might observe higher latencies and lower throughput. Then, in a test with the cache warmed (most requests hitting the cache), you should see a dramatic reduction in response time and the ability to handle more requests per second. In one real-world inspired demo, using Redis caching improved latency from 10 seconds on a cold miss to under 1 second on subsequent hits. Another way to look at it: memory caching can serve data so fast that your throughput might be an order of magnitude higher than relying solely on the DB. This aligns with the earlier statement that no amount of DB tuning beats data served from an in-memory cache.

Using JMeter

• Set up JMeter (you can run it in GUI mode to design the test plan, and then use non-GUI mode for the actual high-load run for better accuracy).
• Configure an HTTP Request sampler pointing at your API (e.g., GET http://localhost:8080/products/1234).
• Use a Thread Group to simulate the desired number of concurrent users and iterations. You can add a Timer if you want a delay between requests, or just hammer the API as fast as possible to find its max throughput.
• Add listeners like Summary Report or Aggregate Report to gather results.

To automate performance testing, you can even integrate JMeter with your build. A Maven plugin exists to run JMeter tests as part of a build pipeline.

JMeter Configuration Snippet

Suppose we want to quickly run a load test from the command line (non-GUI). We could use a command like:

Shell
    jmeter -n -t path/to/testplan.jmx -l results.jtl -Jthreads=100 -Jduration=60

This would run the JMeter test plan for 60 seconds with 100 threads, logging results to results.jtl, provided the test plan reads the threads and duration properties (for example via ${__P(threads)} and ${__P(duration)} in the Thread Group). The -J options only set JMeter properties; they have no effect unless the plan references them. Make sure to monitor your system while testing, especially if everything is on the same machine; the load test could itself become a bottleneck or interfere with results if not planned carefully.

As a quick check, you can also use Spring Boot Actuator metrics or Redis monitoring to see cache hit rates. A healthy caching layer under load should show a high cache hit percentage, which correlates with lower DB usage and faster responses.

Conclusion

Optimizing a high-volume REST API often requires rethinking data access patterns, and Redis caching is a powerful technique to achieve massive performance gains. By using the cache-aside pattern, we serve most reads from fast in-memory storage, drastically reducing latency and database load. With write-through strategies and careful cache invalidation, we keep cached data consistent with the source of truth. It’s equally important to anticipate real-world issues like cache stampedes using locks or other techniques to prevent cache misses from overwhelming your database in a traffic surge. Finally, always test under load. Use tools like JMeter to simulate concurrent access and measure the impact of your caching.
You should observe significant improvements in throughput and response times, validating that the cache is doing its job. If the results aren’t as expected, that’s an indication to refine your caching strategy or investigate bottlenecks.
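One of the simpler stampede mitigations mentioned earlier, slightly randomized TTLs, can be sketched in a few lines. The example below is in Python for brevity and is only an illustration of the idea: the function name `jittered_ttl_seconds` and the 10% jitter fraction are assumptions, and in the Spring Boot service the resulting value would be passed as the expiry when writing the cache entry.

```python
import random


def jittered_ttl_seconds(base_seconds, jitter_fraction=0.1, rng=random):
    """Return the base TTL shifted by a random amount of up to +/- jitter_fraction."""
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)


# With the 10-minute TTL used earlier, each entry expires somewhere between 9 and 11 minutes.
ttls = [jittered_ttl_seconds(600) for _ in range(5)]
print(ttls)
```

Because entries cached at the same moment now expire at slightly different times, a burst of cache fills is less likely to turn into a burst of simultaneous misses ten minutes later.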

By Mallikharjuna Manepalli
Manual Investigation: The Hidden Bottleneck in Incident Response

Every engineering team I talk to has the same problem. When a P1 fires, coding stops. An engineer gets pulled in, spends 30 to 60 minutes hunting through logs, tracing requests across three or four systems, and cross-referencing deployment history before they can even form a hypothesis about what broke. By the time they have a diagnosis, they've already burned the better part of their morning.

We've normalized this. It's just become part of the job. But the math is brutal: A team handling 50 incidents per month at 4 to 8 hours of resolve time each is looking at 200 to 400 engineering hours lost. That's more than a full month of a senior engineer's capacity dedicated entirely to looking backward. The investigation workflow itself hasn't changed in 20 years.

Why Manual Investigation Breaks Down in Modern Systems

Traditional incident response was designed for simpler architectures. An on-call engineer would look at a dashboard, pull some logs, and apply tribal knowledge to find the cause. For known failure patterns with established runbooks, this still works.

Modern distributed systems are a different animal. A single error can originate in one service, propagate through a message queue, surface in a database connection pool, and present to the user as a generic 500 error. Tracing that sequence manually means jumping between your observability platform, your deployment tool, your APM, and whatever documentation exists for the relevant service.

Four problems make this worse:

• Multi-system correlation. Errors don't stay in one place. Engineers have to manually trace a transaction across APIs, databases, and third-party dependencies to find where things actually broke.
• Signal-to-noise ratio. A production system generates thousands of log entries per second under normal conditions and far more during an incident. Finding the meaningful signal by hand is slow and error-prone.
• Context reconstruction. Understanding the root cause requires knowing what changed recently, such as deployments, config updates, and infrastructure changes. That information is scattered across tools with incompatible formats and permission models.
• Cognitive load under pressure. During a P0, engineers are simultaneously investigating, making decisions, and fielding status requests from stakeholders. Typically, no one person does all three of these well at once. Under that kind of load, things can easily get missed.

Manual correlation is where investigation time disappears. The workflow needs to change.

How AI Changes the Investigation Phase

Now, AI does the detective work before the engineer ever opens the ticket. The alert is just the starting gun.

1. Automated Timeline Reconstruction

AI correlates signals across your systems in real time. A reconstructed timeline might look like:

• 13:42:15 – Deployment completed
• 13:42:47 – First timeout errors appear
• 13:43:12 – Error rate reaches 15%
• 13:44:03 – Database connection pool exhausted

That sequence, assembled automatically, tells the engineer exactly where to look. No log-grepping required.

2. Similar Incident Matching

Most incidents aren't genuinely novel. They're variations on failure patterns the team has seen before, often caused by the same underlying conditions. The challenge is that the previous incident was three months ago, handled by a different engineer, documented inconsistently, and buried in a ticketing system nobody queries.

AI indexes past incidents and how they were resolved. When a new incident fires, it pulls up the closest matches instantly. "Error signature matches Issue #4532 from six weeks ago. Both followed Redis deployments. Resolution: connection pool adjustment." That's the kind of context that currently lives in one engineer's head, if anyone's. And when that engineer leaves, it's gone.

3. Parallel Hypothesis Testing With Confidence Scoring

Human diagnosis is linear.
We check one hypothesis, rule it out, and move to the next. Under time pressure, this sequential approach extends MTTR every time the first guess is wrong.

AI evaluates multiple hypotheses simultaneously using a multi-agent validation architecture. Specialized agents analyze code changes, infrastructure metrics, and error patterns in parallel, then cross-check findings before surfacing anything to a human. The output is confidence-scored leads:

• High (85%): Connection pool exhaustion. Deployment v2.4 increased concurrent requests without adjusting pool size.
• Medium (60%): Database performance degradation.
• Low (25%): Third-party authentication issue.

The engineer can focus immediately on the 85%.

4. Contextual Remediation Guidance

Finding the root cause doesn't settle what to do next. Engineers frequently have to pause after diagnosis to hunt for runbooks, check with the original developer, or make a judgment call with incomplete information about side effects.

AI covers that ground, recommending specific remediation steps based on system state and past resolutions: "Recommended action: Increase API connection pool to 100 in config/database.yml. Rolling restart required. Expect error rate to drop within 2 minutes."

The Architecture Behind It

Production-grade AI investigation runs on a composite architecture, not a single model, built to handle the volume, speed, and accuracy requirements of real incidents. Traditional ML handles high-volume anomaly detection and noise reduction at the signal layer. Small language models handle fast, private log parsing where latency matters. LLMs take over for synthesis and generating summaries that engineers can actually act on. Multi-agent architectures add a "critic" layer where specialized agents cross-check findings before anything surfaces to a human, which is where false positive reduction actually happens.

This matters for teams evaluating whether to build internally.
Connecting an LLM to Slack and pointing it at a vector database of logs is straightforward. Building a system that handles novel incidents accurately, runs during a log storm, and never sends raw customer data to a public model endpoint is not. The retrieval pipeline alone (knowing which 50 log lines are relevant out of 5 million) is a substantial engineering problem. Honestly, that's what kills most homegrown attempts.

What This Means for SREs

Right now, SREs spend 40 to 60% of their time on manual data gathering, repeated context reconstruction, and re-investigating failure patterns the team has already solved. That's the portion AI handles. At Strudel, we've seen teams cut investigation time from 30 to 60 minutes down to under 60 seconds on incidents where the system has relevant historical context.

Engineers are still putting in the hours, just on different work: making decisions, checking the AI's conclusions, and building systems that prevent recurrence. At 50 incidents a month, that time adds up fast.
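The timeline reconstruction step described earlier is, at its core, a merge of time-stamped events from separate tools into one ordered sequence. The sketch below shows only that merge step, using the example timeline from the article. The source labels (`deploy`, `logs`, `metrics`), the tuple layout, and the fixed date are assumptions made for illustration, and a production system would also need to normalize clocks and filter noise before merging.

```python
import heapq
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists (each already sorted by time) into one time-ordered list."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


def at(clock):
    # Helper for the example: parse an HH:MM:SS string on an arbitrary fixed date.
    return datetime.strptime("2025-01-01 " + clock, "%Y-%m-%d %H:%M:%S")


deploys = [(at("13:42:15"), "deploy", "Deployment completed")]
logs = [
    (at("13:42:47"), "logs", "First timeout errors appear"),
    (at("13:44:03"), "logs", "Database connection pool exhausted"),
]
metrics = [(at("13:43:12"), "metrics", "Error rate reaches 15%")]

timeline = build_timeline(deploys, logs, metrics)
for when, source, message in timeline:
    print(when.strftime("%H:%M:%S"), source, message)
```

Seeing the deployment line directly above the first timeout is what turns three separate tool views into a single lead.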

By Brian Kaufman
Observability in Spring Boot 4

In microservices, you’ve likely broken into a cold sweat more than once when a request suddenly 'vanishes' the moment it hits a Database or a Message Broker. It is a true operational nightmare. However, with the release of Spring Boot 4 in late 2025, building a comprehensive Observability system has become easier than ever, thanks to the 'all-in' support from Micrometer Tracing.

The Problem: "Anonymous" Queries

When your database starts lagging (slow queries), you check the processlist in MySQL only to find a vague line: SELECT * FROM orders WHERE status = 'PENDING' ... At this point, the ultimate head-scratcher arises: "Who triggered this? Which API is executing this statement?" Without a Trace ID embedded directly into the query, you can easily spend hours digging through logs just to piece the two ends together.

The Solution: "Pinning" Trace IDs Directly into SQL Comments

With Spring Boot 4, we no longer need complex third-party libraries or clunky, "home-brewed" workarounds. Everything is now handled seamlessly through Spring Boot Actuator and Hibernate StatementInspector. The concept is simple: we attach the Trace ID directly to the SQL statement as a comment. When looking at the Database logs, you will know exactly where that request originated.
Project Setup

Let’s start by initializing a Spring Boot 4.0.2 project with the following structure:

File: build.gradle

To unlock the power of Observability, you will need to include these key dependencies in your configuration file:

Groovy
    plugins {
        id 'java'
        id 'org.springframework.boot' version '4.0.2'
        id 'io.spring.dependency-management' version '1.1.7'
    }

    group = 'org.example'
    version = '0.0.1-SNAPSHOT'
    description = 'demo-trace'

    java {
        toolchain {
            languageVersion = JavaLanguageVersion.of(17)
        }
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
        implementation 'org.springframework.boot:spring-boot-starter-web'
        implementation 'org.springframework.boot:spring-boot-starter-actuator'
        implementation 'io.micrometer:micrometer-tracing-bridge-otel'
        implementation 'com.mysql:mysql-connector-j'
        compileOnly 'org.projectlombok:lombok'
        annotationProcessor 'org.projectlombok:lombok'
    }

    tasks.named('test') {
        useJUnitPlatform()
    }

Implementing the SQL Inspector

Now, we will create a class that acts as a "gatekeeper" to intercept and modify every SQL statement just before it is sent to the Database.
File: SqlCommentStatementInspector.java

Here is how we use Hibernate's StatementInspector to automatically inject the trace ID into your queries:

Java

    package org.example.demotrace;

    import lombok.extern.slf4j.Slf4j;
    import org.hibernate.resource.jdbc.spi.StatementInspector;
    import org.slf4j.MDC;

    import java.net.InetAddress;

    @Slf4j
    public class SqlCommentStatementInspector implements StatementInspector {

        private static String HOST_NAME;

        // Resolve the host name once, at class load time.
        static {
            try {
                HOST_NAME = InetAddress.getLocalHost().getHostName();
            } catch (Exception e) {
                log.error("Cannot get local host name", e);
                HOST_NAME = "unknown-host";
            }
        }

        @Override
        public String inspect(String sql) {
            // The trace ID is expected in the MDC under the key "traceId"
            // (TraceIdFilter puts it there for every request).
            String traceId = MDC.get("traceId");
            if (traceId == null) traceId = "no-trace";
            // Append host and trace ID as a trailing SQL comment.
            return sql + " /* host: " + HOST_NAME + "; traceId: " + traceId + " */";
        }
    }

To complete the process, we need a "bridge" that ensures the trace ID is always available within the context of each request. Below is how we set up a filter to manage this.

Linking the Trace ID to MDC (Mapped Diagnostic Context)

For the SqlCommentStatementInspector to retrieve the trace ID, we must ensure this information is pushed into the MDC. We will implement a standard Servlet Filter to handle this "identification" step the moment a request hits the system.
File: TraceIdFilter.java

This filter reads the trace ID from the X-Trace-Id request header, or generates a new one if the header is missing, and stores it in the MDC. As a result, both your log files and your SQL comments carry the same ID:

Java

    package org.example.demotrace;

    import jakarta.servlet.*;
    import jakarta.servlet.http.HttpServletRequest;
    import jakarta.servlet.http.HttpServletResponse;
    import org.slf4j.MDC;
    import org.springframework.stereotype.Component;

    import java.io.IOException;
    import java.util.UUID;

    @Component
    public class TraceIdFilter implements Filter {

        private static final String TRACE_ID_HEADER = "X-Trace-Id";
        private static final String TRACE_ID_MDC_KEY = "traceId";

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest httpRequest = (HttpServletRequest) request;
            HttpServletResponse httpResponse = (HttpServletResponse) response;

            // Reuse the trace ID from the request header, or create a new one.
            String traceId = httpRequest.getHeader(TRACE_ID_HEADER);
            if (traceId == null || traceId.isEmpty()) {
                traceId = UUID.randomUUID().toString();
            }

            MDC.put(TRACE_ID_MDC_KEY, traceId);
            // Echo the ID back so the caller can correlate on its side.
            httpResponse.setHeader(TRACE_ID_HEADER, traceId);

            try {
                chain.doFilter(request, response);
            } finally {
                // Clear the MDC entry so it does not leak to the next request on this thread.
                MDC.remove(TRACE_ID_MDC_KEY);
            }
        }
    }

Hibernate Configuration

To make Hibernate run the SqlCommentStatementInspector on every SQL statement, you only need to declare a single property in your configuration file.
File: application.properties

Add the following settings to your configuration file; the statement_inspector property is the one that registers the inspector:

Properties

    spring.application.name=demo-trace
    spring.datasource.url=jdbc:mysql://mysql:3306/tracing_db?createDatabaseIfNotExist=true
    spring.datasource.username=root
    spring.datasource.password=root
    spring.jpa.hibernate.ddl-auto=update

    # Register statement_inspector
    spring.jpa.properties.hibernate.session_factory.statement_inspector=org.example.demotrace.SqlCommentStatementInspector
    spring.jpa.show-sql=true

    management.tracing.sampling.probability=1.0
    logging.pattern.level=%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]

Test Run: Create a Data Query API

We will create a UserController to simulate a real user request. When this API is called, the filter assigns a trace ID to the request (reusing the X-Trace-Id header if one is present), stores it in the MDC, and the inspector then embeds it into every SQL query issued while handling that request.

File: UserController.java

Java

    package org.example.demotrace.controller;

    import lombok.RequiredArgsConstructor;
    import lombok.extern.slf4j.Slf4j;
    import org.example.demotrace.entity.User;
    import org.example.demotrace.repository.UserRepository;
    import org.springframework.web.bind.annotation.*;

    import java.util.List;

    @Slf4j
    @RestController
    @RequestMapping("/api/users")
    @RequiredArgsConstructor
    public class UserController {

        private final UserRepository userRepository;

        @PostMapping
        public User createUser(@RequestBody User user) {
            log.info("Request Success!");
            User rs = userRepository.save(user);
            // Trigger the simulated slow query so it shows up in the process list.
            userRepository.findUserSlowly(rs.getId());
            return rs;
        }

        @GetMapping
        public List<User> getAllUsers() {
            return userRepository.findAll();
        }
    }

Entity: User.java

This is the structure of the data table we will be querying.
You can use Lombok to keep the code clean and concise, as shown below:

Java

    package org.example.demotrace.entity;

    import jakarta.persistence.*;
    import lombok.Data;

    @Entity
    @Table(name = "users")
    @Data
    public class User {

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        private Long id;

        private String name;

        private String email;
    }

Repository: UserRepository.java

A simulated slow query lets us test tracing at the MySQL database layer.

Java

    package org.example.demotrace.repository;

    import org.example.demotrace.entity.User;
    import org.springframework.data.jpa.repository.JpaRepository;
    import org.springframework.data.jpa.repository.Query;
    import org.springframework.data.repository.query.Param;

    import java.util.Optional;

    public interface UserRepository extends JpaRepository<User, Long> {

        // MySQL's SLEEP() takes seconds, so this query stays in the process list
        // for hours unless it is killed.
        @Query(value = "SELECT u.*, SLEEP(50000) FROM users u WHERE u.id = :id", nativeQuery = true)
        Optional<User> findUserSlowly(@Param("id") Long id);
    }

Docker Compose and Dockerfile for Kibana APM Integration

Below are the Docker Compose and Dockerfile configurations required to run the application and visualize tracing data within Kibana APM.

File: docker-compose.yml

YAML

    services:
      mysql:
        image: mysql:8.0
        environment:
          MYSQL_ROOT_PASSWORD: root
        volumes:
          # Mount the init file into the container
          - ./init.sql:/docker-entrypoint-initdb.d/init.sql
        ports:
          - "3306:3306"
        healthcheck:
          test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
          timeout: 20s
          retries: 10

      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
        environment:
          - discovery.type=single-node
          - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
        ports:
          - "9200:9200"

      apm-server:
        image: docker.elastic.co/apm/apm-server:7.17.0
        depends_on: [elasticsearch]
        ports: ["8200:8200"]
        command: >
          apm-server -e
          -E output.elasticsearch.hosts=["elasticsearch:9200"]
          -E apm-server.host="0.0.0.0:8200"

      kibana:
        image: docker.elastic.co/kibana/kibana:7.17.0
        depends_on: [elasticsearch]
        ports: ["5601:5601"]

      app:
        build: .
        dns:
          - 8.8.8.8
          - 8.8.4.4
        depends_on:
          mysql:
            condition: service_healthy
          apm-server:
            condition: service_started
        ports:
          - "8080:8080"

File: Dockerfile

Dockerfile

    # Runtime stage
    FROM eclipse-temurin:17-jre-jammy
    WORKDIR /app

    # Copy the application jar
    # (build the app first with Gradle, locally or from the IDE)
    COPY build/libs/demo-trace-0.0.1-SNAPSHOT.jar app.jar

    # Download the Elastic APM agent
    ADD https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.43.0/elastic-apm-agent-1.43.0.jar elastic-apm-agent.jar

    ENTRYPOINT ["java", \
        "-javaagent:/app/elastic-apm-agent.jar", \
        "-Delastic.apm.service_name=demo-trace-service", \
        "-Delastic.apm.server_urls=http://apm-server:8200", \
        "-Delastic.apm.application_packages=org.example.demotrace", \
        "-Delastic.apm.enable_log_correlation=true", \
        "-jar", "app.jar"]

Monitoring and "Crushing" Slow Queries

Now that the coding is finished, let's deploy the environment to verify our results. We will use Docker to simulate a complete system.

Deployment with Docker

First, build your project (ensure you have JDK 17+ installed):

    ./gradlew clean build

Next, spin up the technology stack (including the app, MySQL, and the observability tools):

    docker compose up -d

"Tracing" in Action

Imagine you receive an alert that the database is hanging. You log into MySQL and run this query to inspect the currently executing processes:

MySQL

    SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO
    FROM information_schema.processlist
    WHERE COMMAND != 'Sleep'
      AND INFO IS NOT NULL
    ORDER BY TIME DESC;

In the result, the INFO column shows each running statement together with the comment appended by the inspector.

Why Is This a "Lifesaver"?

  • Identify the culprit: Looking at the INFO column, you can immediately see the traceId: 6794d2e1b....
  • Backtrace with ease: Simply copy this trace ID and paste it into your log management system (such as Grafana Loki or ELK).
    Instantly, you'll uncover the request's entire journey: where it started, which user triggered it, and exactly why it's lagging.
  • Decisive action: If this query is hanging the system, you can confidently execute KILL 12 (12 being the process ID) because you know exactly which feature it belongs to and what the impact of killing it will be.

Lightning-Fast Backtracing

This is the most valuable part of the entire process. Once you've identified a "culprit" query in the database, finding its origin takes only a few seconds:

  • Extract the trace: Copy the traceId from the INFO column of the MySQL process list output.
  • Search on Kibana: Navigate to your Kibana dashboard (typically at http://localhost:5601).
  • Paste and search: Paste the traceId into the search bar.
  • The big reveal: Kibana will display every log entry associated with that ID. You will discover:
      • Which user was performing the action.
      • Which service sent the request.
      • The input parameters provided to that specific API.
      • The preceding processing steps and how much time each one consumed.

[Figure: application logs from the service environment]

Every trace now provides end-to-end visibility, spanning from the initial user request, through the application layer, down to the database.

Leveling Up: Tracing Through CDC and Kafka

Real-world systems don't stop at the database. When you need to synchronize data across other services via change data capture (CDC) and Kafka, the trace ID acts as a "golden thread" connecting every link in the chain.

  • CDC (e.g., Debezium): The binlog can carry the original SQL text, including the comment with the trace ID we embedded, but only when it is configured to do so (for MySQL, binlog_rows_query_log_events must be enabled, and Debezium's MySQL connector needs include.query set to true). You can then extract this ID and include it in the event metadata.
  • Kafka headers: Spring Boot 4 supports context propagation for Kafka. With observation enabled on the Kafka template and listener, when Service A sends a message to Kafka, the trace context is automatically "injected" into the Kafka headers.
  • Scalability: Service B (the consumer) will automatically restore the context from those headers and continue logging under the same trace ID.

Summary

The combination of Spring Boot 4, SQL comment tracing, and Kafka CDC creates a robust monitoring ecosystem:

  • Transparency: You gain a clear understanding of the "origin story" behind every single database query.
  • Loose coupling: You can scale and expand your services without the fear of requests "vanishing" or losing their trail.
  • Performance: You can fully leverage Kafka's asynchronous processing power while maintaining end-to-end observability.
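As a footnote to the CDC step above: once the original SQL text reaches a consumer, the trace ID still has to be pulled back out of the comment. The following is a minimal sketch under the assumption that the comment has exactly the shape written by SqlCommentStatementInspector (/* host: ...; traceId: ... */ at the end of the statement). The class name TraceCommentParser and its method are hypothetical helpers, not part of Debezium, Hibernate, or Spring.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceCommentParser {

    // Matches a trailing comment of the form: /* host: <host>; traceId: <id> */
    private static final Pattern TRACE_COMMENT = Pattern.compile(
            "/\\*\\s*host:\\s*([^;]*?)\\s*;\\s*traceId:\\s*(\\S+?)\\s*\\*/\\s*$");

    // Returns the trace ID, or empty if the comment is missing or carries the "no-trace" placeholder.
    static Optional<String> extractTraceId(String sql) {
        if (sql == null) {
            return Optional.empty();
        }
        Matcher m = TRACE_COMMENT.matcher(sql);
        if (!m.find()) {
            return Optional.empty();
        }
        String id = m.group(2);
        return "no-trace".equals(id) ? Optional.empty() : Optional.of(id);
    }

    public static void main(String[] args) {
        // "app-1" and "abc123" are made-up values used only for illustration.
        String sql = "SELECT * FROM orders WHERE status = 'PENDING' /* host: app-1; traceId: abc123 */";
        System.out.println(extractTraceId(sql).orElse("<none>")); // prints: abc123
    }
}
```

A consumer could copy the extracted value into the event metadata or a message header, so that downstream logs can be searched by the same ID.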

By ha dinh thai

Top Performance Experts


Filipp Shcherbanich

Senior Backend Engineer

IT expert with over 13 years of experience as a developer, team lead, and engineering manager. Currently a Senior Backend Engineer at a major international company. Active mentor and expert in tech communities.

Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director Community & Developer. He's renowned in the development community as a speaker, lecturer, author, baseball expert, maintainer and CNCF Ambassador. His current role allows him to help the world understand the challenges they are facing with observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise of open source technologies and organizations. More on https://www.schabell.org.

The Latest Performance Topics

Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End
Learn how to trace AI agents end to end, from prompts and tool calls to business outcomes, with observability practices for production workflows.
June 5, 2026
by Srinivas Chippagiri (DZone Core)
· 311 Views
Why Round-Robin Won't Save You: Load Balancing Challenges in Data Streaming Services With Heterogeneous Traffic
Throughput-based load balancing breaks down when streaming messages have heterogeneous processing costs — the fix is balancing on actual per-partition resource usage.
June 5, 2026
by Semyon Slepov
· 356 Views
Compliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability
Mapping Model is the missing architectural layer that transforms multi-framework compliance from exponential complexity to a linear scale.
June 3, 2026
by Vikas Agarwal
· 1,117 Views
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
Glue failures scatter evidence across logs, metadata, and table state. A triage layer pulls it together and flags whether a rerun is safe.
June 2, 2026
by Vivek Venkatesan
· 1,358 Views · 1 Like
Data Contracts as the "Circuit Breaker" for Model Reliability
AI models do not fail due to bad coding; they fail due to an upstream change in the input. Combine contracts with circuit breakers to stop bad data from entering models.
June 1, 2026
by SRIRAMPRABHU RAJENDRAN
· 1,181 Views
Every Cache Miss Is a Tiny Tax on Your Performance
Cache misses add latency, load, and cost — optimize your cache hit ratio to reduce unnecessary backend work and keep systems fast at scale.
June 1, 2026
by Jayapragash Dakshnamurthy
· 1,138 Views
Implementing Observability in Distributed Systems Using OpenTelemetry
Instrument a Python Flask service with OpenTelemetry auto trace requests, export metrics to Prometheus, and inject trace IDs into logs for observability in one setup.
May 29, 2026
by Mugunth Chandran
· 2,423 Views · 1 Like
Chaos Engineering Has a Blind Spot. Agentic AI Lives in It.
Chaos tests can prove your RAG pipeline survived failure, but not that it stayed correct. Learn how behavioral checks catch silent AI drift.
May 28, 2026
by Sayali Patil
· 4,267 Views · 3 Likes
Feature Flag Debt: Performance Impact in Enterprise Applications
Feature flags help teams move fast, but when they’re not cleaned up, they quietly add extra code, slow down performance, and make applications harder to maintain.
May 27, 2026
by Poornakumar Rasiraju
· 2,820 Views · 1 Like
When Perfect Data Breaks: The Journey from Data Quality to Data Observability
Data quality checks often miss silent failures. Use data observability to monitor data in motion and catch issues traditional tools miss.
May 25, 2026
by Divyakumar Savla
· 1,551 Views
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.
May 25, 2026
by Ingero Team
· 3,322 Views
A Scalable Framework for Enterprise Salesforce Optimization: Turning Outcomes Into an Operating System
Outcome-driven intake, clear processes, config-first builds, disciplined releases, and telemetry cut Salesforce cycle time ~90% and boost efficiency 30%+.
May 25, 2026
by Pulkit Singhal
· 1,341 Views
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch
Three AWS managed databases, three dashboards, and one cascade you can only trace by hand. This guide fills the gap CloudWatch leaves open.
May 22, 2026
by Damaso Sanoja
· 3,561 Views · 1 Like
Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing
See the difference between throughput and goodput, and why throughput alone can give you a dangerously false sense of confidence.
May 21, 2026
by NaveenKumar Namachivayam (DZone Core)
· 2,855 Views
Optimizing High-Volume REST APIs Using Redis Caching and Spring Boot (With Load Testing Code)
Cache reads with Redis, use @CachePut for write-through consistency, and prevent stampedes with distributed locks, then prove it works under load with JMeter.
May 18, 2026
by Mallikharjuna Manepalli
· 1,278 Views
Manual Investigation: The Hidden Bottleneck in Incident Response
Learn about why engineers are stuck investigating instead of fixing and how AI is changing the investigation process for modern systems.
May 18, 2026
by Brian Kaufman
· 1,139 Views
Observability in Spring Boot 4
Bridge observability gaps in Spring Boot 4 by injecting Micrometer Trace IDs via SQL comments and propagating context through Kafka.
May 15, 2026
by ha dinh thai
· 2,283 Views · 1 Like
AI Agents Expose a Design Gap in Microservices Resilience Architecture
Microservices assume predictable callers. AI agents break this with non-deterministic calls, fan-out, and retries. Here are 5 core assumption breaks and fixes.
May 13, 2026
by Vineet Bhatkoti
· 2,998 Views · 1 Like
The Cost of Knowing: When Observability Becomes the Outage
Observability costs spiral when teams optimize for visibility, not cost. Fix it by making spend visible, sampling aggressively, and cutting low-value data.
May 13, 2026
by David Iyanu Jonathan
· 1,239 Views
Monitoring Spring Boot Applications with Prometheus and Grafana
Demonstrates how to expose Spring Boot metrics with Prometheus and build Grafana dashboards to track memory usage and error rates for production-grade Java services.
May 11, 2026
by Ramya vani Rayala
· 1,780 Views · 1 Like