DZone

DZone Spotlight

Monday, June 1
Every Cache Miss Is a Tiny Tax on Your Performance

By Jayapragash Dakshnamurthy
Every cache miss is a small but persistent cost on your system. Individually, a single miss may seem insignificant. At scale (thousands or millions of requests), these misses accumulate into measurable latency, increased database load, and degraded user experience.

Most systems do not slow down because of one expensive query. They degrade over time due to repeated inefficiencies. Cache misses are one of the most common and overlooked contributors to this pattern.

In modern distributed systems, where services depend on multiple layers of infrastructure, even small inefficiencies propagate quickly. What starts as a single cache miss can ripple through downstream systems, amplifying latency and increasing resource consumption.

The Hidden Cost of Cache Misses

A cache miss is not simply a delay. It triggers a sequence of additional work:

  • A database query or downstream API call
  • Network latency and serialization overhead
  • Increased CPU and I/O utilization
  • Additional pressure on shared infrastructure

When this pattern repeats under load, even a small drop in cache efficiency can lead to:

  • Increased response times
  • Higher infrastructure cost
  • Elevated load on backend systems
  • Greater likelihood of cascading failures

These effects compound quickly in distributed systems, where dependencies amplify delays across services. In high-throughput environments, this can result in bottlenecks that are difficult to diagnose because the root cause appears trivial at an individual request level.

What Is Cache Hit Ratio (and Why It Matters)

Cache hit ratio measures how often requests are served from cache instead of reaching the primary data store.

Cache Hit Ratio (%) = (Cache Hits / Total Lookups) × 100

While it appears to be a simple metric, it reflects how effectively a system avoids unnecessary work.
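The formula above translates directly into code. The following is a minimal sketch; the function name and the sample counts are illustrative assumptions, not part of any particular caching library.

```python
def cache_hit_ratio(hits: int, total_lookups: int) -> float:
    """Return the cache hit ratio as a percentage.

    Mirrors the formula: (cache hits / total lookups) x 100.
    """
    if total_lookups <= 0:
        raise ValueError("total_lookups must be positive")
    if not 0 <= hits <= total_lookups:
        raise ValueError("hits must be between 0 and total_lookups")
    return 100 * hits / total_lookups


# Illustrative numbers only: 9 of 10 lookups served from cache.
print(cache_hit_ratio(9, 10))  # 90.0
```

In practice the two counters would come from whatever statistics the cache layer already exposes, sampled over a fixed time window.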
A higher hit ratio typically results in:

  • Lower latency
  • Reduced backend load
  • Improved scalability

A lower hit ratio indicates that the system is repeatedly performing operations that could have been avoided. Over time, this inefficiency translates into increased operational cost and reduced system reliability.

Architecture Overview

The diagram below compares a cache hit and a cache miss flow.

  • The cache hit path (green) represents a short execution path where data is served directly from cache.
  • The cache miss path (red) illustrates a longer path involving database queries, increased latency, and additional system load.

This comparison highlights a fundamental principle: not all requests carry the same cost. Some require significantly more resources than others.

[Figure: Cache hit vs cache miss flow illustrating how cache misses introduce latency, backend load, and system cost at scale.]

A Simple Example

Consider a service receiving 10 requests:

  • The first request results in a cache miss, queries the database, and stores the result
  • The next 9 requests are served from cache

This results in:

9 cache hits out of 10 → 90% cache hit ratio

At a small scale, this level of efficiency appears acceptable. However, even in this simple scenario, the first request is significantly more expensive than the rest, demonstrating how cache misses introduce uneven cost distribution.

What Happens at Scale

The impact becomes more visible as traffic increases.

At a small scale:

  • A cache miss introduces a minor delay

At a large scale:

  • A 1% drop in cache hit ratio can result in thousands or millions of additional backend calls

This leads to:

  • Increased latency across requests
  • Higher load on databases and services
  • Greater risk of timeouts and failures

In distributed architectures, this can trigger cascading effects, where delays propagate across multiple services and amplify system instability.
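To put a number on the scale effect described above, the sketch below estimates how many requests reach the backend at a given hit ratio. The traffic volume (100 million requests per day) and the two hit ratios are assumptions chosen for illustration, and the "1% drop" is read here as one percentage point.

```python
def backend_calls(total_requests: int, hit_ratio_percent: float) -> int:
    """Estimate how many requests miss the cache and reach the backend."""
    misses = total_requests * (100 - hit_ratio_percent) / 100
    return round(misses)


daily_requests = 100_000_000  # assumed traffic volume

before = backend_calls(daily_requests, 99.0)
after = backend_calls(daily_requests, 98.0)

print(before)          # 1000000
print(after)           # 2000000
print(after - before)  # 1000000 extra backend calls per day
```

Under these assumptions, a one-point drop in hit ratio doubles the load on the backend, even though 98% still looks like a healthy number on a dashboard.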
Systems that perform well under normal conditions may degrade rapidly during traffic spikes due to inefficient caching strategies.

Trade-Offs: Performance vs Freshness

Caching introduces an inherent trade-off between performance and data consistency. Serving data from cache improves latency and reduces backend load, but it also introduces the possibility of stale data.

Key considerations include:

  • Strong consistency ensures data accuracy, but increases latency
  • Eventual consistency improves performance but requires tolerance for temporary staleness

Techniques such as cache invalidation, write-through caching, and event-driven updates can help manage this balance effectively. The right approach depends on business requirements and tolerance for data freshness.

Implementation Considerations

Effective caching requires more than introducing a cache layer.

Cache warming is essential during deployments or cold starts. Without it, systems experience an initial surge in cache misses that can overwhelm backend dependencies.

Time-to-live (TTL) tuning must be handled carefully:

  • Short TTL values lead to frequent expirations and increased misses
  • Long TTL values risk serving stale data

Cache key design plays a critical role. Poorly structured or inconsistent keys lead to cache fragmentation, reducing overall effectiveness.

Failure handling must also be considered. Systems should handle cache failures gracefully without triggering retry storms or excessive backend load.

Real-World Impact

In production environments, cache inefficiencies often manifest as:

  • Spikes in database CPU usage
  • Increased API latency during peak traffic
  • Unexpected infrastructure scaling
  • Performance degradation after deployments

Organizations often scale infrastructure to address these issues. However, in many cases, the underlying problem is inefficient caching rather than insufficient capacity. Improving cache efficiency is one of the most cost-effective ways to enhance system performance and stability.
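The TTL trade-off from the Implementation Considerations section can be made concrete with a small simulation. This is a sketch under stated assumptions: `TTLCache`, `simulate`, the key `user:42`, and the one-lookup-per-second traffic pattern are all hypothetical, and the clock is injected so the example runs instantly instead of sleeping.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper with per-entry expiry and hit/miss counters."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: load from the backend and cache the result.
        self.misses += 1
        value = loader(key)
        self.store[key] = (now + self.ttl, value)
        return value


def simulate(ttl_seconds: float) -> Tuple[int, int]:
    """Issue one lookup per second for 60 seconds against a single key."""
    fake_now = [0.0]
    cache = TTLCache(ttl_seconds, clock=lambda: fake_now[0])
    for second in range(60):
        fake_now[0] = float(second)
        cache.get("user:42", loader=lambda k: f"row for {k}")
    return cache.hits, cache.misses


print(simulate(5))   # (48, 12): short TTL, a backend call every 5 seconds
print(simulate(30))  # (58, 2): long TTL, but data may be up to 30 seconds stale
```

The short TTL sends six times as many requests to the backend in this toy run, while the long TTL widens the window in which a stale value can be served. Which side of that trade-off is acceptable depends on the freshness requirements of the data.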

What Is a Good Cache Hit Ratio?

There is no universal threshold, but general benchmarks include:

  • Database query caches: 85–95%
  • API response caches: 95–99%
  • Content delivery networks: 99%+

The objective is not to achieve perfection, but to minimize avoidable backend operations.

How to Reduce the Cache Miss Tax

1. Preload Frequently Accessed Data
Warm caches during deployments to reduce cold-start impact.

2. Tune TTL Carefully
Balance expiration timing with data freshness requirements.

3. Use Predictable Cache Keys
Ensure consistency and avoid unnecessary misses.

4. Monitor Continuously
Track cache hit ratio alongside latency, backend load, and error rates.

Conclusion

A high cache hit ratio improves performance, but it should not come at the cost of serving outdated data. The goal is not to cache everything, but to cache strategically based on access patterns and system requirements.

Every cache miss represents additional work performed by the system. At scale, these small costs accumulate into measurable performance degradation. Reducing cache misses is not only an optimization; it is a foundational requirement for building scalable, efficient systems.
Implementing Secure API Gateways for Microservices Architecture

By Mugunth Chandran
Modern microservice architectures consist of many independently deployable services, which brings new security challenges. One crucial best practice is to use an API Gateway as a centralized entry point to enforce security policies. In this article, we explore how to implement a secure API gateway in a microservices environment and demonstrate authentication configuration with code examples.

Why Use an API Gateway for Microservices Security

In a microservices architecture, each service exposes its own REST endpoints. Without a gateway, clients would have to authenticate individually with each service, a complex and error-prone approach. An API Gateway acts as a single entry point for all client requests, simplifying communication and centralizing cross-cutting concerns like security. As Chris Richardson notes, the API Gateway is also an ideal place to implement cross-cutting concerns, such as authentication. By routing all external traffic through a gateway, you can offload tasks like authentication, authorization, encryption, and rate limiting to this layer.

Some key benefits of using an API gateway for security include:

  • Unified access control: The gateway enforces robust access controls in one place. This avoids duplicating auth logic in every microservice and ensures consistent policies.
  • Isolation of internal services: Microservices are not exposed directly to clients. The gateway shields internal APIs, preventing unauthorized access and reducing the attack surface. Backend services can trust that requests have passed through security checks.
  • Monitoring and logging: As the single entry point, the gateway can log requests and monitor traffic for security analytics.
  • Other edge functions: API Gateways often handle routing to the correct service, load balancing, input validation, and rate limiting to mitigate DDoS attacks. Centralizing these functions improves efficiency and maintainability.

In summary, a gateway in front of your microservices allows you to apply consistent security measures across all services and simplify each service's implementation. The microservices can remain focused on business logic while the gateway manages authentication and other front-line defenses.

Open-Source API Gateway Solutions

There are numerous technologies available to implement the API Gateway pattern. Since we are focusing on open-source solutions, here are a few popular choices:

  • Kong gateway: An open source, high-performance API gateway built on NGINX. Kong supports flexible routing rules and a rich plugin ecosystem for authentication, rate limiting, transformations, and more.
  • Envoy proxy: A modern L7 proxy often used in service meshes. Envoy can serve as an edge gateway with filters for features like JWT verification.
  • Traefik: A cloud-native edge router written in Go. Traefik integrates well with orchestrators and provides middleware for things like basic authentication, OAuth/OIDC via forward-auth, and automatic TLS. It's often used for its easy dynamic configuration and Let's Encrypt integration.
  • Others: There are open-source API management solutions like Tyk, Gravitee, WSO2 API Manager, KrakenD, etc. Each offers varying features, but at their core, they all provide gateway functionality to route and secure microservice APIs.

Each of these solutions can secure your microservices, but their configuration and feature set differ. In the next section, we'll focus on implementing authentication using Kong API Gateway as an example. Kong is a popular choice due to its lightweight nature and plugin flexibility, but the concepts will be similar for other gateways.

Implementing Authentication in an API Gateway

To illustrate how to implement a secure API gateway, we'll walk through setting up Kong Gateway in front of a microservice and enabling authentication. The goal is to require a valid JSON Web Token (JWT) for clients calling the microservice via the gateway.

1. Defining Services and Routes in Kong

First, we need to configure Kong with a Service and a Route for our microservice. In Kong's terminology, a Service object represents an upstream microservice, and a Route defines how requests from clients map to that service. Kong will listen for client requests on the route and forward them to the specified service.

For example, we create a declarative configuration file for Kong in DB-less mode. Below is a snippet of kong.yml defining a service and route, and then attaching the JWT authentication plugin to that service:

YAML

    _format_version: "2.1"

    services:
      - name: my-api-service
        url: http://localhost:3000    # upstream microservice URL

    routes:
      - name: api-route
        service: my-api-service
        paths:
          - /api                      # clients will call /api on the gateway

    plugins:
      - name: jwt
        service: my-api-service
        enabled: true
        config:
          key_claim_name: kid         # use 'kid' field in JWT header to identify the key
          claims_to_verify:
            - exp                     # ensure the 'exp' (expiry) claim is valid

In this config, we map my-api-service to the upstream at localhost:3000, and we route any requests with path /api on the gateway to that service. The jwt plugin is enabled on the service, which means Kong will require a valid JWT on all requests to this service. We configured the plugin to check the token's exp claim and to expect a kid claim in the JWT header to identify the signing key.

After loading this config and starting Kong, any request to http://<gateway>:8000/api will be intercepted by Kong. Since the JWT plugin is active, Kong will attempt to find a JWT in the request and validate it. If no token is present or the token is invalid, Kong will respond with 401 Unauthorized. At this point, our microservice is protected: only authenticated calls are forwarded.

2. Configuring Credentials (JWT Issuers and Secrets)

Now that Kong is blocking unauthorized requests, we need to configure who is considered an authorized consumer and what credentials (JWTs) are accepted.
In a real system, you would likely integrate with an Identity Provider or authorization server that issues JWTs. Kong uses the concept of Consumers to represent clients or user identities that consume your APIs. Each consumer can have credentials associated with it. We will add a consumer entry and a JWT secret for that consumer in our kong.yml:

YAML

    consumers:
      - username: auth-service

    jwt_secrets:
      - consumer: auth-service
        key: my-issuer-key-123        # 'kid' value expected in the JWT header
        secret: my-jwt-signing-secret

In this snippet, we added a consumer named auth-service. We then added a JWT credential for that consumer with a secret and a key. The key is a unique identifier and the secret is the HMAC secret that this consumer will use to sign JWTs. Essentially, we are telling the gateway to accept JWTs signed with my-jwt-signing-secret, as long as they carry kid: my-issuer-key-123 in their header, and treat them as coming from the auth-service consumer.

With this configuration, the JWT plugin knows how to verify incoming tokens: it will look at the kid claim in the JWT header to find the matching consumer's secret, then verify the token's signature using that secret. It also checks the exp claim to ensure the token has not expired.

3. Testing the Secured Gateway

Now the secure gateway is configured. Let's quickly illustrate the behavior with example requests:

Shell

    # Attempt to call the service without any token
    $ curl -i http://localhost:8000/api
    HTTP/1.1 401 Unauthorized
    ...

    # Now, obtain or craft a JWT signed with 'my-jwt-signing-secret' and kid 'my-issuer-key-123'.
    # For demonstration, we can use an online tool or a JWT library to create a token:
    #   Header:  { "alg": "HS256", "typ": "JWT", "kid": "my-issuer-key-123" }
    #   Payload: { "sub": "user123", "exp": <some future timestamp>, ... }
    # Sign it with the secret.

    # Call the API with the JWT in the Authorization header
    $ curl -i -H "Authorization: Bearer <your_jwt_token_here>" http://localhost:8000/api
    HTTP/1.1 200 OK

    Hello world!

As shown, without a valid JWT, the gateway returns a 401 Unauthorized response. With a valid JWT, Kong will authenticate the request and route it to the upstream service, which returns the expected data (HTTP 200 OK). The microservice itself did not need to implement any auth checks; the gateway handled it.

Conclusion

Implementing a secure API gateway for microservices involves setting up a robust gateway solution and offloading security concerns to it. We demonstrated how to use Kong to enforce JWT authentication in front of a microservice. The gateway approach streamlines authentication across all services. Once a request is verified at the edge, microservices can trust that identity and operate in a zero-trust, defense-in-depth manner.

Open source API gateways like Kong, Envoy, and Traefik provide the building blocks to authenticate and authorize traffic, handle encryption, and apply policies uniformly. By centralizing these concerns, engineering teams can avoid duplicating security code across microservices and instead manage it in one place. As a result, the overall system becomes easier to secure and maintain.

For advanced scenarios, from integrating with enterprise SSO/IdPs to implementing multi-tenant auth or fine-grained access control, the gateway can be extended with plugins or external auth services. The key is to establish the gateway as the trust barrier between clients and your microservices. With a secure API gateway in place, a microservices architecture can achieve both agility and strong security compliance.
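As a supplement to the testing step, the token it describes (HS256, with a kid header and an exp claim) can be crafted and checked using only the Python standard library. This is a minimal sketch for experimentation, not Kong's implementation: `make_token`, `verify_token`, and the five-minute lifetime are assumptions introduced here, while the key and secret values are the ones from the kong.yml snippet.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Dict, Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(secret: str, kid: str, subject: str, ttl_seconds: int = 300) -> str:
    """Create an HS256-signed JWT with 'kid' in the header and an 'exp' claim."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    payload = {"sub": subject, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def verify_token(token: str, secrets_by_kid: Dict[str, str],
                 now: Optional[float] = None) -> dict:
    """Look up the secret by 'kid', check the signature, then check 'exp'.

    Raises ValueError on any failure; returns the payload claims on success.
    """
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    secret = secrets_by_kid.get(header.get("kid"))
    if secret is None:
        raise ValueError("unknown kid")
    expected = hmac.new(secret.encode("utf-8"),
                        f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if payload.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload


token = make_token("my-jwt-signing-secret", "my-issuer-key-123", "user123")
claims = verify_token(token, {"my-issuer-key-123": "my-jwt-signing-secret"})
print(claims["sub"])  # user123
```

The printed token can be pasted into the Authorization: Bearer header of the curl call from the testing step. For anything beyond a local experiment, a maintained JWT library and an identity provider are the safer choice.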

Trend Report

Platform Engineering and DevOps

Platform engineering and DevOps are merging as organizations scale, modernize, and push to reduce cognitive load across increasingly complex systems. What began as fragmented internal tooling has evolved into Platform-as-a-Product thinking, where internal developer platforms (IDPs), automation pipelines, and golden paths provide the backbone of modern DevOps workflows. Platform teams, DevOps engineers, security teams, and SREs are now working together to deliver consistent, secure, and self-service experiences that improve developer productivity and satisfaction and reinforce operational reliability.

This report examines how platform engineering is reshaping DevOps by standardizing environments, unifying toolchains, and shifting repetitive tasks into automated workflows. We explore how teams are implementing developer experience (DevEx) metrics, rethinking CI/CD pipelines, and leveraging AI-driven automation to optimize infrastructure performance and enhance delivery velocity. As enterprises link platform health to business outcomes, measuring ROI and platform adoption is becoming a core initiative.

Refcard #403

Shipping Production-Grade AI Agents

By Vidyasagar (Sarath Chandra) Machupalli FBCS, DZone Core

Refcard #388

Threat Modeling Core Practices

By Apostolos Giannakidis, DZone Core

More Articles

Pragmatica Aether: Let Java Be Java

The Aberration

We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java's design DNA.

Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure.

The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements, poorly, the infrastructure management that Java runtimes used to provide natively.

This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. The runtime handles infrastructure. This isn't radical; it's a return to what Java was designed for.

The Problem: Infrastructure Wearing a Business Logic Mask

Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment.

Now consider the coupling this creates.
Update the Java version: rebuild and test every service. Change your message broker from RabbitMQ to Kafka: modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool: update dependencies in every microservice. Switch cloud providers: rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level.

This is the coupling trap. Your application's pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them.

Framework lock-in worsens this. It isn't a vendor problem; it's an architecture problem. Spring's dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework's configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK's retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production, not during development.

This is an architecture problem. Architectural problems have architectural solutions.

Aether: The Core Idea

What you write: an interface annotated with @Slice, plus business logic implementation.

Java

    @Slice
    public interface OrderService {
        Promise<OrderResult> placeOrder(PlaceOrderRequest request);

        // Factory method: dependencies are other slices, passed in as plain interfaces
        static OrderService orderService(InventoryService inventory, PricingEngine pricing) {
            return request -> inventory.check(request.items())
                                       .flatMap(available -> pricing.calculate(available))
                                       .map(priced -> OrderResult.placed(priced));
        }
    }

What you don't write: everything else. No HTTP clients: inter-slice calls are direct method invocations via generated proxies.
No service discovery: the runtime tracks where every slice instance lives. No retry logic: built-in retry with exponential backoff and node failover. No circuit breakers: the reliability fabric handles failure automatically. No serialization code: request/response types are serialized transparently. A method call via an imported interface is the only visible contract.

The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn't a limitation; it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly.

Everything else is the environment's job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level.

The JBCT Leaf pattern serves two purposes here: it documents the design ("what we expect from an external implementation") and encourages exactly one interface per dependency. Different implementations may have different technical properties (performance, latency, memory consumption), but as long as they're compatible with the interface, business logic works unchanged.

You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently.

Under The Hood: What Makes It Work

Five architectural decisions make this possible.

Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm published in 2021.
Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide.

Built-in Artifact Repository. DHT-based storage with configurable replication: 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64 KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained.

ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices.

Declarative Deployment. Blueprints (TOML files) describe the desired state: which slices, how many instances.

TOML

    id = "org.example:commerce:1.0.0"

    [[slices]]
    artifact = "org.example:inventory-service:1.0.0"
    instances = 3

    [[slices]]
    artifact = "org.example:order-processor:1.0.0"
    instances = 5

Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically.

Infrastructure Independence. Aether nodes are identical, so there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java: roll it out across nodes without touching applications. Update the Aether runtime: same.
Update business logic: deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don't share a deployment unit, they don't share a deployment schedule.

Fault Tolerance: The 50% Rule

The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact: actual redundancy, not just graceful degradation.

A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires ⌊N/2⌋ + 1 nodes (N/2 rounded down, plus one), so as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically: the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence, from failure detection through state restoration to serving traffic, completes without human intervention.

When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required.

Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing:

Shell

    aether update start org.example:order-processor 2.0.0 -n 3
    aether update routing <id> -r 1:3    # 25% to v2, 75% to v1
    aether update routing <id> -r 1:1    # 50/50
    aether update complete <id>          # 100% to v2, drain v1

Deploy during business hours. Shift traffic gradually: 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step.
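Two pieces of arithmetic in this section are easy to check by hand: the quorum size behind the 50% rule, and the traffic percentages implied by a routing ratio. The sketch below works through both. The function names are hypothetical, and reading `-r 1:3` as new-version weight to old-version weight is an assumption inferred from the inline comments in the commands above, not from Aether documentation.

```python
from typing import Tuple


def quorum_size(nodes: int) -> int:
    """Smallest majority of the cluster: N/2 rounded down, plus one."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a quorum is still alive."""
    return nodes - quorum_size(nodes)


def traffic_split(new_weight: int, old_weight: int) -> Tuple[float, float]:
    """Convert a new:old weight ratio into (new %, old %)."""
    total = new_weight + old_weight
    return 100 * new_weight / total, 100 * old_weight / total


print(quorum_size(5), tolerated_failures(5))  # 3 2
print(quorum_size(7), tolerated_failures(7))  # 4 3
print(traffic_split(1, 3))                    # (25.0, 75.0)
print(traffic_split(1, 1))                    # (50.0, 50.0)
```

The same functions show why even-sized clusters buy nothing extra: a 6-node cluster needs a quorum of 4 and therefore tolerates 2 failures, the same as a 5-node cluster.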
If health degrades (error rate exceeds thresholds, latency spikes), roll back instantly with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry.

For Every Project: Legacy, Greenfield, And Everything Between

Legacy Migration

Your legacy Java system doesn't need a complete rewrite. It needs a path forward.

Pick a relatively independent part of your system: something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation:

Java

    private Promise<Report> generateReport(ReportRequest request) {
        // Wrap the existing (possibly throwing) call so it returns a Promise
        return Promise.lift(() -> legacyReportService.generate(request));
    }

One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk: the initial deployment in Ember, Aether's single-process runtime, runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change, and that's when the 50% rule starts to apply.

From there, it's the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious.

No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail, from initial wrapping through gradual peeling to clean JBCT code.

Greenfield Development

For new projects, slices enable a granularity that's impossible with traditional microservices. Each slice can be as lean as a single method, and that's the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service.

You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices, each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it's the default.

JBCT patterns (Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects) compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them.

The Spectrum

Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. A slice is the executable unit. It can be as big or as small as is necessary and convenient.

The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility.

Scaling: Two Levels, Three Tiers of Intelligence

Two-Level Horizontal Scaling

Aether scales in two dimensions independently:

  • Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded, so scaling takes milliseconds, not seconds.
  • Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work.

Independent controls, combined effect.
Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. 
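The Tier 1 rules and the Tier 2 threshold adjustment described above can be sketched in a few lines. This is an illustration under stated assumptions, not Aether's controller: all class and field names are hypothetical, the 500 ms latency limit is a placeholder, "lowers the threshold by 20%" is read as a relative reduction (70% becomes roughly 56%), and "sustained" low CPU is approximated only by the scale-down cooldown.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a reactive scaling decision with cooldowns.
final class ScalingSketch {
    enum Action { SCALE_UP, SCALE_DOWN, HOLD }

    // One metrics sample; field names are illustrative.
    record Metrics(double cpuPercent, double p95LatencyMs, double timeoutErrorRatePercent) {}

    record Thresholds(double scaleUpCpu, double scaleDownCpu, double p95LimitMs, double errorRatePercent) {
        // 70% / 30% CPU and 1% error rate come from the text; 500 ms is a placeholder.
        static Thresholds defaults() {
            return new Thresholds(70.0, 30.0, 500.0, 1.0);
        }

        // Tier 2 hook: assumed to be a relative 20% reduction of the scale-up CPU threshold.
        Thresholds withPredictedLoadIncrease() {
            return new Thresholds(scaleUpCpu * 0.8, scaleDownCpu, p95LimitMs, errorRatePercent);
        }
    }

    static final Duration SCALE_UP_COOLDOWN = Duration.ofSeconds(30);
    static final Duration SCALE_DOWN_COOLDOWN = Duration.ofMinutes(5);

    static Action decide(Metrics m, Thresholds t, int instances, int minInstances,
                         Instant lastChange, Instant now) {
        Duration sinceChange = Duration.between(lastChange, now);
        boolean overloaded = m.cpuPercent() > t.scaleUpCpu()
                || m.p95LatencyMs() > t.p95LimitMs()
                || m.timeoutErrorRatePercent() > t.errorRatePercent();
        if (overloaded) {
            // Scale up only once the shorter cooldown has elapsed.
            return sinceChange.compareTo(SCALE_UP_COOLDOWN) >= 0 ? Action.SCALE_UP : Action.HOLD;
        }
        boolean idle = m.cpuPercent() < t.scaleDownCpu() && instances > minInstances;
        if (idle && sinceChange.compareTo(SCALE_DOWN_COOLDOWN) >= 0) {
            return Action.SCALE_DOWN;
        }
        return Action.HOLD;
    }
}
```

With default thresholds a sample at 60% CPU holds; after `withPredictedLoadIncrease()` the same sample triggers a scale-up, which is the "respond earlier" effect the text attributes to the predictor. If the predictor is unavailable, calling `decide` with `Thresholds.defaults()` is the fallback path.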
You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. 
Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. 
Or start fresh with maximum clarity: lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist, with legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought; it's the foundation. Scaling is not your problem; it's the environment’s. Infrastructure is not your code; it's the runtime’s. The heavy winter coat comes off. The application breathes.

Resources
- Pragmatica Aether: project site
- GitHub Repository: source code

By Sergiy Yevtushenko
Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery. The role of the enterprise developer has become more complex over time as organizations adopt new technologies and tools, often without retiring their old ones. Add high staff turnover and increasing time and cost pressure, and developers are confronted with charting their own path through the SDLC. The purpose of internal developer platforms (IDPs) is to create a win-win scenario that benefits developers and their organizations. In this tutorial, you’ll define one golden path for a backend service that covers service setup, deployment, observability, and guardrails end to end. Step 1: Define the Platform Product and First Golden Path Successful IDP efforts focus on end-to-end developer workflows: building a new interface, deploying an updated microservice, running a regression suite, or standing up an environment. Ideally, the whole workflow can be supported directly from your IDP as self-service. Once you have identified the workflow to support, you need to design the “golden path,” which parts you will standardize and what you expose as configuration. It’s important to get that balance right. Components that have to change often, like service accounts, interfaces, and sizing, should be configurable. Creating templates and patterns helps reduce variability between outputs, making it easier to roll out necessary patching and updates. For the first golden path, pick one high-value workflow that is common, repeatable, and easy to measure. We will use the deployment of our backend service to an integration test environment because it touches build, deployment, validation, and evidence capture in one flow. User adoption is the key to success. 
To measure, it’s important to track both user adoption, such as how often a workflow is triggered, and outcome metrics like the number of compliant application instances, percentage of deployment failures, and average deployment duration. Step 2: Design the Golden Path (Templates and Defaults) Next, we get to design the golden path. An important factor for the developer experience is to provide documentation with contextual guidance. This can be traditional how-to guides or more advanced features such as AI-enabled chatbots. The documentation should explain how testing, application deployments, and other lifecycle activities happen along the golden path, and provide architectural guidance on embedding any newly developed capability in the existing architecture. Standards and governance are other aspects that should be available for self-service, including naming conventions, common libraries, and reusable services. On the technical side, the golden path should cover at least the following: Code repo and standard branching structureSkeleton code based on coding standards (e.g., environment config file, logging framework, data layer)CI/CD pipeline into an ephemeral cloud environment, or pointed at a standard persistent dev environmentSkeleton quality gates in the CI/CD pipeline (e.g., unit test, functional regression, security scan)Access to common utilities; injection of environment values (e.g., URLs, IP addresses, access and secrets management)Ability to spin up the environment (if cloud based) And lastly, the IDP needs to be designed with intuitive naming, a search function, tagging methods, and a hierarchical browsing structure so users can easily find the appropriate golden path. Supporting multiple ways of discovery provides a more resilient interface and eases the adoption of new golden path templates as they become available. For our backend service, choosing the workflow will show a representation of the steps included. 
Step 3: Wire Self-Service Workflows (Without Tickets) Besides golden path templates, IDPs should aim to be a one-stop shop for developers, so common requests should be available for self-service. Your existing ticket/ITSM systems can be a good source for creating the backlog. Identify the most common requests and start automating them in priority order. In many cases, a ticket continues to be useful even in the self-service model for tracking and approvals, which can be integrated into the automatic workflow. Approvals should be provided automatically based on defined criteria, and only require human approvals when the request is outside of those parameters, such as access to restricted data, use of expensive resources, and non-standard requests. Over time, developers should be able to request new features through a transparent feature backlog and voting mechanism to engage the community. When creating new features, keep things common wherever possible and provide ways for users to tailor their requests. For example, the standard deployment process might define a step for secrets injection, but some teams will tailor the process to skip it as necessary. This approach has two advantages: It creates a common language and process across teams and reduces the work to build and maintain the IDP. Spending a bit more time up front to create customizability pays off over the medium and long term. For our backend service, the first service we define is deployment to the integrated test environment. Step 4: Standardize Delivery With CI/CD + GitOps + IaC in One Flow The principle of the golden path deployment process remains unchanged: You build a software artifact once, and you deploy it multiple times along the environment path. For our backend service, promotion should happen through a versioned change (think GitOps) to the desired environment state, so application version, infrastructure definition, and deployment evidence remain traceable together. 
In the build stage, code is prepared in any pre-compile steps, then compiled and packaged with all necessary configuration files. In the deployment process, environment variables are injected, and the package is deployed to the target environment, which is scripted as Infrastructure as Code. The validation itself is usually layered: a technical validation to confirm that the deployment was correct, functional regression of core functionality, and testing the new changes. This sequence is based on speed of feedback, which is important in an automated IDP service. When a validation check fails, the golden path needs to have defined failure behavior with clear steps to execute. Pipeline failures like a broken build, failed test, or policy violation will block progression automatically. If the environment is materially impacted, a rollback is automatically initiated. Only in rare cases should a human evaluation be required — for example, when the level of ambiguity is too high and impacts stakeholders who are using the environment. Some policy violations can be treated with time-bound exceptions, such as allowing a new security vulnerability in a non-production environment. This allows functional testing to continue while the team remediates the security vulnerability. Prior to going live, the exception would be removed so the security vulnerability doesn’t progress to production. These types of exceptions should be set to auto-expire to prevent them from being forgotten later. 
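As a rough illustration of the auto-expiring exceptions described above, the sketch below models a waiver that applies to one finding in one environment until a fixed instant, and a gate that refuses waivers in production. All names (PolicyGateSketch, PolicyException, the "prod" environment label) are assumptions made for this example rather than part of any particular IDP product.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of a policy gate with time-bound, auto-expiring exceptions.
final class PolicyGateSketch {
    // A waiver for one finding in one environment, valid until expiresAt (exclusive).
    record PolicyException(String findingId, String environment, Instant expiresAt) {
        boolean covers(String finding, String env, Instant now) {
            return findingId.equals(finding)
                    && environment.equals(env)
                    && now.isBefore(expiresAt);
        }
    }

    // Returns true if the pipeline may proceed despite the finding.
    static boolean allowed(String findingId, String environment,
                           List<PolicyException> exceptions, Instant now) {
        // In this sketch production never honors a waiver, so a waived finding cannot go live.
        if (environment.equals("prod")) {
            return false;
        }
        return exceptions.stream().anyMatch(e -> e.covers(findingId, environment, now));
    }
}
```

Because expiry is evaluated on every check, nobody has to remember to remove the waiver: once the instant passes, the same finding blocks the pipeline again.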
Golden Path Steps and Guardrails

| Step | Self-service action | Guardrail | Evidence |
|---|---|---|---|
| Build | Trigger pipeline via check-in action in source control | Code scan and unit test results | Build log, composition scan result |
| Promote to non-prod environment | Merge to staging branch, promotion request | Technical validation, regression test | Test results |
| Promote to prod | Promotion request | Approval and compliance check | Approval and audit trail |
| Rollback | Automated trigger or manual request | Post-rollback validation and regression test | Test results |

Step 5: Bake in Operability for Observability and Day-2 Readiness

IDPs reduce cognitive load and toil as solutions to common concerns are built in. This is especially true for the operational concerns. Each workflow and self-service feature creates the log files and traces for auditability. All code and configuration are driven from version control, and the metrics recorded provide insights into the outcomes and performance of the IDP.

New operational initiatives, like introducing a software bill of materials, can be rolled out across all technologies that use the IDP. When done correctly, templates can be updated centrally, and the log files provide full auditability to identify where old versions are still in use, reducing the overall security exposure. The IDP governance model needs to define the ownership of templates and any inheritance rules. For instance, some teams will tailor the template by adding additional steps required for their technology.

Alongside the IDP instrumentation, standard dashboards and alert definitions ship with the template, pre-wired to the appropriate ownership group. Who responds to what is documented, not assumed. Runbooks and escalation paths are stored in version control alongside the service itself so they evolve with the system rather than rotting in a forgotten wiki page.
Our backend service will include the following with the golden path: Logs, metrics, and tracesAlertsRunbook linkOwnership metadata The final piece is the feedback loop. Incidents, near-misses, and recurring friction points are resolved and also used to help continuously improve the platform, first becoming a backlog item. Step 6: Add Guardrails and Governance Without Slowing Delivery The IDP should leverage approved templates where possible and embed basic compliance and policy checks in the workflows. Platform developers will receive immediate feedback on any problems they need to fix. When issue resolution requires a longer time, time-bound exceptions can be allowed. Along the environment path from development to production, the quality gates should become more restrictive as the software quality improves. For our backend service, we define security scanning prior to deployments, and we don’t accept any deviations from the corporate standard for it. We follow a simple block, warn, escalate paradigm. The goal is to address problems that teams can deal with immediately and provide enough time for more complex work. This balance allows work to flow at pace. It is important to version templates and workflows so you can track what is in use. When significant problems are identified with a version, you can use the IDP logs to find any items in use and replace them quickly. Having the right guardrails in place might feel restrictive but in fact reduces the amount of rework over time as there are fewer incidents. Fast feedback reduces the time it takes to resolve problems. Step 7: Measure Adoption, DevEx, and Platform ROI One of the key success factors for IDPs is having the ability to measure adoption (covered earlier), developer experience, and platform ROI (e.g., DORA, SPACE). This allows you to break down and distinguish between adoption measures and outcome metrics. Implementing these criteria in the platform from the beginning captures data systematically. 
Good adoption measures to start with: number of executed workflows, number and currency of templates, and number of active users. The following outcome metrics can also be used as part of the business case for IDPs: deployment failure rate, MTTR, incident volumes, number of tickets, and security vulnerabilities. The team managing the IDP should actively use the metrics together with captured feedback from the user base (e.g., feature requests) to prioritize the backlog. Executive dashboards should be implemented to provide accountability and increase support across the organization.

A Minimal IDP You Can Scale

Bringing it together, take the following actions to kick-start your internal developer platform:
- Choose a common and not too complex workflow for your first golden path
- Create the code repository and CI/CD pipeline
- Define a self-service UI for the workflow
- Embed quality gates, metrics, and operational tooling into the workflow

Start with one workflow for one pilot team, prove the path, then extend to the next workflow or team. Don’t forget to engage with the pilot users to receive feedback and support adoption. If you want to dive deeper, explore the CNCF Platforms for Cloud-Native Computing whitepaper and Platform Engineering Maturity Model.

This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

By Mirco Hering
LLM-Powered Deep Parsing for Industrial Inventory Search

Industrial ERPs often look structured on the surface: item IDs, purchase orders, stock levels. But in many companies, they are overloaded with unintentional duplicates because the most important information is buried inside an unstructured description field. In heavy industry, descriptions are entered manually with little guidance or validation, so the same part shows up many times under different names. A single component might live three separate lives in the system: “Bosch Pump”“Pump, Bosch, 1500w”“Pmp Bsch hydraulic” A domain expert would usually conclude that, with very high probability, these items refer to the same piece of equipment. To an ERP, these items are three different products. Multiply this by millions of line items, multiple sites, multiple languages, and years of legacy records, and you get a familiar outcome: data exists, but it is often too inconsistent to use efficiently. In practice, this impacts two workflows that most teams care about: Search: Teams can’t reliably find what is already in storage, especially when they don’t know the exact naming used in the ERP.Deduplication: Procurement keeps purchasing “new” parts that already exist, and dead stock accumulates over time. A common goal is to make descriptions usable for downstream automation by extracting a homogeneous, decision-ready structure from raw text. This is where LLM-powered deep parsing can be useful. LangChain is a practical framework for implementing this as a repeatable, testable pipeline rather than a collection of ad-hoc prompts. Traditional Approaches and Where They Tend to Break Full String Matching Works when descriptions are clean and strictly validated. In real operations, that level of discipline is not common, so full matching is usually a starting point rather than a long-term solution. Rules and Regex Safe, deterministic, explainable. Also fragile. Descriptions differ by plant, vendor, technician, language, and era. 
As the scope expands to new item categories, maintaining rules often becomes a major effort, and coverage gaps show up quickly. Approximate String Matching Useful for typos. Less useful for meaning. It won’t reliably capture that “hydraulic” and “liquid pressure” are related contexts, and it often struggles with short abbreviations that carry important signals. Selecting and tuning fuzzy matching across many categories is usually a constant trade-off. Semantic Search Semantic search converts a query and each item into embeddings and retrieves nearest neighbors by vector distance in a similar way to how distance is calculated between two locations on a map. This is helpful in messy industrial text because it improves recall even when wording differs. But industrial data is a place where small textual differences can imply big functional differences. A wrench 11mm and a wrench 21mm may sit close in semantic space because they share most context (“wrench”, tool category, usage), yet they are not interchangeable. For this reason, semantic search alone is usually not sufficient for reliable deduplication or for a strict “find exact part” search. It can help find candidates, but you typically still need a precision layer based on extracted attributes. What Deep Parsing Changes in Practice Efficient industrial search is not only about typos and semantics. Most of the time, it requires extracting the characteristics that drive interchangeability: Manufacturer/brandCategory (bolt, bearing, motor, pump)Key specs (dimensions, voltage, pressure, thread type, strength class)Normalized unitsStandards and constraints relevant to the business LLMs can help because they can follow high-level instructions like “extract manufacturer and relevant specs” while handling inconsistent phrasing, abbreviations, and word order. The catch is that LLMs are not “born” with your domain knowledge. 
They may not know internal abbreviations, preferred naming conventions, or which standard matters for a specific category. Fine-tuning a large model is often infeasible and frequently not the most cost-effective path. A practical approach is to provide domain context at runtime (RAG) and enforce structured outputs with validation.

Parsing and Processing Pipeline

Raw Item Descriptions + Metadata

Input is usually not just a free-text description. Metadata often matters for correct interpretation and downstream decisions:
- Item category (if available)
- Vendor or manufacturer field (often partially filled)
- Unit of measure, procurement group, material group

This metadata is not always reliable, but even imperfect signals can help with routing and validation.

Schema Generation

Industrial catalogs do not share one schema. A bolt, a bearing, and an electric motor have different attributes. Trying to force everything into one universal schema typically produces either a huge, sparse structure or a generic output that is not very usable. A practical alternative is to generate a category-aware schema:
- Choose or infer the category (via metadata or a lightweight LLM classification)
- Let the LLM generate a category-specific template that lists which fields matter for this category. RAG can help here by supplying instructions and guidelines that make this crucial step more specific.
- Add attribute dictionaries via RAG (allowed values, synonyms, common aliases, unit conventions)

At this point, your output is a Pydantic schema generated by the LLM with the help of internal company knowledge.

LLM Parsing

Once it is known what needs to be extracted, the parsing step focuses on turning raw text into a typed record, keeping the output consistent across items in the same category thanks to the schemas generated earlier. Two practical points tend to matter:
- Abbreviations and internal naming conventions are often the hardest part of the description. RAG can help by retrieving the relevant glossary entries.
- Standards and additional instructions help the model disambiguate what a term means in context.

The output should be structured and typed (numbers as numbers, units normalized, enums constrained), not a free-form explanation. For example, LangChain handles that via structured output internally.

Validation

Parsing alone is not enough for operational use. You usually need validation and a decision layer, especially if the output drives deduplication. In many deployments, the match decision is not “LLM decides duplicates.” It is “LLM extracts structure, rules decide,” with an optional LLM contribution for borderline cases or for generating human-readable rationales for review queues.

Parsed Output

The final output is a normalized JSON-like record you can store alongside the original description. It can also include confidence indicators and validation flags. This parsed layer becomes useful across multiple workflows:
- Structured filters and faceted search (category, manufacturer, normalized specs)
- Better deduplication at ingestion time (flag likely duplicates before creating a new item)
- Reporting and governance (how much of the catalog has missing critical attributes)
- Downstream inventory optimization, where “same part” needs to be defined consistently

Common Failure Points

Hallucinations

LLMs can invent values, especially when inputs are thin or ambiguous. In practice, this is usually handled by enforcing strict schemas, grounding the model with retrieved domain context, running deterministic validators for units and plausible ranges, and routing low-confidence outputs to human review instead of letting the pipeline silently guess.

Data Quality Threshold

If the description is something like “Bearing 10”, there may not be enough signal to extract a meaningful structure. The practical approach here is to treat deep parsing as decision support rather than a guaranteed reconstruction of missing data.
It helps to store uncertainty explicitly (for example, leaving required fields empty with a reason) and route such items into enrichment workflows using supplier documentation, historical purchase orders, or technician prompts. Summary For industrial inventory, deep parsing is most effective when treated as a structured extraction pipeline: raw descriptions and metadata are converted into a category-aware schema, parsed by an LLM, validated with rules, and written out as normalized structured data. RAG supports each stage with the right knowledge (templates and dictionaries for schema generation, abbreviations and standards for parsing, tolerances and edge cases for validation). LangChain is a practical way to orchestrate this end-to-end with enforced output contracts, retrieval where it matters, and enough testing and tracing to operate it in production, resulting in a structured layer that typically improves search and enables scalable deduplication without an ERP rewrite.
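The "LLM extracts structure, rules decide" split can be illustrated with a small deterministic matcher over already-parsed records. The sketch below is in plain Java purely for illustration (the article's own pipeline is described in terms of LangChain and Pydantic); ParsedItem, the spec key "size_mm", and the manufacturer value are hypothetical. It shows why two wrenches that sit close in embedding space are still kept apart once the size attribute is extracted.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical rule layer that runs after LLM parsing; it never calls a model.
final class DedupRuleSketch {
    // Parsed output for one item; specs hold normalized values such as "size_mm" -> "11".
    record ParsedItem(String category, String manufacturer, Map<String, String> specs) {}

    // Deterministic rule: same category and manufacturer, and every critical spec present and equal.
    static boolean likelyDuplicate(ParsedItem a, ParsedItem b, Set<String> criticalSpecs) {
        if (!a.category().equals(b.category()) || !a.manufacturer().equals(b.manufacturer())) {
            return false;
        }
        for (String key : criticalSpecs) {
            String left = a.specs().get(key);
            String right = b.specs().get(key);
            // A missing critical value is treated as unknown, never as a match.
            if (left == null || right == null || !left.equals(right)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ParsedItem w11 = new ParsedItem("wrench", "ACME", Map.of("size_mm", "11"));
        ParsedItem w21 = new ParsedItem("wrench", "ACME", Map.of("size_mm", "21"));
        System.out.println(likelyDuplicate(w11, w21, Set.of("size_mm")));
    }
}
```

Treating a missing critical attribute as "not a match" is one conservative choice; such pairs would go to the review or enrichment queue rather than being merged automatically.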

By Andrey Chubin
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka

After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. 
Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. 
OpenAPI Contract Example

Here's a production OpenAPI contract for an order management API:

YAML
openapi: 3.2.0
info:
  title: Orders API
  version: 1.0.0
  description: Contract-first REST API for order management
paths:
  /v1/orders:
    post:
      operationId: createOrder
      summary: Create a new order
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateOrderRequest'
      responses:
        '201':
          description: Order created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OrderResponse'
        '400':
          description: Validation error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '409':
          description: Idempotency conflict
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
  /v1/orders/{orderId}:
    get:
      operationId: getOrder
      parameters:
        - name: orderId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Order found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OrderResponse'
        '404':
          description: Order not found
components:
  schemas:
    CreateOrderRequest:
      type: object
      required: [customerId, items]
      properties:
        customerId:
          type: string
          example: CUST-123
        idempotencyKey:
          type: string
          description: Optional key for safe retries
        items:
          type: array
          minItems: 1
          items:
            $ref: '#/components/schemas/OrderItem'
    OrderItem:
      type: object
      required: [sku, quantity]
      properties:
        sku:
          type: string
          example: SKU-001
        quantity:
          type: integer
          minimum: 1
    OrderResponse:
      type: object
      required: [orderId, customerId, status, items, timestamp]
      properties:
        orderId:
          type: string
        customerId:
          type: string
        status:
          type: string
          enum: [CREATED, REJECTED]
        items:
          type: array
          items:
            $ref: '#/components/schemas/OrderItem'
        timestamp:
          type: string
          format: date-time
    ErrorResponse:
      type: object
      required: [code, message, traceId, timestamp]
      properties:
        code:
          type: string
          enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR]
        message:
          type: string
        traceId:
          type: string
        timestamp:
          type: string
          format: date-time

Implementing the Provider Side (Spring Boot)

The contract drives the implementation. Your Spring Boot controller implements what the contract specifies:

Java
@RestController
@RequestMapping("/v1/orders")
@RequiredArgsConstructor
@Slf4j
public class OrderController {

    private final OrderService orderService;

    /**
     * POST /v1/orders
     * Contract: contracts/openapi/orders-api.v1.yaml
     */
    @PostMapping
    public ResponseEntity<OrderResponse> createOrder(
            @Valid @RequestBody CreateOrderRequest request) {
        log.info("Creating order for customer: {}", request.customerId());
        OrderResponse response = orderService.createOrder(request);
        return ResponseEntity
                .status(HttpStatus.CREATED)
                .body(response);
    }

    /**
     * GET /v1/orders/{orderId}
     */
    @GetMapping("/{orderId}")
    public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) {
        OrderResponse response = orderService.getOrder(orderId)
                .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId));
        return ResponseEntity.ok(response);
    }
}

Critical implementation details:

- DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI spec
- HTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflicts
- Validation is enforced: @Valid annotation ensures request validation matches OpenAPI constraints
- Error responses are standardized: All errors return ErrorResponse with consistent structure

Consumer Parallel Development

Here's where contract-first shines. While your team implements the provider, the consumer team can:

- Generate a Java client from the OpenAPI spec using openapi-generator
- Run a mock server that returns valid responses based on the contract
- Write integration tests against the mock
- Switch to the real service when it's ready (no code changes needed)

The consumer doesn't wait for you to finish. They develop in parallel.
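To make the "tests against the mock" step concrete, here is a minimal sketch of a consumer-side check that a mock OrderResponse payload honors the contract. It is written in Python with only the standard library, independent of the Spring stack. The names ORDER_RESPONSE_CONTRACT and validate_order_response are hypothetical, and the rules are copied by hand from the OrderResponse and OrderItem schemas above rather than parsed from the YAML, which a real setup would more likely do with a schema validator.

```python
# Hypothetical, hand-copied subset of the OrderResponse schema from the contract.
ORDER_RESPONSE_CONTRACT = {
    "required": ["orderId", "customerId", "status", "items", "timestamp"],
    "status_enum": {"CREATED", "REJECTED"},
}


def validate_order_response(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the payload conforms."""
    errors = []
    for field in ORDER_RESPONSE_CONTRACT["required"]:
        if field not in payload:
            errors.append(f"missing required field: {field}")
    status = payload.get("status")
    if status is not None and status not in ORDER_RESPONSE_CONTRACT["status_enum"]:
        errors.append(f"invalid status: {status}")
    for i, item in enumerate(payload.get("items", [])):
        if "sku" not in item or "quantity" not in item:
            errors.append(f"items[{i}] must have sku and quantity")
        elif not isinstance(item["quantity"], int) or item["quantity"] < 1:
            errors.append(f"items[{i}].quantity must be an integer >= 1")
    return errors


# A payload the mock server might return (illustrative values).
mock_response = {
    "orderId": "ORD-1",
    "customerId": "CUST-123",
    "status": "CREATED",
    "items": [{"sku": "SKU-001", "quantity": 2}],
    "timestamp": "2024-01-01T00:00:00Z",
}
print(validate_order_response(mock_response))
```

Running the same check against the real service later gives the consumer team one assertion that holds for both the mock and the provider.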
Contract Type 2: Event Contracts With Kafka and Avro

Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated).

Topic Semantics Contract

Document the operational contract for each topic:

Markdown
## Topic: orders.order-created.v1
- **Purpose**: Emitted when an order is created successfully
- **Key**: orderId (partition affinity per order)
- **Delivery**: At-least-once (consumers must be idempotent)
- **Consumer requirement**: Deduplicate by eventId
- **Retry policy**: Consumer retries transient errors
- **DLQ**: orders.order-created.v1.dlq for poison messages
- **Compatibility**: Backward compatible schema evolution required

Avro Schema Contract

The Avro schema is your machine-validated contract:

JSON
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.acme.events",
  "doc": "Event emitted when an order is successfully created",
  "fields": [
    { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" },
    { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" },
    { "name": "orderId", "type": "string" },
    { "name": "customerId", "type": "string" },
    {
      "name": "source",
      "type": ["null", "string"],
      "default": null,
      "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility."
    },
    {
      "name": "items",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "OrderItem",
          "fields": [
            {"name": "sku", "type": "string"},
            {"name": "quantity", "type": "int"}
          ]
        }
      }
    }
  ]
}

Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events.
Kafka Producer Implementation

Java
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderEventPublisher {

    private final KafkaTemplate<String, Object> kafkaTemplate;

    public void publishOrderCreated(OrderCreated event) {
        String key = event.getOrderId();
        log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId());

        CompletableFuture<SendResult<String, Object>> future =
                kafkaTemplate.send("orders.order-created.v1", key, event);

        future.whenComplete((result, ex) -> {
            if (ex == null) {
                log.info("Published OrderCreated: orderId={}, partition={}, offset={}",
                        key,
                        result.getRecordMetadata().partition(),
                        result.getRecordMetadata().offset());
            } else {
                log.error("Failed to publish OrderCreated: orderId={}", key, ex);
            }
        });
    }
}

Production concerns addressed:

- Key-based partitioning: Using orderId as the key ensures all events for the same order go to the same partition, maintaining ordering
- Async publishing with callbacks: Non-blocking publish with explicit success/failure handling
- Structured logging: Captures partition and offset for troubleshooting

Kafka Consumer With Idempotency

At-least-once delivery means duplicates are possible. Consumers must deduplicate:

Java
@KafkaListener(topics = "orders.order-created.v1", groupId = "billing-service")
public void onOrderCreated(OrderCreated event) {
    // Check if already processed
    if (processedEventsRepository.existsByEventId(event.getEventId())) {
        log.debug("Skipping duplicate event: {}", event.getEventId());
        return;
    }

    // Process event
    billingService.createInvoice(
            event.getOrderId(),
            event.getCustomerId(),
            event.getItems()
    );

    // Mark as processed
    processedEventsRepository.save(
            new ProcessedEvent(event.getEventId(), Instant.now())
    );
}

Idempotency pattern: Check eventId before processing, store it after processing. If the same event arrives twice, the second one is ignored. Because the check and the save are separate steps, run them in the same transaction as the invoice write (or back eventId with a unique constraint) so that a crash between processing and marking cannot produce a second invoice on redelivery.
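The check-then-mark pattern is easy to exercise without Kafka. The sketch below is a stdlib-only Python illustration, not the Spring listener above: InvoiceConsumer is a hypothetical class, and an in-memory set stands in for the processed-events repository, so it shows the deduplication logic only, not durability or transactions.

```python
class InvoiceConsumer:
    """Toy consumer that deduplicates by eventId (in-memory, illustration only)."""

    def __init__(self):
        self.processed_event_ids = set()  # stand-in for the processed-events table
        self.invoices = []

    def on_order_created(self, event: dict) -> bool:
        """Handle one delivery; return True if processed, False if skipped as a duplicate."""
        if event["eventId"] in self.processed_event_ids:
            return False
        self.invoices.append({"orderId": event["orderId"], "customerId": event["customerId"]})
        self.processed_event_ids.add(event["eventId"])
        return True


consumer = InvoiceConsumer()
event = {"eventId": "E-1", "orderId": "ORD-1", "customerId": "CUST-123"}

first = consumer.on_order_created(event)
second = consumer.on_order_created(event)  # simulated redelivery of the same event
print(first, second, len(consumer.invoices))
```

A redelivered event leaves the invoice list unchanged, which is the property the consumer contract ("Deduplicate by eventId") asks for.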
After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity.

Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations.

In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas.

The Problem: Why Traditional Integration Fails at Scale

When systems integrate without contracts, you hit three major problems:

1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions.
2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against.
3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth.
Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation.

What Contract-First Actually Means

Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification.

A complete contract includes:

- Operations: Endpoints (REST), topics (Kafka), or tables (database)
- Data shapes: Request/response DTOs, event schemas, column definitions
- Validation rules: Required fields, constraints, data types
- Error model: HTTP status codes, error payloads, dead-letter queues
- Non-functional rules: Idempotency, retries, compatibility policies, SLAs

Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment.

Contract Type 1: REST API Contracts With OpenAPI

OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source.
Architecture Diagram: Contract-First Flow

Figure 1: Contract-first development flow showing parallel team development

Note: Solid arrows represent generation or implementation flow. Dashed arrows represent validation relationships.

Contract Type 3: Database Contracts With Flyway

Database schemas are contracts, too. Flyway migrations let you version and evolve schemas with the same contract-first approach.
Base Schema Migration

File: V1__create_orders.sql

SQL
CREATE TABLE orders (
    id VARCHAR(32) PRIMARY KEY,
    customer_id VARCHAR(32) NOT NULL,
    status VARCHAR(16) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    order_id VARCHAR(32) NOT NULL REFERENCES orders(id),
    sku VARCHAR(64) NOT NULL,
    quantity INT NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, sku)
);

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Schema Evolution: Expand/Migrate/Contract

When you need to add a field without breaking old code, use the expand/migrate/contract pattern:

Expand (V2__add_order_source.sql):

SQL
ALTER TABLE orders ADD COLUMN source VARCHAR(32);

Migrate (application code):

SQL
-- Backfill for existing rows
UPDATE orders SET source = 'UNKNOWN' WHERE source IS NULL;

Contract (V3__enforce_source_not_null.sql):

SQL
-- After all systems produce source
ALTER TABLE orders ALTER COLUMN source SET NOT NULL;

This three-phase approach lets you deploy schema changes without downtime.

CI/CD Enforcement: Making Contracts Real

Contracts only work if they're enforced.
Here are the CI gates that prevent breaking changes:

REST API: OpenAPI Breaking Change Detection

Run openapi-diff in CI to catch breaking changes:

YAML
# .github/workflows/api-contract-check.yml
- name: Check for API breaking changes
  run: |
    npx openapi-diff \
      main:contracts/openapi/orders-api.v1.yaml \
      HEAD:contracts/openapi/orders-api.v1.yaml \
      --fail-on-breaking

This fails the build if you:

- Remove required fields
- Change field types
- Remove endpoints
- Change HTTP status codes

Kafka: Schema Registry Compatibility Check

Point producers at Schema Registry so that every new schema version is checked against the subject's compatibility mode (BACKWARD is the registry default):

Properties files
# Schema Registry configuration
confluent.schema.registry.url=http://schema-registry:8081
spring.kafka.properties.schema.registry.url=http://schema-registry:8081
# Register schemas on publish; the registry applies the subject's compatibility mode
spring.kafka.producer.properties.auto.register.schemas=true
spring.kafka.producer.properties.use.latest.version=true

When you try to publish an incompatible schema, the registry rejects the registration and the producer's send fails, so the problem surfaces in a test or staging environment before reaching production.

Database: Flyway Validation

Flyway validates migrations on startup:

Properties files
spring.flyway.validate-on-migrate=true
spring.flyway.baseline-on-migrate=false

If someone edits a migration that has already been applied, its checksum no longer matches the one recorded in the schema history table, so Flyway detects the mismatch and fails the deployment. Changes made by hand directly in the database are outside what this check sees, so route every schema change through a migration.
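The idea behind the Flyway gate can be sketched as a checksum comparison between migrations already applied and the files in the repository. This is a simplified Python model and not Flyway's implementation (its exact checksum computation and history table layout differ); checksum, validate, and the in-memory dictionaries are hypothetical stand-ins.

```python
import zlib


def checksum(sql: str) -> int:
    """CRC32 of the migration text (simplified stand-in for a migration checksum)."""
    return zlib.crc32(sql.encode("utf-8"))


def validate(applied: dict[str, int], local: dict[str, str]) -> list[str]:
    """Compare recorded checksums of applied migrations with the local files."""
    problems = []
    for version, recorded in applied.items():
        if version not in local:
            problems.append(f"{version}: applied but missing locally")
        elif checksum(local[version]) != recorded:
            problems.append(f"{version}: checksum mismatch")
    return problems


# Local migration files, keyed by version (contents abbreviated for illustration).
local_files = {
    "V1": "CREATE TABLE orders (id VARCHAR(32) PRIMARY KEY);",
    "V2": "ALTER TABLE orders ADD COLUMN source VARCHAR(32);",
}
# History recorded when the migrations were first applied.
history = {version: checksum(sql) for version, sql in local_files.items()}

print(validate(history, local_files))  # nothing changed since the migrations ran

# Someone edits an already-applied migration instead of adding V3.
local_files["V1"] = "CREATE TABLE orders (id VARCHAR(64) PRIMARY KEY);"
print(validate(history, local_files))
```

The second call reports V1, which is the kind of mismatch that should stop a deployment and prompt a new migration instead of an edit.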
Integration Flow Diagram

Figure 2: End-to-end flow showing REST → DB → Kafka integration with contracts enforced at each boundary

Real-World Benefits: Production Implementation Experience

After implementing contract-first across multiple services (orders, billing, inventory), teams consistently observe significant improvements:

Integration quality:

- Substantial reduction in integration bugs reaching production
- Most remaining bugs are business logic issues, not contract mismatches
- Clear contracts eliminate ambiguity in integration expectations

Development velocity:

- Integration cycles measured in weeks rather than months
- Consumer teams start integration work immediately instead of waiting for provider completion
- Mock servers enable realistic integration testing without coordination delays

Operational reliability:

- CI-enforced contract validation prevents breaking changes from reaching production
- Breaking changes caught during PR review, not after deployment
- Automated validation ensures compatibility before merge

Documentation accuracy:

- Documentation generated from OpenAPI specs stays synchronized with implementation by design
- Swagger UI always reflects actual API behavior as both derive from the same contract
- No manual documentation maintenance required

Common Pitfalls and How to Avoid Them

Pitfall 1: Treating Contracts as Documentation

- Wrong approach: Write code first, generate OpenAPI from annotations later.
- Right approach: Write OpenAPI first, then generate server stubs or validate the implementation against it.

Pitfall 2: Skipping CI Validation

- Wrong approach: Trust developers to manually check compatibility.
- Right approach: Automate breaking change detection in CI. Make it a required check.

Pitfall 3: No Schema Evolution Strategy

- Wrong approach: Add fields without defaults, breaking old consumers.
- Right approach: All new fields must be optional or have defaults. Test with multiple schema versions.
Pitfall 4: Ignoring Idempotency

- Wrong approach: Assume exactly-once delivery in Kafka.
- Right approach: Design consumers to deduplicate by eventId. Assume at-least-once.

Pitfall 5: Coupling Contracts to Implementation

- Wrong approach: Expose internal database schema directly in API contracts.
- Right approach: Design contracts for consumers, not internal convenience. Use DTOs that map between the external contract and the internal model.

When Contract-First Makes Sense

Contract-first isn't always the answer. Use it when:

- ✅ Multiple teams integrate: The coordination cost justifies the upfront contract design effort
- ✅ Public APIs or partner integrations: External consumers need stability and clear documentation
- ✅ Microservices architecture: Services must evolve independently without breaking dependents
- ✅ High change frequency: CI validation catches breaking changes early when changes are frequent

Skip contract-first when:

- ❌ Single small team: Coordination overhead is low, so formal contracts add friction
- ❌ Prototyping: You're exploring the problem space and expect major pivots
- ❌ Internal tools with one consumer: The provider and consumer are maintained by the same person

Key Takeaways

- Contracts enable parallel development: Provider and consumer teams work simultaneously instead of serially
- CI validation prevents breaking changes: Automate OpenAPI diffs, schema compatibility checks, and migration validation
- Idempotency is not optional: At-least-once delivery in Kafka requires consumers to deduplicate by event ID
- Schema evolution requires backward compatibility: Use nullable fields with defaults to support gradual rollouts
- Contracts are design specifications, not afterthoughts: Define contracts first, generate code from them

What's Next?
If you're implementing contract-first integration:

- Start with one integration: pick your most painful cross-team dependency
- Write the OpenAPI spec or Avro schema first, before any implementation
- Set up CI validation for that contract (openapi-diff or Schema Registry)
- Measure reduction in integration bugs and coordination overhead
- Expand to other integrations once you've proven the pattern

Contract-first isn't a silver bullet, but it transforms integration from a coordination nightmare into a governance problem that CI can solve. When you're coordinating three teams across two time zones, that shift makes the difference between shipping in weeks versus months.

Full source code: github.com/wallaceespindola/contract-first-integrations

Related reading:

- OpenAPI 3.0 Specification: spec.openapis.org
- Confluent Schema Registry: docs.confluent.io/platform/current/schema-registry
- Flyway Documentation: flywaydb.org/documentation
- Mastering Contract-First API Development: moesif.com/blog
- Research on Microservices Issues: arxiv.org

Need more tech insights? Check out my GitHub, LinkedIn, and Speaker Deck. Happy coding!

By Wallace Espindola
Chaos Engineering Has a Blind Spot. Agentic AI Lives in It.

Your chaos experiments passed. Your RAG pipeline is lying to you anyway.

I've watched this play out more times than I'd like to admit. A team runs a thorough chaos suite, including pod failures, network partitions, and database failovers. Everything recovers cleanly. Dashboards stay green. The team ships with confidence. Three weeks later, a support ticket surfaces. Then ten more. The AI is producing answers that are fluent, confident, and factually wrong. No alert fired. No SLO breached. The infrastructure never blinked.

This isn't a monitoring gap you close with a better dashboard. It's a category error in how we've defined resilience for AI systems, and until you see that distinction clearly, every chaos experiment you run is measuring the wrong thing.

The Assumption That's Been Quietly Wrong

For fifteen years, chaos engineering has operated on one core premise: the system's meaningful state is its operational state. Is it up? Does it recover? Can it handle a node failure at 2 AM? For systems built around databases, queues, and network hops, these are exactly the right questions. The entire discipline of Chaos Monkey, Gremlin, LitmusChaos, and AWS FIS was built to answer them.

Agentic AI systems break this premise at the foundation. They're not distributed systems in the traditional sense. They're reasoning systems. And reasoning systems have two states you need to care about simultaneously:

| State dimension | Traditional distributed system | Agentic AI system |
|---|---|---|
| What "healthy" means | Service is up, latency within SLA | Outputs remain grounded in source truth |
| How failure manifests | 5xx errors, timeouts, crashes | Silent drift, confident wrong answers |
| Time to detect | Seconds to minutes | Days to weeks, if ever |
| Failure unit | Request or service | Behavior over time |
| Circuit breaker analogy | Trips on error rate | No native equivalent |
| What chaos tests | Infrastructure recovery | ✗ Cannot test behavioral integrity |

That last row is the entire problem.
As Marc Bishop, Director of Business Growth at Wytlabs, put it after his team's retrieval embeddings drifted silently under catalog updates: "Resilience for AI means validating behavior under stress, not merely surviving it." I hold U.S. Patent 12242370B2 for intent-based chaos engineering, a framework that treats intent preservation, not just infrastructure recovery, as the core testable property of a resilient system. When I developed that framework, the failure mode I was targeting was a multi-domain infrastructure losing semantic coherence under adversarial conditions. I didn't fully anticipate how precisely that same problem would show up in production LLM pipelines and how fast. What's Actually Breaking: Five Failure Modes Nobody Has Named Yet You can't test for something you haven't named. The existing chaos engineering literature has no vocabulary for AI behavioral failure. Here's a working taxonomy from production accounts across 25+ engineering teams: 1. Retrieval Drift The vector retrieval layer silently shifts toward faster, lower-precision matches after a failure event. Outputs remain structurally valid but are grounded in the wrong documents. Rafael Sarim Oezdemir, Head of Growth at EZContacts, ran chaos injection on their RAG-based customer support chatbot. His infrastructure numbers post-chaos looked perfect: 99.99% uptime, clean latency recovery, green across the board. Three days later, the chatbot was answering return policy questions incorrectly in 7% of cases. Root cause: "Our chatbot started answering return policy questions incorrectly. We diagnosed the root cause as a subtle shift in retrieval precision; our pipeline was favoring quicker, less precise vector matches post-chaos. Infrastructure recovered. The behavior of the model didn't." No existing chaos tool measures retrieval precision. That's the gap. 2. 
Context Amnesia Each individual component in a multi-agent pipeline appears healthy, but the end-to-end reasoning chain becomes incoherent across hops. Luis Haberlin at CallSetter AI watched this unfold in a voice agent for an insurance brokerage: "The infrastructure was bulletproof... but often into production, agents started hearing 'I already told the robot about my home and auto' from confused callers." The agent correctly retrieved policy details early in a conversation, then lost context at the 90-second mark and restarted the needs assessment from scratch. Nothing crashed. The reasoning rotted at the handoff boundary. Jacob Kalvo, CEO of Live Proxies, hit the same wall in a market analytics pipeline: "While each summary was technically provided on schedule, there were small errors beginning to creep into the output, specific market signals being under-represented, inconsistencies developing in the logic chain, and some outputs making confident assertions regarding incorrect or misleading information." Every infrastructure check passed. The reasoning chain had silently decohered. 3. Confidence-Accuracy Decoupling The model produces high-confidence, well-formatted outputs even as accuracy degrades. The system sounds more certain as it becomes less reliable. Jayanand Sagar, COO at Hyperbola Network, saw this after a partial node recovery rebuilt the retrieval index from a stale snapshot. Output quality deteriorated over 11 days, undetected: "The model never complained. The closer the degraded output was to the original, the more convincingly it generated confident-sounding responses based on outdated context." Confidence scores are not accuracy proxies. A model grounded in a degraded context will confidently state incorrect information. No infrastructure metric tells you this is happening. 4. Intent Drift Outputs gradually decohere from the original business intent without any single triggering event. 
Behavior changes incrementally, across dozens of interactions, with no failure timestamp to anchor an investigation. Tyler Denk, CEO of beehiiv, described a system that passed every load and failure scenario correctly in testing, then shifted over longer production cycles: "The structure of responses remained intact, but subtle inconsistencies in reasoning and formatting started appearing across different workflows. Without a defined behavioral baseline, it became impossible to determine when the system had actually started drifting." 5. Epistemic Failure The model's picture of the world becomes stale or wrong, but all reasoning over that picture continues to function correctly. The system is reasoning well, about incorrect premises. Nicolas, founder of Reddinbox, runs a production AI pipeline classifying Reddit posts in real time across thousands of threads daily. "A few months back, everything looked fine. No downtime, no errors, latency normal. But output quality had quietly decayed." Reddit's content distribution had shifted, flooded with AI-generated posts that were structurally coherent but semantically hollow, and his classifier kept returning high-confidence scores on them. His diagnosis is the sharpest framing I've seen for why infrastructure chaos is blind to this failure class: "No chaos experiment would have caught that because the failure wasn't infrastructure, it was epistemic. We had zero observability on input distribution drift. We were watching the system, not what the system was consuming." Why Agentic Pipelines Make Every One of These Worse A single degraded LLM component is a tractable problem. A multi-agent pipeline turns it into something that actively resists detection. In a traditional microservice, a degraded component returns an error, trips a circuit breaker, and gets isolated. In a multi-agent pipeline, a degraded reasoning component returns a confident output that propagates forward, amplifying the failure rather than surfacing it. 
Dario Ferrari, co-founder of OpenClawVPS, watched this play out firsthand when a client's RAG-based customer support system passed all infrastructure tests but then silently shifted retrieval behavior after a network partition: "AI infrastructure that survives every test but provides incorrect answers is still resilient but fails its job badly." The blast radius of an undetected reasoning failure grows with every agent hop. By the time users notice, it has compounded through multiple layers of stored state. The Missing Layer: Behavioral Assertions Brandy Hastings, SEO Strategist at SmartSites, described the realization her team came to after AI-assisted workflows passed every infrastructure check but degraded in production: "We realized our testing didn't account for output quality over time. We were validating uptime, not alignment." That gap between uptime and alignment is where every one of the five failure modes above lives. Most teams have three layers of observability, and only two of them are working: Layer 2 is where all the interesting failures live, and it's completely absent from most production stacks. Building it requires three things your current chaos practice almost certainly lacks: Behavioral contracts – not "returns a 200 response" but "returns a response with retrieval precision above threshold X when operating on a degraded index." These are the AI equivalent of SLOs, except the metric is semantic rather than operational. Intent-preserving chaos experiments – injecting failures at the data layer, retrieval layer, and reasoning layer, not just infrastructure. Each experiment needs an exit criterion that includes behavioral scoring against a fixed ground-truth set, not just recovery metrics. Post-chaos behavioral scoring – sampling outputs after every chaos run and scoring them against a baseline. 
Jayanand Sagar put a concrete benchmark on the minimum viable version: "An exponential run of chaos should pass behavioral standards to be within 3 to 5 percent of baseline scores of at least 50 sampled outputs before a system is declared stable." Jake Waldrop, Co-Founder of Recademics – a regulated outdoor safety certification platform, independently arrived at this same framing: "Semantic monitoring fills the gap between AI health and user safety by verifying what the AI is saying. My most significant change was to run adversarial prompts on standard stress tests to understand whether the model logic would collapse. Chaos engineering will have a colossal safety advantage when behavioral checks are integrated into any company operating within highly regulated industries." Oksana Fando, CDO at Truck1.eu, reached the same conclusion after equipment descriptions on their European vehicle marketplace gradually became less accurate following a data source degradation and a failure invisible to every standard metric: "We began testing the system's intent, checking whether business logic remains correct even with partial data loss." Testing system intent. That's exactly the property my patent formalizes. The fact that teams in healthcare, fintech, edtech, and European e-commerce are all independently converging on this is no coincidence. It's a structural gap making itself known. A Behavioral Observer You Can Drop In This Week The pattern is a sampling observer sitting in your serving layer. Replace _score() with RAGAS faithfulness, embedding cosine similarity, or an LLM-as-judge evaluator, depending on your quality rubric. The heuristic below is a working default: groundedness (how much of the response is anchored in retrieved docs) minus a penalty for hedging language that signals confidence erosion. 
Python

import random

class BehavioralObserver:
    def __init__(self, sample_rate=0.05, drift_threshold=0.15, baseline_size=50):
        self.sample_rate = sample_rate
        self.drift_threshold = drift_threshold
        self.baseline_size = baseline_size
        self.scores = []
        self.baseline = None

    def observe(self, prompt, response, context):
        # Only score a random sample of traffic
        if random.random() > self.sample_rate:
            return
        score = self._score(response, context)
        if self.baseline is None:
            # Phase 1: build baseline
            self.scores.append(score)
            if len(self.scores) >= self.baseline_size:
                self.baseline = sum(self.scores) / len(self.scores)
            return
        # Phase 2: detect drift
        drift = self.baseline - score
        if drift > self.drift_threshold:
            print(f"[DRIFT ALERT] score={score:.3f} baseline={self.baseline:.3f} drift={drift:.3f}")
            # pagerduty.trigger(...) or datadog.metric("ai.behavioral.drift", drift)

    def _score(self, response, context):
        # Groundedness: share of response terms that also appear in the retrieved docs
        doc_words = set(" ".join(context.get("retrieved_docs", [])).lower().split())
        terms = response.lower().split()
        groundedness = len([t for t in terms if t in doc_words]) / max(len(terms), 1)
        # Penalty: 0.05 for each hedging phrase present in the response
        hedges = ["i think", "not sure", "might be", "possibly"]
        return max(groundedness - sum(0.05 for h in hedges if h in response.lower()), 0.0)

# Drop in:
observer = BehavioralObserver()

def serve(prompt, context):
    response = your_llm_call(prompt, context)  # placeholder for your own model call
    observer.observe(prompt, response, context)
    return response

Two things worth knowing. The 5% sample rate catches degradation without adding latency; at high traffic, even 1% gives you a statistically robust signal. The baseline lock after 50 samples is deliberate: running behavioral chaos against an unlocked baseline is like running load tests before you've measured normal traffic.

5 Behavioral Chaos Experiments to Run After Your Next Infrastructure Suite

These aren't replacements for your existing chaos experiments. They're additive: run them after your infrastructure suite, with behavioral scoring as the exit criterion rather than uptime recovery.
| Experiment | What you inject | What it tests | Exit criterion |
| --- | --- | --- | --- |
| Stale embedding injection | Replace embeddings with a 14-day-old snapshot | Retrieval precision under stale index | Score within 5% of baseline across 50 sampled prompts |
| Partial index degradation | Remove 30% of documents from the vector store | Graceful degradation in retrieval recall | Hallucination rate stays flat vs. baseline |
| Context window truncation | Truncate retrieved context to 40% of normal | Reasoning quality under a constrained context | Groundedness score stays above threshold |
| Agent handoff latency injection | Add 800ms delay between agent hops | Multi-agent coherence under degraded comms | End-to-end intent preserved across all hops |
| Memory poisoning simulation | Inject one factually wrong document into the retrieval store | RAG faithfulness under adversarial data | The system identifies or flags the conflicting document |

Define the exit criterion before you inject the failure. That's the same discipline your infrastructure chaos practice demands for SLO-based rollback conditions; it applies here too.

What the Field Is Actually Saying

Vitaly Yago, CEO of PhotoGov, described the shift his team made after hitting this wall in production: "We began implementing chaos for behavior, not just for infrastructure. Instead of testing whether the system will recover, we test whether the quality of decisions is maintained under noise, data changes, and successive updates."

John Russo, VP of Healthcare Technology Solutions at OSP Labs, came to the same realization after behavioral degradation appeared in a clinical AI workflow that had passed every infrastructure check: "It is no longer just about systems staying up, it is about systems staying correct under stress."

Two engineers, two completely different industries, same conclusion. The field has moved on from the question of whether AI systems survive failure. The question it's now wrestling with, without a good answer yet at scale, is whether they reason correctly after failure.
The chaos engineering discipline has fifteen years of hard-won tooling for testing the first question. It has almost nothing for the second. That's not a criticism of the existing tools. It's a signal that the discipline needs to grow a second layer. The practitioners whose experiences shaped this article are already building it in production, because the failures forced them to. The only question for your team is whether you discover your agentic system's behavioral limits through a chaos experiment you designed, or a production incident you didn't see coming.

The Short Version: Three Things to Add Before Your Next Chaos Run

  • Lock a behavioral baseline first. Sample 50–100 representative inputs and store expected outputs before injecting any failure. Your chaos experiments now have a behavioral exit criterion, not just infrastructure recovery metrics.
  • Make retrieval precision a first-class signal. The most common failure vector across the teams I spoke with was RAG degradation invisible to standard monitoring. Retrieval precision scoring belongs alongside latency and error rate on your dashboards.
  • Log reasoning chains, not just outputs. For multi-agent pipelines, log the reasoning path each agent used to produce its output. When that structure changes without a deployment event triggering it, that's your behavioral alert, the equivalent of a latency spike, but for the quality of reasoning.
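The first item above, a locked baseline with a behavioral exit criterion, can be sketched in a few lines. Treat this as a minimal illustration rather than a reference implementation: the function name `passes_exit_criterion`, the 5% relative tolerance, and the 50-sample minimum are assumptions borrowed from the benchmark quoted earlier in the article, and the scores are made-up numbers.

```python
from statistics import mean

def passes_exit_criterion(baseline_scores, post_chaos_scores,
                          tolerance=0.05, min_samples=50):
    """Return True if the post-chaos mean score is within `tolerance`
    (relative) of the baseline mean and enough outputs were sampled."""
    if len(post_chaos_scores) < min_samples:
        return False  # too few samples to declare the system stable
    baseline = mean(baseline_scores)
    if baseline == 0:
        return False  # a zero baseline cannot anchor a relative comparison
    relative_drop = (baseline - mean(post_chaos_scores)) / baseline
    return relative_drop <= tolerance

# Hypothetical scores: locked baseline, then two post-chaos samples.
baseline = [0.80] * 50
healthy = [0.78] * 50    # 2.5% below baseline
degraded = [0.70] * 50   # 12.5% below baseline

print(passes_exit_criterion(baseline, healthy))   # True
print(passes_exit_criterion(baseline, degraded))  # False
```

The scores themselves would come from whatever scorer you already trust (the observer's heuristic, RAGAS, or an LLM judge); the point of the sketch is only that the pass/fail decision is computed against a baseline captured before the failure was injected.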

By Sayali Patil
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions

When AWS announced Lambda Durable Functions at re:Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test.

I decided to build a simple document processing workflow, an ETL pipeline with human-in-the-loop approval, using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system. What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet.

In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging.

The Problem: Approval Workflows Are Expensive

If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours.

Traditional approaches to handling this waiting are:

  • Polling: Your code keeps checking every 30 seconds ("Is it approved yet? How about now?"), making those calls the entire time.
  • Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.
  • External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers. It works fine, but it requires you to maintain a state machine you built yourself.

What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off.
That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern. What We Are Building Here is the workflow we will implement: Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file. The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute. Architecture Overview The project uses three separate SAM stacks: Markdown shared-resources/ # Approval API, DynamoDB, SNS (shared by both systems) durable-functions/ # Lambda Durable Functions ETL pipeline step-functions/ # Step Functions ETL pipeline (for comparison) The shared approval handler serves for both workflow types using a single API. When a job comes in for approval, it checks the workflowType field, and if it is durable-functions, it calls send_durable_execution_callback_success.If step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood. Prerequisites Before we begin, make sure you have the following: AWS SAM CLI (latest version recommended)Python 3.14 runtime AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway accessDocker for local Lambda testing Check your SAM CLI version: Markdown sam --version Step 1: Deploy Shared Resources First Before the ETL pipeline, we need the shared infrastructure — the approval API, DynamoDB table for pending approvals, and SNS topic for notifications. 
Here is the shared-resources/ SAM template: YAML # shared-resources/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Shared resources for ETL approval workflow Parameters: ApproverEmail: Type: String Description: Email address to receive approval notifications Default: [email protected] Resources: PendingApprovalsTable: Type: AWS::DynamoDB::Table Properties: TableName: etl-pending-approvals BillingMode: PAY_PER_REQUEST AttributeDefinitions: - AttributeName: jobId AttributeType: S KeySchema: - AttributeName: jobId KeyType: HASH TimeToLiveSpecification: AttributeName: ttl Enabled: true ApprovalNotificationTopic: Type: AWS::SNS::Topic Properties: TopicName: etl-approval-notifications Subscription: - Endpoint: !Ref ApproverEmail Protocol: email ApprovalApi: Type: AWS::Serverless::Api Properties: Name: ETL-Approval-API StageName: prod ApprovalHandlerFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETL-Approval-Handler CodeUri: ./src Handler: approval_handler.handler Runtime: python3.14 MemorySize: 256 Timeout: 30 Environment: Variables: APPROVALS_TABLE: !Ref PendingApprovalsTable Policies: - DynamoDBCrudPolicy: TableName: !Ref PendingApprovalsTable - Version: '2012-10-17' Statement: - Effect: Allow Action: - states:SendTaskSuccess - states:SendTaskFailure - lambda:SendDurableExecutionCallbackSuccess - lambda:SendDurableExecutionCallbackFailure Resource: '*' Events: ApproveJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /approve/{jobId} Method: POST RejectJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /reject/{jobId} Method: POST GetJobStatus: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /status/{jobId} Method: GET Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach, one API that works with both workflow types. 
Deploy it:

Markdown

cd shared-resources
sam build
sam deploy --guided

Step 2: The Durable Functions SAM Template

Now for the ETL pipeline itself, built on Durable Functions. The key addition is the DurableConfig property, which tells Lambda to enable durable execution for your function.

YAML

# durable-functions/template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Lambda Durable Functions ETL Pipeline

Globals:
  Function:
    Runtime: python3.14
    Architectures:
      - arm64
    MemorySize: 512
    Timeout: 900

Resources:
  ETLOrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: ETLDurableOrchestrator
      CodeUri: ./src
      Handler: handlers/etl_handler.lambda_handler
      MemorySize: 1024
      Timeout: 900
      DurableConfig:
        ExecutionTimeout: 86400    # 24 hours for human approval
        RetentionPeriodInDays: 14  # Keep execution history for debugging
      AutoPublishAlias: live
      Policies:
        - AWSLambdaBasicExecutionRole
        - S3CrudPolicy:
            BucketName: !Sub "${RawBucketName}-${AWS::AccountId}"
        - S3CrudPolicy:
            BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
        - DynamoDBCrudPolicy:
            TableName: !Ref ETLMetadataTable
        - DynamoDBCrudPolicy:
            TableName: etl-pending-approvals
        - SNSPublishMessagePolicy:
            TopicName: etl-approval-notifications
      Events:
        S3Upload:
          Type: S3
          Properties:
            Bucket: !Ref RawDataBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: uploads/
                  - Name: suffix
                    Value: .csv
      Environment:
        Variables:
          PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
          METADATA_TABLE: !Ref ETLMetadataTable
          APPROVALS_TABLE: etl-pending-approvals
          APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn
          APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl

A few things to notice here:

  • MemorySize: 1024 on the orchestrator (overrides the 512 MB global default). Since this single function does all the work, it needs more memory.
  • ExecutionTimeout: 86400 – This is the total workflow duration across all invocations (24 hours).
The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.AutoPublishAlias: live – AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.S3 filter with prefix: uploads/ and suffix: .csv – Only CSV files under the uploads/ directory trigger the workflow.The stack imports shared resources via !ImportValue the approval table, SNS topic, and API URL from the shared stack. Step 3: Writing the Durable Function This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code. First, the individual ETL steps. Each one is a regular Python function in a separate file: Extract Python import csv import io import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def extract_data(source_bucket, source_key, step_context=None): logger.info(f"Extracting from s3://{source_bucket}/{source_key}") response = s3_client.get_object(Bucket=source_bucket, Key=source_key) content = response["Body"].read().decode("utf-8") reader = csv.DictReader(io.StringIO(content)) records = list(reader) schema = { "columns": reader.fieldnames, "source_file": source_key, "file_size_bytes": response["ContentLength"] } logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns") return {"data": records, "record_count": len(records), "schema": schema} Transform Python import logging from datetime import datetime logger = logging.getLogger() def transform_data(raw_data, schema_config, step_context=None): logger.info(f"Transforming {len(raw_data)} records") valid_records, rejected_records = [], [] for i, record in enumerate(raw_data): try: cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()} if not cleaned.get("id") or not cleaned.get("name"): 
rejected_records.append({"index": i, "reason": "Missing required field"}) continue if "date" in cleaned: cleaned["date"] = normalize_date(cleaned["date"]) cleaned["_processed_at"] = datetime.utcnow().isoformat() for key in ["amount", "quantity", "price"]: if key in cleaned and cleaned[key]: try: cleaned[key] = float(cleaned[key]) except ValueError: cleaned[key] = None valid_records.append(cleaned) except Exception as e: rejected_records.append({"index": i, "reason": str(e)}) return { "data": valid_records, "valid_records": len(valid_records), "rejected_records": len(rejected_records), "rejection_details": rejected_records[:100] } def normalize_date(date_str): for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]: try: return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d") except ValueError: continue return date_str Load Python import json import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def load_data(transformed_data, target_bucket, target_key, step_context=None): logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}") output_lines = "\n".join(json.dumps(r) for r in transformed_data) s3_client.put_object( Bucket=target_bucket, Key=target_key, Body=output_lines.encode("utf-8"), ContentType="application/jsonl", Metadata={"record_count": str(len(transformed_data))} ) summary = { "record_count": len(transformed_data), "columns": list(transformed_data[0].keys()) if transformed_data else [], "sample_records": transformed_data[:3] } return {"target_path": f"s3://{target_bucket}/{target_key}", "record_count": len(transformed_data), "summary": summary} Notice the steps are plain Python functions — no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context. 
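Because the steps are plain functions, they can be exercised with ordinary assertions and no durable runtime. The sketch below copies normalize_date from the transform step so that it is self-contained; in the project itself you would import it from the transform module instead. The sample dates are invented for illustration.

```python
from datetime import datetime

def normalize_date(date_str):
    # Copied from the transform step: try each known format in order and
    # fall back to returning the input unchanged.
    for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return date_str

# Each supported input format normalizes to ISO style.
assert normalize_date("2025-01-31") == "2025-01-31"
assert normalize_date("01/31/2025") == "2025-01-31"
assert normalize_date("31-01-2025") == "2025-01-31"
assert normalize_date("2025/01/31") == "2025-01-31"

# Unparseable input is passed through rather than raising.
assert normalize_date("not a date") == "not a date"
print("all normalize_date checks passed")
```

One behavior worth knowing when writing such tests: dash-separated inputs like 03-04-2025 are read as day first, because "%d-%m-%Y" is the only dash format that accepts a leading two-digit field.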
Now the main ETL orchestrator that ties it all together: Python import json import os import logging from datetime import datetime from aws_durable_execution_sdk_python import durable_execution, DurableContext from steps.extract import extract_data from steps.transform import transform_data from steps.load import load_data from steps.finalize import finalize_job logger = logging.getLogger() logger.setLevel(logging.INFO) PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET") METADATA_TABLE = os.environ.get("METADATA_TABLE") @durable_execution def lambda_handler(event, context: DurableContext): # Handle both S3 event format and direct invocation if "Records" in event: s3_event = event["Records"][0]["s3"] source_bucket = s3_event["bucket"]["name"] source_key = s3_event["object"]["key"] else: source_bucket = event.get("bucket") source_key = event.get("key") # Generate job_id deterministically using context.step() job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) context.logger.info(f"Starting ETL job: {job_id}") # Step 1: Extract extracted = context.step( lambda _: extract_data(source_bucket, source_key, None), name="extract-data" ) context.logger.info(f"Extracted {extracted['record_count']} records") # Step 2: Transform transformed = context.step( lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None), name="transform-data" ) # Step 3: Load loaded = context.step( lambda _: load_data(transformed["data"], PROCESSED_BUCKET, f"processed/{job_id}/output.jsonl", None), name="load-data" ) # --- EXECUTION PAUSES HERE --- # The submitter function stores the callback_id in DynamoDB # and sends an SNS notification to the reviewer. # No compute charges while waiting for approval. 
def submit_for_approval(callback_id: str, ctx): return notify_reviewer(job_id, callback_id, loaded["summary"]) approval = context.wait_for_callback( submitter=submit_for_approval, name="quality-check-approval" ) # Parse approval result if isinstance(approval, str): approval = json.loads(approval) if not approval or not approval.get("approved"): return {"status": "REJECTED", "job_id": job_id, "reason": approval.get("reason", "No reason")} # Step 4: Finalize (only runs after approval) final = context.step( lambda _: finalize_job(job_id, source_bucket, source_key, loaded, approval, METADATA_TABLE, None), name="finalize-job" ) return { "status": "COMPLETED", "job_id": job_id, "records_processed": transformed["valid_records"], "output_path": loaded["target_path"], "approved_by": approval.get("reviewer"), "completed_at": final["completed_at"] } Let me break down the important parts: @durable_execution – This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.context.step(lambda _: ..., name="...") – Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.context.wait_for_callback(submitter=..., name="...") – This is the zero-cost waiting magic. The submitter function receives a callback_id which gets stored in DynamoDB. Execution then pauses completely — Lambda saves the state, shuts down, and you stop paying.Determinism matters – Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently. 
The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. 
Step 4: The Shared Approval Handler When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution: Python # shared-resources/src/approval_handler.py (key excerpt) if workflow_type == 'durable-functions': callback_id = approval_record.get('callbackId') if approved: lambda_client.send_durable_execution_callback_success( CallbackId=callback_id, Result=json.dumps(approval_response) ) else: lambda_client.send_durable_execution_callback_failure( CallbackId=callback_id, Error='JobRejected', Cause=reason or 'Job rejected by reviewer' ) elif workflow_type == 'step-functions': task_token = approval_record.get('taskToken') if approved: stepfunctions.send_task_success( taskToken=task_token, output=json.dumps(approval_response) ) Same API, same reviewer experience — the underlying callback mechanism is the only thing that differs. Step 5: Deploy and Test Deploy in order (shared resources first, since the other stacks import from it): Markdown # 1. Deploy shared resources cd shared-resources sam build && sam deploy --guided # 2. 
Deploy Durable Functions cd ../durable-functions sam build && sam deploy --guided Generate test data: Markdown python scripts/generate_test_data.py --count 10 --output test-data/ Upload files to trigger the workflow (note the uploads/ prefix — the S3 filter requires it): Markdown aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive Check approval status and approve: Markdown # Check status curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id> # Approve curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \ -H "Content-Type: application/json" \ -d '{"reviewer": "harpreet", "reason": "Data looks good"}' For bulk approvals during testing, the repo includes a handy script: Markdown ./scripts/approve_all_jobs.sh For local testing, the testing SDK supports pytest: Markdown pip install aws-lambda-durable-execution-sdk-testing pytest durable-functions/tests/ Step 6 (Optional): Deploy Step Functions for Comparison If you want to reproduce my full comparison, deploy the Step Functions stack too: Markdown cd step-functions sam build && sam deploy --guided Here is what the same workflow looks like in Amazon States Language: JSON { "StartAt": "ExtractData", "States": { "ExtractData": { "Type": "Task", "Resource": "${ExtractFunctionArn}", "ResultPath": "$.extractResult", "Next": "TransformData" }, "TransformData": { "Type": "Task", "Resource": "${TransformFunctionArn}", "ResultPath": "$.transformResult", "Next": "LoadData" }, "LoadData": { "Type": "Task", "Resource": "${LoadFunctionArn}", "ResultPath": "$.loadResult", "Next": "WaitForApproval" }, "WaitForApproval": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "${ApprovalFunctionArn}", "Payload": { "taskToken.$": "$$.Task.Token", "jobId.$": "$.loadResult.job_id", "summary.$": "$.loadResult.summary" } }, "TimeoutSeconds": 86400, "ResultPath": "$.approvalResult", "Next": 
"CheckApproval" }, "CheckApproval": { "Type": "Choice", "Choices": [{ "Variable": "$.approvalResult.approved", "BooleanEquals": true, "Next": "FinalizeJob" }], "Default": "JobRejected" }, "JobRejected": { "Type": "Pass", "Result": { "status": "REJECTED" }, "End": true }, "FinalizeJob": { "Type": "Task", "Resource": "${FinalizeFunctionArn}", "End": true } } }

Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing.

The Real Cost Numbers

Now, here is the part that made me rebuild a mental model I had about serverless orchestration costs. I ran 1,000 CSV files through this exact workflow, both with Durable Functions and with the Step Functions implementation. The approval wait averaged about 20 minutes per document.

| Cost Component | Durable Functions | Step Functions | Difference |
| --- | --- | --- | --- |
| Lambda invocations | $0.000358 | $0.001 | -64% |
| Lambda duration | $0.0308 | $0.0179 | +72% |
| State transitions | $0.000 | $0.175 | -100% |
| DynamoDB | $0.003 | $0.003 | 0% |
| S3 operations | $0.010 | $0.010 | 0% |
| TOTAL | $0.044 | $0.207 | -79% |

Source: AWS CloudWatch Metrics

The 79% lower total is driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. The ASL workflow has 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected/FinalizeJob). For 1,000 workflows, that is 7,000 transitions, which costs $0.175. That single line item (state transitions) is 84% of the total Step Functions cost.

Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018) because the durable function runs with 1,024 MB memory (single function handling all work) compared to Step Functions using 512 MB per function across five smaller functions.
At scale, the difference adds up quickly:

| Daily Volume | Durable Functions/year | Step Functions/year | Annual Savings |
| --- | --- | --- | --- |
| 1,000/day | $16.06 | $75.56 | $59.50 |
| 10,000/day | $160.60 | $755.60 | $595 |
| 100,000/day | $1,606 | $7,556 | $5,950 |

And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers.

Understanding the Replay Model

One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why. The checkpoint/replay model means each workflow requires a minimum of 2 invocations:

  • Initial invocation: the S3 trigger fires, and the function runs generate-job-id → extract → transform → load → submit-for-approval → pause.
  • Resume invocation: the callback is received, the function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step.

So the theoretical minimum is 2,000 invocations for 1,000 workflows. The actual number was 1,788 because some workflows were still pending approval when I collected the metrics over the 24-hour measurement window.

The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed.

Python

job_id = context.step(
    lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-"
              f"{source_key.split('/')[-1]}",
    name="generate-job-id"
)

That is exactly why the job_id generation in our code uses context.step(). Without it, the timestamp would change on every replay.
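The effect of checkpointing on replay can be illustrated with a toy model. To be clear, this is not the AWS SDK: `ToyContext` and its `step` method are hypothetical stand-ins that only mimic the idea of storing a step's result by name so that a replay returns the stored value, and the toy `step` takes a zero-argument callable rather than the SDK's `lambda _: ...` form. A counter stands in for a timestamp so the output is predictable.

```python
import itertools

class ToyContext:
    """Hypothetical stand-in for a durable context (not the AWS SDK)."""

    def __init__(self, checkpoints):
        # Checkpoints live outside the invocation so they survive a "resume".
        self.checkpoints = checkpoints

    def step(self, fn, name):
        # First run: execute and store the result. Replay: return the stored value.
        if name not in self.checkpoints:
            self.checkpoints[name] = fn()
        return self.checkpoints[name]

_counter = itertools.count(1)

def non_deterministic():
    # Stands in for a timestamp or random value: different on every call.
    return next(_counter)

def handler(ctx):
    unsafe = non_deterministic()                        # recomputed on every replay
    safe = ctx.step(non_deterministic, name="job-id")   # checkpointed once
    return unsafe, safe

store = {}                           # simulated durable checkpoint storage
first = handler(ToyContext(store))   # initial invocation
replay = handler(ToyContext(store))  # resume: the handler runs again from the top
print(first, replay)                 # (1, 2) (3, 2)
```

The value produced outside a step changes between the two runs, while the value produced inside the step is identical on replay, which is the property the job_id generation relies on.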
Here are some other examples where your code must be deterministic and how to avoid that: Deterministic IsssuesWhy It BreaksSolutionMath.random()Different value on every replayWrap in context.step()Date.now()Time keeps moving forwardUse context.timestamp or wrap in a stepGlobal variablesMight change between replaysPass state through function argumentsExternal API callsNetwork is a lieAlways wrap in context.step()Iterating over Map or SetIteration order can vary by runtimeUse arrays or ensure stable ordering When Not to Use Durable Functions I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story. Choose Step Functions when: Visual debugging matters. The step function state machine execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. With Durable functions, AWS did provide visual analysis, monitoring, and debugging as well, but its little more developer-friendly. Multi-service orchestration. Step Functions has 220+ native AWS service integrations. DynamoDB, SQS, SNS, ECS, and Glue without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive. Choose Durable Functions when: Cost optimization is the priority (79% savings at scale)Workflows are Lambda-centric (your logic lives in Lambda code anyway)You prefer writing orchestration in Python/TypeScript/over Amazon States Language. 
    AWS has also just released Lambda Durable Functions for Java in developer preview.
  • Your logic is complex, and your developers prefer a general-purpose programming language over the declarative ASL.

AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services.

Concurrency Planning: A Quick Note

One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand.

In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services; the difference is just where the concurrency demand concentrates.

Wrapping Up

Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting.

The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real, driven entirely by the elimination of state transition charges.

That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here; it depends on what your team values more.
The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.

By Harpreet Siddhu
Your AI Agent Tests Are Passing, But Your Agent Is Still Broken

I was building an AI agent that reads log files, calls APIs, and runs tools based on user instructions. Standard stuff for anyone working with LLM-based automation today. I wrote Playwright tests for it. The tests were green. The agent was lying. Here is what happened, and what I had to build to fix it.

The Trap I Walked Into

As covered in Building a New Testing Mindset for AI-Powered Web Apps, "unlike a rules-based form, the AI agent might phrase the same question differently each time, making it impossible to write a single pass/fail test script." I hit this immediately. My first test looked like this:

TypeScript
    expect(output).toBe("I read logs/test-results.log. Summary: 2 tests failed, 8 passed.");

It passed last week. It failed this week. The model said:

Plain Text
    I checked logs/test-results.log. Summary: 8 passed, 2 failed.

Same meaning, but different words and a different order, and the test broke. So I switched to snapshots: same problem, bigger diffs. Then I tried regex, which was fragile and impossible to maintain. Then I checked only the HTTP status and "no crash", and the tests went green while the agent picked the wrong tool entirely or gave a confident, wrong answer. After all of that, I realized the issue: I was treating LLM output like fixed copy. I was testing the model's writing style, not the agent's behavior.

The Bug That Changed How I Think About This

This is the one that made the problem concrete for me. The task: "Read notes/meeting.txt and give me a one-line summary." My test:

TypeScript
    expect(reply.trim().length).toBeGreaterThan(0);

The agent returned a perfectly normal sentence. Test passed. What actually happened: the model never read the file. It guessed a plausible summary from the prompt alone and returned it as if it had done the work. The reply was non-empty, so the assertion was satisfied. That test wasn't checking agent behavior. It was checking that the model could generate a sentence, which it always can.
The question I needed to answer was not "did it return text?" but "did it actually call the file-reader tool?" Those are different questions entirely. What to Test Instead Effectively Managing AI Agents for Testing puts it well: agents are best understood as a system prompt combined with state, memory, and a selection of tools. That definition is exactly why testing them requires a different approach — you are testing decisions, not return values. When I stepped back, I realized agent testing has three distinct layers that traditional assertions don't cover: Decisions – which tool did it pick, and did it pick the right one?Sequence – for multi-step tasks, did it follow a valid order?Output rules – does the answer satisfy flexible behavioral rules, not a frozen string? None of these maps cleanly to expect(output).toBe(...) What I Built I built AgentAssert - a Playwright-based reference implementation of five testing patterns for agents that call tools. The core idea: instead of asserting on the final text, assert on the trace — a complete log of every decision the agent made, every tool it called, and every result it received. TypeScript const trace = await agent.run("Read logs/app.log and summarize errors"); // Did it actually use the tool? AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ }); // Did it say the right kind of thing? AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION); The five patterns the repo demonstrates: Pattern 1 – Tool Invocation: Did the agent call the right tool? This catches the meeting.txt class of bug - a confident-sounding answer with no actual work behind it. Pattern 2 – Behavior Contracts: Does the output satisfy flexible rules (required fields, must-include concepts, forbidden phrases) without requiring exact wording? The contract matcher is rule-based - keywords and patterns - not a second AI model. It is inspectable and cheap to run. 
Pattern 3 – Multi-Step Trace Verification: For tasks that require two tools in sequence, did the agent follow the right order? Browser tests check page state. These tests check the agent's internal reasoning path. Pattern 4 – Boundary Enforcement: Did the agent stay within its allowed tools, or did it hallucinate tool names and try to call things it shouldn't? This one catches scope creep early. Pattern 5 – Failure Observability: When a tool errors, does the agent report the failure honestly or claim success anyway? Most agent test suites never simulate tool failures. This pattern forces it. Why Playwright and not Jest This repo uses Playwright as the test runner, which surprised a few people who reviewed it. Playwright is usually a browser testing tool. The reason is practical. Agent tests are slow and flaky by nature — LLM responses vary, API calls take time. Playwright gives you per-test timeouts, built-in retries, HTML reports with attachments, and worker-level isolation. Jest requires plugins or manual configuration for all of that. When a behavioral test fails, the HTML report shows the full agent trace attached directly to the failure — which tool ran, in what order, and what the model said at each step. Playwright's capabilities go well beyond browser testing. Master API Testing with Playwright covers how it handles retries, timeouts, and network interception for backend flows. AgentAssert builds on those same strengths - applied to LLM tool-call loops instead of HTTP endpoints. Using Playwright without a browser is unconventional. But for this problem, it fits better than the alternatives. What This Doesn't Solve The contract matcher works on keywords and patterns. If the agent says "unable to locate the file" instead of "file not found" and your contract only lists one phrasing, it may fail even though the meaning is the same. This is a real limitation. More sophisticated approaches exist. 
5 Agent CI/CD Evaluation Best Practices describes using an LLM-as-judge with soft and hard failure thresholds. That approach is more powerful but adds cost and latency. The contract matcher here is deliberately simpler - inspectable rules you can read and tune in one file. This repo also does not test security, production monitoring, or external system behavior. It tests what you define rules for. The value lies in catching common failures — wrong tool, wrong order, false success, scope violations— at a low cost and with repeatability in CI. The Shift in Mindset When I finished building this, the thing that stuck was not the code. It was the reframe. Software Testing in the LLM Era describes how the tester's role is moving from executing scripts to validating AI decisions. The five patterns in this repo are one practical step in that direction. Agents are not functions. You cannot test them the way you test a function that returns a fixed value for fixed input. An agent makes decisions. You need to test the decisions — what it chose to do, in what order, and whether it stayed honest when things went wrong. The code is at github.com/bireshpatel/agent-assert. It is a reference implementation, not a published library. Copy the patterns, adapt the framework to your agent, and replace the sample tools with your own.
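The trace-first idea is not tied to Playwright or TypeScript. As a rough illustration, here is a minimal Python sketch of the first two patterns (tool invocation and rule-based behavior contracts). It does not reproduce AgentAssert's API: ToolCall, Trace, tool_was_invoked, and satisfies_contract are hypothetical names, and the traces are hand-built sample data rather than real agent runs.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    output: str
    tool_calls: list = field(default_factory=list)


def tool_was_invoked(trace, tool_name, arg_patterns=None):
    """True if some call to tool_name has args matching every regex pattern."""
    arg_patterns = arg_patterns or {}
    for call in trace.tool_calls:
        if call.name != tool_name:
            continue
        if all(re.search(pattern, str(call.args.get(key, "")))
               for key, pattern in arg_patterns.items()):
            return True
    return False


def satisfies_contract(output, must_include=(), forbidden=()):
    """Rule-based check: required concepts present, forbidden phrases absent."""
    text = output.lower()
    return (all(word.lower() in text for word in must_include)
            and not any(phrase.lower() in text for phrase in forbidden))


# An honest run: the agent called the file reader before answering.
honest = Trace(
    output="Summary: 8 passed, 2 failed.",
    tool_calls=[ToolCall("file-reader", {"filePath": "logs/test-results.log"})],
)
# A guessed answer: plausible text, but no tool call behind it.
guessed = Trace(output="Summary: 2 tests failed, 8 passed.")

print(tool_was_invoked(honest, "file-reader", {"filePath": r"\.log$"}))
print(tool_was_invoked(guessed, "file-reader"))
```

Both outputs would satisfy a wording-tolerant contract such as must_include=("passed", "failed"), but only the honest trace passes the tool-invocation check, which is the distinction a non-empty-string assertion cannot make.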

By Biresh Patel
Why Your DLP Policies Fall Short the Moment AI Agents Enter the Picture

I have been working in enterprise data security for a while now, and I have watched the threat landscape shift many times. Ransomware, phishing, insider threats, and cloud misconfigurations. Each wave brought new problems, and organizations learned, adapted, and invested. But what is happening today with AI agents feels different. It is not just a new attack vector. It is a fundamental change in how data moves inside an organization, and most security teams are not ready for it. Let me explain what I mean. Traditional Data Loss Prevention (DLP) was designed with a pretty clear mental model: a human employee sits at a computer, touches sensitive data, and either accidentally or intentionally tries to move it somewhere they should not. Your DLP policy watches for that. It flags the email with the credit card numbers, blocks the USB upload, or quarantines the cloud sync. It works because there is a human in the loop, and human behavior has patterns that security tools can learn. AI agents break that model entirely. An agent does not hesitate before accessing a file. It does not trigger behavioral anomalies because it was granted permission to do exactly what it is doing. It can read thousands of documents in the time it takes a human to open one. And if it is compromised, misconfigured, or simply pointed at the wrong thing, it can exfiltrate data at a scale and speed that no human attacker could match. That is the invisible threat, and it is sitting inside enterprise environments right now. Why AI Agents Are Different From Every Other Threat Before getting into the specific risks, it is worth stepping back to understand what makes agentic AI architecturally different from previous automation tools. Traditional automation scripts or bots were narrow. They did one thing. A script that pulled a report from a database every morning did not have the context or capability to go read your HR files or send data to an external API. 
The attack surface was small and well-defined. AI agents, by contrast, are designed to be general-purpose. They use large language models to reason about tasks, and they are given tools: the ability to read files, call APIs, browse the web, write to databases, send messages, and interact with other services. This is what makes them powerful for automation. It is also what makes them dangerous from a security standpoint. When you give an agent access to your document store to help employees find information faster, you have also given it, in principle, the ability to read everything in that store. When you connect it to your email system so it can draft replies, you have opened a channel through which data can flow. The agent is not malicious. It is doing exactly what it was built to do. The problem is that the existing security infrastructure was never designed to supervise something that behaves like a trusted user but operates at machine scale. The 5 Data Security Risks Unique to AI Agents 1. Over-Permissioned OAuth Scopes and Shadow Data Access This is the one I see most often in enterprise deployments, and it is almost always accidental. When development teams integrate an AI agent with a SaaS platform, whether it is SharePoint, Google Drive, Salesforce, or Slack, they need to grant the agent API access. The path of least resistance is to grant broad OAuth scopes. Read all files. Access all channels. The agent needs it for the use case, so the scope gets approved, and nobody revisits it. What this creates is a situation where the agent has access to data it will never actually need for its intended job, but which it can reach if something goes wrong. A prompt injection attack, a bug in the agent's reasoning logic, or a malicious instruction buried in a document the agent was asked to summarize could all redirect the agent to access and transmit that shadow data. 
The NASA ITAR filtering issue from 2019 is a useful reference here, even though it predates AI agents. A security control that was too broad caused operational disruption. The same principle applies in reverse: an agent granted too-broad access can cause a data exposure that was never intended by anyone in the organization. 2. Prompt Injection Leading to Data Leakage Prompt injection is probably the most discussed AI security risk right now, and for good reason. The basic idea is that an attacker can embed instructions inside content that the agent will read, effectively hijacking the agent's behavior. Here is a concrete scenario. An enterprise deploys an AI agent that monitors incoming emails and summarizes them for executives. An attacker sends a carefully crafted email that contains, embedded in normal-looking text, instructions telling the agent to forward all emails it reads to an external address. If the agent's output layer is not properly sandboxed, this kind of attack can succeed without the attacker ever breaking into any system. They just sent an email. This is qualitatively different from phishing. Phishing targets humans and relies on human error. Prompt injection targets the agent and relies on the agent doing exactly what it was designed to do, which is to follow instructions in its input. From a DLP perspective, the data exfiltration looks like authorized activity because the agent was authorized to send data. 3. Retrieval-Augmented Generation Pipelines Pulling Sensitive Context RAG systems, where an agent retrieves documents from an internal knowledge base to ground its responses, are becoming standard in enterprise AI deployments. They are genuinely useful. They are also a data security problem that most teams have not fully thought through. When a user asks a RAG-enabled agent a question, the system searches the knowledge base and pulls in relevant documents as context for the model. The model then uses that context to generate a response. 
The issue is that the retrieval step is often not governed by the same access controls as direct document access. An employee who does not have permission to read a particular HR policy document might be able to ask the agent a question that causes the agent to retrieve and summarize that document for them. This is not a hypothetical. It is a real architectural gap that exists in many early-stage enterprise RAG deployments. The knowledge base was indexed without granular access metadata, and the retrieval system does not know whether the person asking the question should have access to the documents it is about to surface. 4. Agent-to-Agent Data Passing With No Human Review The next wave of enterprise AI is multi-agent systems, where specialized agents hand off tasks to each other. An orchestrator agent receives a request, breaks it into subtasks, delegates those subtasks to specialized agents, and aggregates the results. This is efficient. It is also a chain of data handling that has no human checkpoint anywhere in the middle. From a security standpoint, this creates what I would call a provenance problem. When data moves through three or four agent hops before producing a final output, it becomes very difficult to audit what data was accessed, what was transmitted between agents, and where the output ended up. Traditional DLP watches data at egress points, but in a multi-agent pipeline, the egress points are not always obvious, and intermediate agent-to-agent communication may not be captured at all. The Capital One breach in 2019 demonstrated how a chain of access privileges, even if each individual link looks authorized, can result in catastrophic data exposure. Multi-agent pipelines create the same kind of daisy-chained access, but at a speed and scale that makes the Capital One incident look slow. 5. AI as a Supply Chain Risk This one is less talked about but deserves attention. 
Enterprise organizations are increasingly building agents on top of third-party foundation models and agent frameworks. When you do that, you are trusting not just the model's capabilities but also the data handling practices of the model provider and the framework maintainers. If a third-party agent framework has a vulnerability, or if a model provider's logging and telemetry captures inputs in ways that are not disclosed, your sensitive enterprise data could be at risk in ways that your internal DLP policies have no visibility into. The SolarWinds breach in 2020 showed exactly how supply chain trust can be weaponized. AI infrastructure is the new software supply chain, and most enterprises have not started treating it that way yet. What Breaks in Your Existing DLP Policies Most enterprise DLP policies were designed around a set of assumptions that AI agents violate by default. It is worth being specific about this because the gaps are not immediately obvious. First, DLP systems use behavioral baselines. They learn what normal data access looks like for a given user or endpoint and flag deviations. An AI agent does not behave like a human user. Its access patterns are bursty, high-volume, and systematic in a way that looks suspicious to a human but is entirely normal for an agent. Tuning DLP to accommodate agent behavior without opening holes for actual attackers is genuinely difficult. Second, many DLP policies focus on content inspection at egress: checking what is in an email attachment, what is being uploaded to a cloud service, and what is being printed. They are less equipped to inspect data that is being passed between internal systems or that is loaded into an LLM's context window. The context window is, in effect, a temporary data store that existing DLP tools cannot see into. Third, agent actions are often attributed to the agent's service account rather than the human who initiated the request. 
If something goes wrong, the audit trail points to a service identity, not a person, which makes incident response significantly harder. In my earlier article on DLP policy tuning, I wrote about the importance of finding the balance between protection and usability. With AI agents, that balance has to be rethought from scratch. The old tuning frameworks assume a human actor. Agents are a different category. Mapping the Gap: What Your DLP Covers vs. What Agents Require Traditional DLP AssumptionReality With AI Agents Human actor with behavioral patterns Machine actor with high-volume, systematic patterns Data moves at human speed Data moves at API call speed, thousands of operations per second Egress inspection catches exfiltration Exfiltration can happen inside the context window or between agents Access is tied to user identity Access is tied to service account or OAuth scope Anomaly detection flags unusual behavior Agent behavior looks normal because it was authorized Audit trails point to a person Audit trails point to a service identity Practical Controls: What to Do Today I want to be clear that I am not suggesting organizations should slow down their AI agent deployments. The productivity and operational efficiency gains are real, and the competitive pressure to adopt these technologies is not going away. What I am suggesting is that security needs to be built into the deployment architecture from the start, not layered on afterward. Enforce Least-Privilege Agent Identities Every AI agent should have its own identity, with access scoped to the exact data and systems it needs for its specific function. Not a shared service account. Not a developer's credentials. Not an admin-level OAuth token granted for convenience. This sounds obvious, but in my experience, it is violated in the majority of early enterprise agent deployments because speed of deployment takes priority over access hygiene. 
Work with your identity team to define agent personas the same way you define human user roles. An agent that summarizes customer support tickets should have read access to the support ticket system and nothing else. If it later needs to write back to the system, that permission should be explicitly granted and reviewed, not assumed. Implement Output Inspection Layers If you cannot yet see inside the context window, you can at least inspect what comes out of it. Treat agent outputs the same way you treat email or file uploads in your DLP system. Apply content detection to the agent's final responses and any data it writes to downstream systems. This will not catch everything, but it will catch cases where sensitive data that should not have been surfaced ends up in an agent's output. Security vendors are beginning to build agent-aware DLP capabilities, and this is an area where the product landscape is evolving quickly. Evaluate whether your current DLP vendor has a roadmap for agent output inspection, and if not, that is a conversation worth having with them. Tag Sensitive Data Before It Enters Agent Context This is where classification infrastructure, which I covered in my DLP policies article, becomes even more critical. If your sensitive documents are properly classified and tagged before an agent can access them, you have the foundation for enforcing context-aware retrieval controls. A RAG system that knows a document is tagged as confidential can check whether the requester has access rights before pulling it into context. This requires investment in tagging infrastructure and close collaboration between your data governance team and the teams building the AI systems. It is not trivial. But it is the most durable defense against the RAG access control gap I described earlier. Build Agent Activity Logging Into the Architecture Every action an agent takes should be logged with enough context to reconstruct what happened. 
Which documents were accessed, what queries were sent to external APIs, what data was written where, and who or what triggered the agent's actions. This logging should be centralized and tamper-resistant, and it should be integrated with your security information and event management (SIEM) system. The goal is to ensure that when something goes wrong, and at some point, something will, your incident response team has the information they need to understand what data was exposed and how. Without this, you are flying blind. Treat Third-Party Agent Frameworks as Supply Chain Risk Apply the same vendor security review process to AI frameworks and model providers that you apply to any third-party software vendor. Ask about data handling practices, logging and telemetry, vulnerability disclosure processes, and compliance certifications. If a vendor cannot answer these questions clearly, that is a signal worth paying attention to. For federal customers, this intersects directly with FedRAMP and FISMA requirements, which I covered in my earlier piece on federal data security. The compliance overlay does not change the fundamental architecture question, but it does add a layer of formal verification that can be useful. A Note on Vendor Responsibility I want to end with something I feel strongly about, because it reflects what I have seen in my work with enterprise customers. Security vendors have a responsibility here that goes beyond selling products. Right now, most enterprise security products are not ready for the AI agent threat landscape. DLP tools that work beautifully for human-driven data flows struggle with agent-generated activity. SIEM systems that are great at correlating human behavioral signals have not been updated to understand agent orchestration patterns. Identity platforms that manage human identities well are still figuring out how to handle non-human agent identities at scale. This is not a criticism. It is a statement of where the industry is. 
The technology moved faster than the security tooling, which is how it usually goes. But vendors need to be honest with their customers about these gaps and invest now in the capabilities that enterprise organizations will need over the next 12 to 24 months. The enterprises that will navigate this well are the ones that start the conversation with their security vendors today, before a breach forces the conversation. Ask your DLP vendor how their product handles agent service accounts. Ask your SIEM vendor what their roadmap looks like for multi-agent pipeline visibility. Ask your identity vendor how they plan to support agent persona management. These are not theoretical questions. They are operational requirements. Conclusion AI agents are not going away, and they should not. They represent a genuine step forward in what organizations can accomplish with their data and their people. But every significant capability expansion in enterprise technology has also expanded the attack surface, and this one is no different. The threat is invisible right now because agents look like trusted users. They have credentials, they have permissions, and they perform authorized actions. Traditional security controls are not built to be suspicious of authorized behavior. That is the gap that adversaries will eventually learn to exploit, if they have not already started. The answer is not to slow down AI adoption. The answer is to build the security architecture around it properly: least-privilege agent identities, output inspection, classified data tagging, comprehensive logging, and supply chain rigor for third-party frameworks. None of these is a novel security concept. They are well-understood principles being applied to a new context. Your DLP policies were written for a world where humans moved data. That world still exists, but it now shares space with a world where agents move data faster, on a larger scale, and with less friction than any human ever could. 
It's time to update the playbook.
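The retrieval-time access check described under "Tag Sensitive Data Before It Enters Agent Context" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference to any DLP or RAG product: the Document fields, the "public"/"confidential" labels, the group names, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    classification: str          # e.g. "public" or "confidential"
    allowed_groups: frozenset    # groups permitted to read this document


def authorized(doc, user_groups):
    """Public documents are readable by anyone; others need a shared group."""
    return doc.classification == "public" or bool(doc.allowed_groups & user_groups)


def retrieve_for_user(candidates, user_groups):
    """Drop documents the requester may not read before they reach the context."""
    return [doc for doc in candidates if authorized(doc, user_groups)]


# Hypothetical search results, already tagged at index time.
candidates = [
    Document("kb-1", "How to reset your password", "public", frozenset()),
    Document("hr-7", "Compensation bands (sample text)", "confidential", frozenset({"hr"})),
]

engineer_view = retrieve_for_user(candidates, frozenset({"engineering"}))
hr_view = retrieve_for_user(candidates, frozenset({"hr"}))
print([d.doc_id for d in engineer_view], [d.doc_id for d in hr_view])
```

The point of the design is that the check uses the requester's groups, not the agent's service account, so the agent cannot surface a document the person asking could not open directly.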

By Priyanka Neelakrishnan
RAG Is Not Enough: Advanced Retrieval Architectures Using Vertex AI Search on GCP

Retrieval-augmented generation (RAG) caught on fast — and for good reason. Connecting a large language model to your organization's documents feels like the most natural way to build a useful AI system. You stop relying on what the model memorized during pretraining and start grounding it in knowledge that actually belongs to your business. That promise is real. The problem is that most teams hit a wall somewhere between the prototype and the production deployment, and the wall is almost always the retrieval layer. I've seen this play out repeatedly on Google Cloud projects. A team builds a clean RAG demo: chunk the docs, embed them, store in a vector database, query by similarity, pass context to Gemini. It works beautifully in the sandbox. Then it hits real data — hundreds of thousands of documents, domain-specific terminology, access control requirements, freshness constraints — and the cracks appear fast. Hallucinations creep back in. Responses stop being consistent. Users lose trust. This isn't a problem with RAG as a concept. It's a problem with treating retrieval as a solved, simple step. It isn't. This article walks through why basic pipelines fall short and how to build something that actually holds up in production, using Vertex AI Search and its surrounding GCP ecosystem. Why Simple Vector Search Breaks Down The standard RAG setup works like this: documents get split into chunks, each chunk gets embedded into a vector, those vectors sit in a store, and at query time, you find the most similar vectors to your query embedding. It's clean. It's fast to implement. And it has some predictable failure modes. Semantic similarity isn't precision. Embedding-based search captures "roughly related" well. It struggles when your users need exact matches — product names, internal codes, regulatory identifiers, technical terms with specific meanings in your domain. 
A query for "Form 1099-NEC filing deadlines" might surface chunks that are thematically adjacent but don't actually answer the question. When you're building customer support or compliance tooling, "roughly related" isn't good enough.

There's no enterprise-aware filtering. Real enterprise data has a structure that matters: this document belongs to the legal department, that one expired 18 months ago, and these are only visible to users in the EMEA region. Pure vector search has no concept of any of this. You can add metadata filters as a post-processing step, but stitching that together yourself across different data sources is fragile and hard to maintain.

Single-pass retrieval has no quality gate. Whatever comes back from the vector store goes directly into your context window. There's no step that asks: "Is this document actually authoritative? Is it the latest version? Does it genuinely answer this specific question, or is it just vaguely topical?" Without a ranking layer that can answer those questions, weak context reaches the model, and you get weak, or outright wrong, responses.

These aren't edge cases. They're what production looks like.

What Vertex AI Search Adds to the Stack

Vertex AI Search is Google Cloud's managed search and discovery service, and it was designed specifically to handle the kinds of requirements I just described. Rather than treating retrieval as "find the nearest vectors," it layers several capabilities on top of each other.

Hybrid retrieval is the foundation. Instead of running a purely semantic search or a purely keyword search, Vertex AI Search runs both simultaneously and combines the results. Lexical matching handles exact terms and identifiers. Embedding-based retrieval handles intent and semantic similarity. In practice, this means your retrieval works well whether the user types something precise ("Q3 2024 procurement policy update") or something vague ("how do we handle vendor approvals").
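To show what "combining the results" of a keyword search and a semantic search can look like, here is a minimal sketch of reciprocal rank fusion, one common fusion technique. The article does not say which method Vertex AI Search uses internally, so treat this purely as an illustration; the function name, the document IDs, and the conventional constant k=60 are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers, best match first.
keyword_hits = ["policy-q3-2024", "vendor-faq", "travel-policy"]
semantic_hits = ["vendor-approvals", "policy-q3-2024", "vendor-faq"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they rise above documents that only one retriever found, even one it ranked first.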
Metadata-aware filtering lets you attach structured attributes to your documents at index time — department, region, document type, creation date, access level — and query against them alongside the semantic search. You're not doing this in two steps; it's a single unified operation. This is what makes it practical to build retrieval that respects your organizational structure rather than pretending it doesn't exist. Re-ranking is the quality gate that basic RAG lacks. After the initial retrieval pass, Vertex AI Search applies a ranking model that scores documents based on their relevance to the specific query, their authority, and their freshness. More relevant documents rise to the top; vague or outdated matches get demoted before anything reaches the language model. For use cases where precision matters — legal, compliance, customer support — this step is the difference between a trustworthy system and a liability. One thing worth being honest about: this does add complexity and cost compared to a self-managed vector database. Vertex AI Search is a managed service with its own pricing model, and it introduces a degree of vendor dependency. If you're already heavily invested in GCP and you're hitting the limitations described above, the tradeoff is usually worthwhile. If you're building a small prototype or need cloud-neutral infrastructure, the calculus is different. A Concrete Implementation: Multi-Stage Retrieval The most capable retrieval pipelines on GCP combine Vertex AI Search with Vertex AI Agent Builder. Here's the pattern. The Agent Builder defines an agent that handles the orchestration logic. When a query comes in, the agent doesn't just fire a single retrieval call — it runs a multi-stage process: broad semantic search first, then metadata filtering to scope the results, then re-ranking to surface the best matches, then context assembly before the Gemini call. 
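The multi-stage flow just described can be sketched in plain Python to show how the stages fit together. This is not Agent Builder or Vertex AI Search code: the Chunk fields, the stage functions, and the toy documents and scores are all hypothetical, and the re-rank stage simply sorts by a given score where a real system would re-score with a ranking model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    doc_id: str
    text: str
    department: str
    doc_date: date
    score: float  # relevance from the broad first-pass search (toy values)


def metadata_filter(chunks, department, not_before):
    # Stage 2: scope results by structured attributes.
    return [c for c in chunks if c.department == department and c.doc_date >= not_before]


def rerank(chunks, top_k):
    # Stage 3: keep the best matches; a real re-ranker would re-score with a model.
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]


def assemble_context(chunks):
    # Stage 4: build the text block handed to the generation call.
    return "\n\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)


# Stage 1 output (toy data standing in for a broad semantic search).
broad_results = [
    Chunk("proc-12", "Vendor approvals need two sign-offs.", "procurement", date(2024, 7, 1), 0.82),
    Chunk("proc-03", "Legacy vendor onboarding steps.", "procurement", date(2022, 3, 9), 0.90),
    Chunk("legal-5", "Contract review checklist.", "legal", date(2024, 5, 2), 0.75),
    Chunk("proc-20", "Q3 procurement policy update.", "procurement", date(2024, 8, 15), 0.88),
]

scoped = metadata_filter(broad_results, "procurement", date(2024, 1, 1))
best = rerank(scoped, top_k=2)
context = assemble_context(best)
print(context)
```

Note that the highest-scoring chunk (proc-03) never reaches the context because it fails the date filter, which is exactly the kind of stale-but-similar result a single-pass vector search would have passed through.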
Below is a simplified Python example of querying Vertex AI Search with a metadata filter:

Python

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SearchServiceClient()

# Fully qualified name of the serving config; the IDs are placeholders.
serving_config = client.serving_config_path(
    project="your-project-id",
    location="global",
    data_store="your-data-store-id",
    serving_config="default_config",
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query="vendor approval process for Q3",
    page_size=10,
    # Field names are placeholders and must exist as filterable fields in your
    # data store schema; check the exact expression syntax against the
    # Vertex AI Search filter documentation.
    filter='department = "procurement" AND document_date >= "2024-01-01"',
    query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
        condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
    ),
    spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
        mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
    ),
)

response = client.search(request)

# Print each returned document with its model-assigned scores.
for result in response.results:
    doc = result.document
    print(f"Document: {doc.name}")
    print(f"Relevance score: {result.model_scores}")

This call does several things at once: it applies hybrid search (semantic + keyword) automatically, filters by metadata attributes, and returns model-scored relevance rankings that you can use to decide how much of the result set to include in your context window.

The Production Architecture

When this retrieval layer sits inside a full production system, the overall GCP architecture looks roughly like this: Cloud Run handles the API serving layer, receiving user requests and forwarding them to Agent Builder for orchestration. Agent Builder invokes Vertex AI Search for retrieval, assembles the context window, and calls Gemini for generation. Where the application needs to trigger downstream actions (calling internal APIs, writing to a database, sending a notification), those are handled by Cloud Functions invoked from the agent.

On the observability side, retrieval quality is something you need to actively monitor.
Cloud Logging captures query patterns and latency. BigQuery stores the analytics you need to evaluate ranking quality and response accuracy over time. This is important: retrieval isn't something you tune once and walk away from. Ranking adjustments, content re-indexing, and prompt versioning are ongoing operational tasks. Teams that treat retrieval as a first-class service, with the same instrumentation rigor they'd apply to any production API, get meaningfully better outcomes.

Security is handled at the infrastructure level through GCP IAM and VPC Service Controls, which ensure that document access in the retrieval layer respects the same organizational boundaries as the rest of your data governance.

What Changes When You Build It This Way

The difference between a basic RAG pipeline and this kind of architecture isn't just technical. It changes what you can promise users about the system.

With basic vector search, the honest answer to "why did the system say that?" is usually "we don't really know; it found something similar." With hybrid retrieval, re-ranking, and metadata filters, you have an audit trail. You can show which documents were retrieved, why they scored as they did, and how the context window was assembled. That's the foundation of a system users can actually trust, and that your organization can actually govern.

It's also worth noting what this architecture doesn't solve. It doesn't solve bad data. If your document corpus is inconsistent, poorly maintained, or missing large sections of relevant knowledge, no retrieval system will compensate for that. Retrieval quality is bounded by data quality. Before investing heavily in retrieval architecture, it's worth taking an honest look at the state of the underlying data.

Wrapping Up

The teams that are getting the best results from enterprise GenAI right now aren't just the ones with the best models; they're the ones who took retrieval seriously.
Vertex AI Search gives you the tooling to build a retrieval layer that handles the real demands of enterprise data: hybrid search, structured filtering, intelligent ranking, and continuous optimization. Combined with Agent Builder and Gemini, it becomes the foundation for AI systems that are grounded, auditable, and reliable enough to actually deploy. The shift is conceptual as much as it is technical: retrieval is a first-class architectural concern, not a preprocessing step. Once you start treating it that way, the path to production gets a lot clearer.

By Abdul Rasheed Shaik
AI Paradigm Shift: Analytics Without SQL

The idea of “asking data questions in plain English” has been around for a while, but most implementations never made it into production in a serious way. The usual reason is not the language model itself but everything around it: security boundaries, schema ambiguity, cost control, and the fact that analytics systems are rarely clean enough for unconstrained natural language to work reliably.

What has changed in the last couple of years is not that natural language is suddenly perfect, but that data platforms have started bringing computation, metadata, and AI into the same controlled environment.

One example of this approach is the way agents are being built directly inside data warehouses like Snowflake. The important detail is not the brand itself, but the architectural pattern: the model, the data, and the execution layer sit together rather than being stitched across multiple systems.

That shift changes how analytics tools are designed. Instead of building external “AI layers” on top of a warehouse, teams are embedding the agent logic inside the warehouse itself using tools like Snowpark and managed LLM services such as Snowflake Cortex. The result is a system where natural language is just another input format, not a separate application tier.

From Dashboards to Agent-Driven Querying

Traditional analytics workflows are structured around predefined models: dashboards, semantic layers, and curated datasets. A user question is usually translated into one of these prebuilt views. If the question does not fit, someone writes SQL or updates the dashboard.

Agent-based systems invert this flow. Instead of forcing questions into predefined structures, they attempt to generate the structure dynamically.
At a high level, the flow looks like this:

  1. A user submits a natural language question
  2. The system enriches the prompt with schema and access context
  3. A model generates SQL or an execution plan
  4. The query runs inside the warehouse
  5. Results are returned in a structured form or visualized output

The key difference from earlier “text-to-SQL” experiments is that steps two and three are tightly grounded in the database context. The model is not guessing a schema from generic training data. It is being provided with actual table definitions, column descriptions, and sometimes usage statistics.

This context injection is what makes the system usable in production. Without it, SQL generation tends to fail in subtle ways: incorrect joins, wrong aggregations, or hallucinated columns.

Agent Architecture Inside the Warehouse

A practical implementation of an analytics agent inside a warehouse typically has three layers.

1. Context and Permission Layer

Before any model is called, the system resolves what the user is allowed to see. This includes:

  • Role-based access control
  • Row-level filters
  • Column masking rules
  • Available schemas and tables

This step is often underestimated, but it is what makes the system safe enough for real usage. Without it, natural language becomes a bypass mechanism for data access control.

2. Language Model Translation Layer

Once context is assembled, the prompt is passed into an LLM hosted within the data platform. In Snowflake’s case, this is handled through Cortex services, but the pattern is not unique to any vendor.
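As a rough illustration of how the permission step and the prompt-enrichment step could fit together, the sketch below filters a catalog down to what a role may see and only then builds the prompt. It is plain Python with no call to Cortex or any model; the catalog, the role grants, the function names, and the prompt wording are all hypothetical.

```python
# Hypothetical catalog: table name -> column descriptions.
SCHEMA = {
    "sales.fact_sales": {
        "region": "sales region name",
        "product_name": "product display name",
        "revenue": "gross revenue in USD",
        "transaction_date": "date of the sale",
    },
    "hr.salaries": {
        "employee_id": "internal employee identifier",
        "salary": "annual salary in USD",
    },
}

# Hypothetical role grants: role -> tables the role may query.
ROLE_GRANTS = {
    "sales_analyst": {"sales.fact_sales"},
    "hr_admin": {"hr.salaries"},
}


def visible_schema(role):
    """Return only the tables this role is allowed to see."""
    allowed = ROLE_GRANTS.get(role, set())
    return {table: cols for table, cols in SCHEMA.items() if table in allowed}


def build_prompt(question, role):
    """Combine the user's question with the schema visible to their role."""
    lines = ["You write SQL using only the tables listed below.", ""]
    for table, columns in sorted(visible_schema(role).items()):
        lines.append(f"Table {table}:")
        for name, description in columns.items():
            lines.append(f"  - {name}: {description}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_prompt("Show top 10 products by revenue in Q1 2024", "sales_analyst")
print(prompt)
```

Because the schema is filtered before the prompt is built, the model never sees tables the user cannot query. In a real warehouse the engine's own access policies would still be the enforcement point; the prompt filter only keeps restricted names out of the model's view.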
The model’s job is not just to produce SQL but to produce SQL that is:

  • Syntactically valid
  • Aligned with schema constraints
  • Consistent with security rules
  • Optimized for warehouse execution patterns

For example, a question like, “Show top 10 products by revenue in Q1 2024 grouped by region,” might become:

SQL

SELECT region, product_name, SUM(revenue) AS total_revenue
FROM sales.fact_sales
WHERE transaction_date >= '2024-01-01'
  AND transaction_date < '2024-04-01'
GROUP BY region, product_name
ORDER BY total_revenue DESC
LIMIT 10;

The challenge here is not generating SQL that looks correct, but ensuring it respects business definitions. Revenue, for example, might need to be net of returns or adjusted for currency conversion, depending on the organization.

3. Execution and Governance Layer

Once SQL is generated, it is executed inside the warehouse engine. This is where the architecture becomes important: nothing leaves the system. The same security policies that apply to human-written queries apply here as well.

Because execution happens inside the warehouse, audit logs remain consistent. Every agent action can be traced as a query event, which is important for compliance-heavy environments.

Why Snowpark Matters in This Setup

Tools like Snowpark extend this model beyond SQL generation. Instead of limiting the agent to query rewriting, Snowpark allows it to execute Python-based logic directly next to the data.

This becomes useful when the question is not purely relational. For example: “Forecast next month’s sales for product X.”

A simple SQL query cannot answer this. The agent can instead generate a Snowpark Python job that:

  • Extracts historical time series data
  • Converts it into a DataFrame
  • Applies a forecasting model such as ARIMA or Prophet
  • Writes predictions back into a table

The important point is that the data is never exported to an external notebook environment. The compute moves to the data, not the other way around.
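The shape of such a job can be sketched without Snowpark at all. The example below is a plain-Python stand-in, not Snowpark code: a list replaces the warehouse table, simple exponential smoothing replaces ARIMA or Prophet, and the sales figures, the alpha value, and the function name are made up for illustration.

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """Return a one-step-ahead forecast using simple exponential smoothing.

    The smoothed level starts at the first observation and is updated as
    level = alpha * value + (1 - alpha) * level for each later value.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Steps 1 and 2: hypothetical monthly sales for product X, already loaded in order.
monthly_sales = [100.0, 120.0, 110.0, 130.0]

# Step 3: apply the (deliberately simple) forecasting model.
forecast = exponential_smoothing_forecast(monthly_sales, alpha=0.5)

# Step 4: in a warehouse job this row would be written to a predictions table.
prediction_row = {"product": "X", "forecast_next_month": forecast}
print(prediction_row)
```

In a warehouse-native version, the extraction and the write-back would be operations on tables inside the platform, which is what keeps the data from leaving it.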
This pattern also applies to machine learning inference. Pretrained models can be registered as user-defined functions, and the agent can call them like regular SQL functions:

SQL

SELECT feedback_text, predict_sentiment(feedback_text) AS sentiment_score
FROM customer_feedback;

From a systems perspective, the agent becomes a planner rather than just a translator. It decides whether SQL is sufficient or whether a Python-based workflow is required.

The Streamlit Layer: Turning Queries Into Applications

While the warehouse handles computation and the agent handles reasoning, users still need an interface. One of the simpler ways to build this layer is with Streamlit.

Streamlit is often used because it reduces the overhead of building internal analytics tools. Instead of designing full frontend systems, teams can wrap agent logic into lightweight interactive apps.

A minimal pattern looks like this:

Python

import streamlit as st

# run_agent is not defined in this snippet; it is assumed to return a dict
# with the generated SQL under "sql" and the query results under "data".
st.title("Data Agent Interface")
query = st.text_input("Ask a question about your data")
if query:
    result = run_agent(query)
    st.subheader("Generated SQL")
    st.code(result["sql"])
    st.subheader("Results")
    st.dataframe(result["data"])

In more mature setups, the Streamlit layer becomes more than a query box. It evolves into a dynamic dashboard generator:

  • Charts are generated based on query intent
  • Filters are derived from schema metadata
  • Results can be saved into reusable views
  • Users can refine queries conversationally

This reduces dependency on static dashboards, which are often slow to update and hard to maintain.

Governance Is the Real Constraint, Not AI Capability

A common misconception is that the main challenge in building these systems is model accuracy. In practice, governance is the harder problem.

Three constraints usually define whether an agent system is viable:

  • Data access control must remain intact. Natural language cannot become a bypass layer for restricted data.
  • Query cost must be predictable.
    Poorly generated queries can become expensive quickly in large warehouses.
  • Results must be reproducible. Two identical questions should not produce different interpretations unless the underlying data changes.

Warehouse-native architectures help with this because they centralize execution. There is no separate “AI data layer” that can drift from governance rules.

Limitations of the Current Approach

Despite progress, these systems are not fully autonomous analytics engines. There are still recurring issues:

  • Ambiguous business definitions lead to incorrect aggregations
  • Complex joins across poorly modeled schemas still fail frequently
  • LLMs may overgeneralize metrics like revenue or churn
  • Latency increases when multi-step reasoning is required

In practice, most teams deploy agents as assistants rather than replacements for BI systems. They are good at exploration, not final reporting.

Closing Thoughts

What is emerging is not a replacement for SQL or dashboards, but a new interface layer on top of them. Natural language becomes a routing mechanism that decides how to query or compute over structured data.

The interesting architectural shift is that the intelligence layer is moving closer to the data itself. Whether implemented in Snowflake or other platforms, the pattern is consistent: context-aware models, governed execution, and embedded compute through tools like Snowpark and Cortex.

Streamlit or similar tools then complete the stack by providing a lightweight interface that can evolve from simple query boxes into full analytical applications.

The result is not “analytics without SQL” but something more realistic: analytics where SQL is still present, but no longer the only way in.

By Haricharan Shivram Suresh Chandra Kumar

Culture and Methodologies

Agile

Career Development

Methodologies

Team Management

Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines

May 28, 2026 by Mirco Hering, DZone Core

Feature Flag Debt: Performance Impact in Enterprise Applications

May 27, 2026 by Poornakumar Rasiraju

DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform

May 27, 2026 by Josephine Eskaline Joyce, DZone Core

Data Engineering

AI/ML

Big Data

Databases

IoT

Implementing Secure API Gateways for Microservices Architecture

May 29, 2026 by Mugunth Chandran

Slopsquatting: Building a Scanner That Catches AI-Hallucinated Packages Before They Reach Production

May 29, 2026 by Denis Ermakov

Event-Driven Pipelines With Apache Pulsar and Go

May 29, 2026 by Shivi Kashyap

Software Design and Architecture

Cloud Architecture

Integration

Microservices

Performance

Implementing Secure API Gateways for Microservices Architecture

May 29, 2026 by Mugunth Chandran

Implementing Observability in Distributed Systems Using OpenTelemetry

May 29, 2026 by Mugunth Chandran

5 Common Security Pitfalls in Serverless Architectures

May 29, 2026 by Zac Amos

Coding

Frameworks

Java

JavaScript

Languages

Tools

Event-Driven Pipelines With Apache Pulsar and Go

May 29, 2026 by Shivi Kashyap

Zero-Downtime Deployments for Java Apps on Kubernetes

May 29, 2026 by Ramya vani Rayala

Pragmatica Aether: Let Java Be Java

May 29, 2026 by Sergiy Yevtushenko

Testing, Deployment, and Maintenance

Deployment

DevOps and CI/CD

Maintenance

Monitoring and Observability

Implementing Observability in Distributed Systems Using OpenTelemetry

May 29, 2026 by Mugunth Chandran

Event-Driven Pipelines With Apache Pulsar and Go

May 29, 2026 by Shivi Kashyap

Zero-Downtime Deployments for Java Apps on Kubernetes

May 29, 2026 by Ramya vani Rayala

Popular

AI/ML

Java

JavaScript

Open Source

Slopsquatting: Building a Scanner That Catches AI-Hallucinated Packages Before They Reach Production

May 29, 2026 by Denis Ermakov

LLM-Powered Deep Parsing for Industrial Inventory Search

May 29, 2026 by Andrey Chubin

Zero-Downtime Deployments for Java Apps on Kubernetes

May 29, 2026 by Ramya vani Rayala

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]
