Intro: When Good Models Go Wrong

A few years ago, I spent months working on a microservices-based customer intake processing system for our application. The code was good, the tests were passing, and we had load-tested it at crazy high TPS. Yet, on one particular Tuesday afternoon, a small change to the response schema of an upstream service, where a date field changed from ISO 8601 to epoch milliseconds, cascaded through four downstream services and corrupted a day's transactions without anyone realizing it until it was too late.

We fixed it in a few hours, but the lesson has stayed with me, and it has affected every integration I've worked on since. Crashes are easy to see. Silent data corruption is not.

I see the exact same thing happening with AI and machine learning pipelines today. Except now the consequences are larger and the feedback cycles are slower. A model will not throw an exception if the input schema changes slightly. It will, however, make worse predictions. Quietly. Confidently. For weeks.

In this article, I'd like to propose a solution that brings together two worlds in which I've spent my entire professional life: software engineering's resilience patterns and data governance. My argument is that we need to combine data contracts with the Circuit Breaker pattern to build a proactive defense against the silent data quality failures that undermine AI reliability.

The Real Reason AI Models Fail in Production

There's a general assumption that if a model is not performing well, the problem lies with the model itself: its architecture, its hyperparameters, the training process, and so on. Sometimes this is true, but far more often the cause is simpler and far more solvable: the upstream data changed, and nobody told the model.

“Poor data quality is a silent killer of any AI projects.”

This is consistent with what happens in many production environments. The data engineers did their job, the modelers did their job, and nobody owned the contract between the two.
This might manifest in a number of ways:

Schema drift: A column that has always been a float now starts arriving as a string. The feature engineering process quietly attempts to convert it and introduces errors along the way.
Semantic drift: An attribute named "account_status" that previously only had values such as "ACTIVE", "CLOSED", and "DELINQUENT" now starts carrying a new value, "UNDER_REVIEW", that the model has never seen before. The model maps it to the category that it is closest to in the embedding space, which could be completely incorrect.
Distribution shift: The data source changes how it samples data or alters some other aspect of the data. The model now sees a different distribution from the one it was trained on, yet the schema looks exactly the same, so nothing appears to have changed.
Cadence changes: A batch source that has historically refreshed every day now starts refreshing every hour, or vice versa.

A 2023 Gartner study found that the primary cause of AI project failure is poor data quality, and more than 60% of organizations reported that data issues, rather than model issues, were the primary cause of most of their production incidents.

Figure: Silent data changes propagating through the machine learning pipeline. Upstream data changes flow through the pipeline without error, producing confidently wrong model outputs that go undetected for weeks.

The fundamental issue here is that there is no contract between data producers and data consumers. In the microservices world, we solved this problem a decade ago through API contracts and schema registries. In the data world (and this might sting a little) we're still operating on trust and hope.

This is a data governance and data quality issue. And the good news is that the data management community has a conceptual toolkit to solve this problem.
We just need to integrate it into the AI pipeline.

What Are Data Contracts?

Most teams believe they have data contracts. In reality, they have documentation with good intentions: a wiki page that says "this field is supposed to be a float," or a Slack channel where someone asks, "Hey, did the schema change here?"

A data contract is not documentation. A data contract is enforceable; documentation is not. A real data contract is an enforceable agreement between two parties: the producer (the system or team that produces the data) and the consumer (the system or team that consumes it). It includes:

Schema: The exact structure, data types, and allowed values for every single field.
Semantics: What each field means, including business definitions and edge cases.
Quality: Minimum quality thresholds for completeness, freshness, accuracy, and uniqueness.
SLAs: Service level agreements for delivery cadence, latency, and availability.
Versioning: Rules for how the schema may change, including deprecation schedules for backward-incompatible changes.

Think of it as an API specification for data. Just as OpenAPI (Swagger) standardized how we specify a REST API, data contracts standardize how we specify a data interface. The concept has been getting a lot of traction in the DataOps community. Andrew Jones has been a prominent voice in formalizing data contract specifications, and tools like Soda and Great Expectations provide frameworks for data quality expectations, which are one part of a data contract.

The relevance to AI is hard to overstate: every ML model relies on a set of data assumptions that are typically neither specified nor enforced. When those assumptions are violated, the model starts to deteriorate. A data contract spells out those assumptions and makes them testable and enforceable, bringing the rigor that data stewardship teams have long advocated into the ML pipeline.
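To make the idea of an enforceable contract concrete, here is a minimal Python sketch. It is not a standard contract format: the `FieldRule` and `DataContract` classes, the `customer_accounts` feed name, and the thresholds are all hypothetical, chosen only to show schema, semantics, quality, freshness, and versioning living in one testable object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """Schema and semantic rule for one field (hypothetical structure)."""
    dtype: type
    allowed_values: frozenset = frozenset()  # empty means "any value is allowed"
    max_null_rate: float = 0.0               # quality threshold for completeness


@dataclass(frozen=True)
class DataContract:
    """Illustrative contract: schema, semantics, quality, freshness SLA, version."""
    name: str
    version: str
    fields: dict
    max_staleness_hours: float = 24.0

    def validate(self, rows, age_hours):
        """Return a list of human-readable violations (an empty list means pass)."""
        violations = []
        if age_hours > self.max_staleness_hours:
            violations.append(f"freshness: batch is {age_hours}h old")
        for col, rule in self.fields.items():
            values = [row.get(col) for row in rows]
            nulls = sum(v is None for v in values)
            null_rate = nulls / len(values) if values else 1.0
            if null_rate > rule.max_null_rate:
                violations.append(f"quality: {col} null rate {null_rate:.0%}")
            for v in values:
                if v is None:
                    continue
                if not isinstance(v, rule.dtype):
                    violations.append(f"schema: {col} expected {rule.dtype.__name__}")
                    break  # one report per column is enough for this sketch
                if rule.allowed_values and v not in rule.allowed_values:
                    violations.append(f"semantics: {col} unexpected value {v!r}")
                    break
        return violations


contract = DataContract(
    name="customer_accounts",  # hypothetical feed name
    version="1.0.0",
    fields={
        "account_status": FieldRule(str, frozenset({"ACTIVE", "CLOSED", "DELINQUENT"})),
        "balance": FieldRule(float, max_null_rate=0.05),
    },
    max_staleness_hours=2.0,
)

# A batch carrying the unseen "UNDER_REVIEW" status fails the semantic rule.
violations = contract.validate(
    [{"account_status": "UNDER_REVIEW", "balance": 120.5}], age_hours=0.5
)
print(violations)
```

The point of the sketch is that every assumption is data rather than prose, so a pipeline can call `validate` on each batch and act on the result instead of relying on someone reading a wiki page.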
The Circuit Breaker Pattern: A Primer

You already know what a circuit breaker is; there is one in your house. It trips and shuts off the electricity if the load gets too high. You simply flip it back on to restore service. Simple, elegant, and it has saved many houses from burning to the ground.

Circuit breakers have been around for a long time in software development, popularized by Michael Nygard in his book "Release It!", and they have become a standard pattern for building resilient distributed systems. I have been using the pattern for a long time. We use Spring Cloud Circuit Breaker, backed by Resilience4j, in our microservices-based application to prevent cascading failures in downstream services that are critical to the business.

The circuit breaker works as follows:

Closed state: This is the normal operating state. All requests go through to the downstream service while the circuit breaker monitors the failure rate.
Open state: The circuit breaker has detected a failure rate above a certain threshold. It has "tripped" and stops sending requests to the downstream service. Instead, it immediately returns a fallback response or an error.
Half-open state (recovery probe): After a cooldown period, the breaker allows a limited number of test requests to pass. If they succeed, the breaker closes; otherwise, it returns to the open state.

Figure: Circuit breaker state machine. The breaker changes state based on failure rates and recovery probes.

This pattern has become accessible to every Java developer with the introduction of frameworks such as Spring Cloud Circuit Breaker and Netflix Hystrix. The pattern is simple but very useful. It's all about failing fast.

We have been using this pattern for service-to-service communication for more than a decade. Hundreds of the services on our platform implement the circuit breaker pattern.
If our XXX critical service goes down, we simply trip the circuit breaker and fail gracefully. But if our upstream data source changes schema silently and starts corrupting our ML features? Nothing. No circuit breaker. No fallback. Just degraded features for weeks.

The failure mode is the same: a degraded upstream dependency silently corrupts a downstream consumer. But we didn't have a similar pattern implemented for our data pipelines until we did.

Applying the Circuit Breaker to Data Pipelines

The basic idea is not as complex as it sounds: treat every data input to an AI model as a dependency that can trip a circuit breaker. If we can do this for HTTP calls to other microservices, we can do it for data going into a model.

While a traditional microservice circuit breaker monitors HTTP request error rate and latency, a data circuit breaker monitors the data quality metrics defined in the data contract:

| Circuit Breaker State | Trigger Condition | Action |
| --- | --- | --- |
| Closed (healthy) | All contract quality thresholds met | Data flows normally into the model pipeline |
| Open (tripped) | Quality metrics breach contract thresholds (e.g., null rate > 5%, data more than 2 hours stale, schema mismatch detected) | Data flow is halted; model receives no new input; fallback strategy activates |
| Half-Open (probing) | After cooldown, a sample batch is validated against the contract | If the sample passes, the breaker closes; if it fails, the breaker stays open |

The fallback options when the breaker trips can be:

Stale but safe: Use the last known good data snapshot. The model continues to run, just on slightly outdated, but still good, data.
Graceful degradation: The model continues to run, but flags its output as "low confidence" and sends it to a human for review.
Full halt: For high-stakes applications like fraud detection or compliance, the model simply stops running until the data quality issue is resolved.
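The state table and the "stale but safe" fallback can be sketched in a few lines of Python. This is an illustrative sketch, not the Resilience4j API: the `DataCircuitBreaker` class, its parameters, and the choice to trip after a number of consecutive failing batches (rather than on an aggregated metric window) are assumptions made to keep the example short.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class DataCircuitBreaker:
    """Trips on failed contract checks instead of failed HTTP calls (illustrative)."""

    def __init__(self, failure_threshold=3, cooldown_seconds=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive bad batches before tripping
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.last_good_batch = None                 # the "stale but safe" snapshot

    def process(self, batch, passes_contract):
        """Return (data_for_model, degraded_flag) for one incoming batch."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return self.last_good_batch, True   # still cooling down: serve fallback
            self.state = State.HALF_OPEN            # cooldown over: probe with this batch

        if passes_contract(batch):
            self.state = State.CLOSED
            self.failures = 0
            self.last_good_batch = batch
            return batch, False

        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
        return self.last_good_batch, True           # a failing batch is never forwarded


# Usage with a fake clock and a toy contract check (no nulls allowed).
now = [0.0]
breaker = DataCircuitBreaker(failure_threshold=2, cooldown_seconds=60, clock=lambda: now[0])
no_nulls = lambda batch: all(value is not None for value in batch)

breaker.process([1, 2], no_nulls)    # healthy batch, breaker stays closed
breaker.process([None], no_nulls)    # first failure
breaker.process([None], no_nulls)    # second failure trips the breaker
print(breaker.state)
```

Swapping the return value in the failure path is where the other fallbacks would go: returning the batch with a low-confidence flag for graceful degradation, or raising an error for a full halt.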
This is a fundamental shift from "we'll detect the problem when it happens and send an alert" to "we'll prevent the problem from happening in the first place."

Architecture: Data Contracts + Circuit Breakers in Practice

Let me walk through a concrete data architecture that ties these patterns together. It is heavily inspired by how we operate our lending platform, but adapted for the data-to-model case:

The Data Contract Registry

A centralized service responsible for storing all active data contracts. Each data contract is versioned and associated with a data source and a consumer. The service provides APIs for:

Registering a data contract
Validating data against a data contract
Publishing a data contract violation event

The Quality Gate

A lightweight service (or a sidecar, if you will) that sits between the data source and the model pipeline. For every data batch or stream event received, the quality gate:

Fetches the relevant data contract from the registry
Validates data against schema, semantics, and quality rules
Reports metrics to the circuit breaker

The Circuit Breaker Controller

A stateful component that:

Aggregates quality metrics from the quality gate over a specified window size
Manages the breaker state (closed, open, half-open)
Publishes state change events to a Kafka topic for downstream consumption
Executes fallback strategies when the breaker is opened

The Flow

The architecture is an end-to-end solution that includes data contracts, quality gates, and circuit breakers. The circuit breaker sits between the quality gate and the model pipeline, automatically routing to fallbacks if data quality degrades.

If you are using AWS, which we are, then this architecture fits nicely with existing AWS services.
For example, the quality gate can run as a Lambda function or an ECS task, the contract registry can live in DynamoDB or another AWS-native datastore, the circuit breaker state can be kept in ElastiCache (Redis), and the event bus can be Kafka (or Amazon MSK, the managed AWS variant). We already make significant use of all these tools for our financial platform microservices, so the marginal cost of using them for the data pipeline is negligible. If you are using Kubernetes, the quality gate also works nicely as a sidecar container alongside your model serving pods.

The key architectural concept is the separation of concerns. The data producer is responsible for the data contract, the quality gate is responsible for checking quality, and the circuit breaker is responsible for failing fast. There is no need for a single team to "own" the entire process.

From Chaos Engineering to Data Resilience

When was the last time you intentionally broke your data pipeline to see what would happen?

On our system, we run disaster recovery drills regularly: an orchestrated set of exercises across 100+ components, including APIs, batch jobs, and streaming apps. The team is very good at infrastructure chaos engineering. However, when I asked, "What happens if the credit bureau feed starts sending garbage schema for two hours?" nobody answered, because nobody had ever really tested this scenario.

Most organizations practice chaos engineering on infrastructure, but very few practice data chaos engineering: intentionally introducing data quality errors to see whether their systems correctly detect and respond to them.

Data Chaos Engineering in Practice

Schema injection: Apply a temporary schema modification, for example, by adding a column or changing a data type. Validate that the quality gate detects the modification and that the circuit breaker trips.
Null injection: Increase the proportion of null values for a critical feature beyond the contract threshold. Validate that the breaker trips.
Staleness simulation: Delay data delivery beyond the SLA. Validate that the staleness check fires.
Distribution poisoning: Apply a small perturbation to the distribution of a critical feature. Validate that it is detected.

Figure: The data chaos engineering cycle. Faults are injected to verify that the contracts and breakers function correctly, and the gaps found are fed back into contract and breaker development.

I have found that running these experiments every month, with the same discipline we already apply to the DR drills for our services, instills enormous confidence in the system's ability to look after itself. It also reveals gaps in your data contracts that you might never find by just reviewing your documentation. If you introduce a fault and nothing catches it, your contract is incomplete. We learned that we had three missing contract clauses just by running data chaos experiments for the first month.

The principles of chaos engineering apply directly here. You are not testing whether your system works under perfect conditions; you are testing whether it fails safely under realistic, degraded conditions.

Real-World Scenario: Stopping a Bad Prediction Before It Ships

Consider a financial services company that uses ML models to predict customer behavior for risk analysis. The model uses various data sources as features, including an external third-party data provider for customer risk indicators.

The scenario: A third-party vendor changes their API and doesn't notify anyone. A critical field in the data set now returns numeric data instead of categories. The field previously returned the categories HIGH_RISK, MEDIUM_RISK, LOW_RISK, and MINIMAL_RISK, but now it returns a number between 1 and 100.
The ETL process doesn't fail but falls back to a default mapping, which essentially flattens all the risk into a single category across all customers.

Without a data contract and circuit breaker: The model runs for weeks with corrupted features. Predictions are no longer accurate, but the gradual change is mistaken for market conditions or seasonality. By the time the actual cause is determined, thousands of decisions have been made based on incorrect predictions. Addressing the problem involves several teams working in war rooms over the course of days, analyzing logs and assessing the damage: a considerable waste of engineering time, and possibly of business value.

With a data contract and circuit breaker: The data contract is very specific in that it requires the risk indicator field to contain one of four string values. When the vendor changes the format of the API, the quality gate immediately recognizes that the data fails semantic validation. The circuit breaker trips within minutes. The system falls back to the last verified snapshot of the data and flags all predictions as "Degraded Confidence." An alert is sent to the data engineering team. The schema is fixed within hours, and zero corrupted predictions are ever made.

Speed is a secondary benefit; the actual value is in the prevention of damage (a preventive control rather than a detective one). The circuit breaker kept the bad data out of the model, so no corrupted prediction was ever made.

FAQs

What is the difference between a data contract and a schema registry, e.g., Confluent Schema Registry?

A schema registry verifies structure: field names, data types, and nesting. A data contract extends that with semantic rules (allowed values, definitions), quality rules (nulls, freshness), and SLAs (delivery cadence, availability). In other words, the schema registry covers just one part of the data contract.
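To make the distinction concrete, here is a small Python sketch that reuses the risk indicator field from the scenario above. The function names and the record layout are hypothetical, and the sketch assumes the vendor's new numeric score arrives serialized as a string, which is exactly the case where a structure-only check still passes.

```python
ALLOWED_RISK = {"HIGH_RISK", "MEDIUM_RISK", "LOW_RISK", "MINIMAL_RISK"}


def structural_check(record):
    """Structure only: the field exists and is a string."""
    return isinstance(record.get("risk_indicator"), str)


def semantic_check(record):
    """What the contract adds: the value must be one of the agreed categories."""
    return record.get("risk_indicator") in ALLOWED_RISK


def violation_rate(records, check):
    """Fraction of records that fail the given check."""
    failed = sum(not check(record) for record in records)
    return failed / len(records)


# Feed before and after the vendor's unannounced change (illustrative values).
before = [{"risk_indicator": "HIGH_RISK"}, {"risk_indicator": "LOW_RISK"}]
after = [{"risk_indicator": "87"}, {"risk_indicator": "12"}]

print(violation_rate(after, structural_check))  # structure still looks fine
print(violation_rate(after, semantic_check))    # every record breaks the contract
```

A violation rate like the second one is the kind of metric a quality gate would report to the circuit breaker.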
Won't triggering circuit breakers cause the model to stop working too often?

This is not a fundamental flaw; it's a calibration issue. People often underestimate the amount of variation that is normal in their data. We did. Start with loose thresholds, then tighten them once you know your data's normal behavior. The half-open state helps with recovery. In practice, circuit breakers will not trip often, and when they do, it's likely due to real issues.

Does this apply to real-time streaming data, or is it limited to batch data?

Both. For streaming, the quality gate checks every event or micro-batch, and the circuit breaker aggregates metrics over a time window. For batch, the quality gate checks at the batch level, prior to writing to the feature store. The pattern is agnostic to the delivery mechanism.

What about unstructured data, like text and images?

For unstructured data, the data contract is concerned with other quality aspects, like encoding, language, document size, and metadata. The Circuit Breaker still applies, just to other metrics. For example, in an image processing pipeline, if 90% of the images received are 90% smaller than the average, it could be a sign of corrupted images or of thumbnails being sent instead of full images.

How do I get data producers to adopt contracts?

Start with the highest-value, highest-risk data sources. Present it in the context of reducing their support load. The producer team is interrupted every time a consumer reports a bug caused by a change in the data. I have been in enough cross-team incident reviews to know that these interruptions are not popular. Contracts remove the need for them. Once one producing team has adopted contracts and seen the reduction in downstream incidents, adoption tends to spread naturally. We began with a single data feed and now have contracts in place for our most critical internal data sources.
Conclusion

The data engineering community has spent years developing ever-more sophisticated monitoring, alerting, and observability tools. That's all been good work. But let's be honest: monitoring is fundamentally reactive. Monitoring just lets you know something's gone wrong after the damage is done. You want both monitoring and prevention, but only prevention stops the damage before it happens.

Data contracts and circuit breakers are a fundamental shift in data resiliency. Contracts make the expectations explicit. Circuit breakers enforce those expectations actively, in real time, before the bad data ever gets to the models and agents that rely on it.

When building AI systems that make critical decisions (and increasingly, all of us are doing this), you simply cannot operate on implicit trust between data producers and data consumers. The chasm between "the data exists" and "the data is fit for purpose" is where model reliability goes to die.

The data governance and data quality practices that this community has advocated for over the years are precisely what you need. Taking them to the AI layer is what's next.

Bridge the gap. Write the contract. Wire the breaker. Start with one data source, the one that has burned you before. You know the one. Your models will thank you.

Key Takeaways

The cause of AI system failure is data, not code. The most common cause of production AI system failure is a change in data schema or semantics, which degrades model predictions silently.
Data contracts spell out producer and consumer expectations around schema, semantics, data quality thresholds, and SLAs, turning implicit assumptions into explicit, testable ones.
The Circuit Breaker pattern stops bad data from being fed to a model by automatically halting data flow when data quality thresholds are violated, so that fallbacks can take over.
Data chaos engineering, which intentionally induces data quality failures, gives you confidence that your data contracts and circuit breakers will work when data quality actually fails.
Target high-value, high-risk data sources first. Success in one area can generate enough organizational momentum for wider application.
If you're planning to build an application using GenAI, it might seem like something completely new and complicated, but honestly, it's not very different from building any other application. Just like any project, you still need a clear lifecycle and a proper framework to design, build, and successfully deploy your solution to production.

Think of it like building a house. You don't just start putting bricks together; you first plan, design, gather materials, and then build step by step. GenAI projects work the same way. Without a structured approach, things can quickly become messy and unpredictable.

I'm sharing this not just from theory, but from real-world experience. I've worked on multiple GenAI use cases and taken them all the way to production. Along the way, I've learned what works, what doesn't, and what truly matters. So, let's walk through the lifecycle and framework together in a simple and practical way.

Figure: GenAI UC implementation lifecycle

Requirement Gathering and Use Case Definition

When you begin building a GenAI use case, the first and most important step is understanding what you're actually trying to solve. It's a bit like starting a journey: you need to know your destination before choosing the route. This is where requirement gathering and use case definition come in. At this stage, you sit down and clearly understand the problem, the goal, and what success looks like. Broadly, these requirements fall into two categories.

Functional Requirements

This is all about what the system should do. For example, imagine you want to help a customer support engineer quickly find answers using natural language, or build a system that can analyze customer sentiment from feedback. These define the core purpose of your application. In fact, everything you design later (architecture, prompts, integrations) will revolve around these functional goals.

Non-Functional Requirements

NFRs focus on how the system should behave.
This includes things like security, performance, scalability, and response time. Think of it this way: your system might give the right answer (functional), but if it's slow, insecure, or can't handle users at scale, it won't succeed. These requirements quietly shape your system design and are just as important as the functional ones.

So before jumping into building anything, take the time to clearly define both types of requirements. Doing so sets a strong foundation for everything that follows.

Data Strategy

At first glance, it might feel like GenAI models already know everything; they're pre-trained, after all. But when you start building real-world applications, you quickly realize that data still plays a huge role.

Think of it this way: the model is like a very smart assistant, but it doesn't know your company's knowledge, your documents, or your business context. So, you need to bring that into the picture. This means identifying and organizing your internal data: documents, knowledge bases, FAQs, or any structured information your system should rely on. But it's not just about having data; it needs to be clean, well-structured, and meaningful. If your data is messy or inconsistent, the responses from your GenAI system will reflect that.

At this stage, you also make an important decision: should you fine-tune the model, or use RAG (retrieval-augmented generation) to fetch relevant data at runtime? In most real-world scenarios, RAG is the preferred approach because it allows your system to stay up to date and grounded in actual data.

So even in a GenAI world, data is still the foundation. It's what turns a smart model into a truly useful solution.

Model Evaluation and Selection

When you start working with GenAI, one thing becomes clear very quickly: not all models are the same. They come in different sizes, capabilities, and even modalities (text, image, etc.).
It might be tempting to think one powerful model can handle everything, but in reality, no single model fits all use cases.

Choosing the right model is a careful balance. A larger model might give better accuracy, but it could also increase cost and latency. A smaller model might be faster and cheaper, but may not perform well on complex tasks. So, selecting the right model is not just a technical choice; it directly impacts your application's performance and user experience.

To make this decision, you need to evaluate models against your specific use case. One approach is to build your own evaluation framework using test datasets and measure metrics like accuracy, precision, recall, F1-score, or even BERTScore. This gives you a more hands-on and tailored understanding of how well a model performs for your needs. Another approach is to use the built-in evaluation tools provided by cloud platforms like Amazon Web Services, Microsoft Azure, or Google Cloud Platform, which can speed up the process and provide useful insights.

In some cases, just selecting a pre-trained model isn't enough. You may need to adapt it to better align with your business goals. This can be done using techniques like RAG (retrieval-augmented generation), fine-tuning, or model distillation, depending on your requirements.

So, model selection isn't just about picking what's available. It's about finding the right fit for your use case and shaping it to deliver the best results.

Prompt Engineering

When users interact with a GenAI application, they usually ask questions in simple, natural language, just like they would talk to a person. These queries are often unstructured and don't clearly define intent, tone, or constraints. For example, a user might simply ask: "Why was my card declined?"

Now, if you send this directly to the model, it may give a response, but that response could be inconsistent, vague, or even incorrect. That's because there are no rules or guidance provided to the model.
This is where the idea of a prompt "envelope" comes in. Think of it like wrapping the user's query with instructions that guide the model on how to behave. This envelope adds:

Clear instructions
Tone control
Constraints
Context

So instead of sending just the raw query, you transform it into a structured prompt like this:

Define the role (e.g., banking assistant)
Add rules (don't guess, be concise)
Provide context (card blocked for international usage)
Specify expected output

Now, the same simple question becomes part of a much more controlled and meaningful interaction.

    System: You are a banking support assistant.
    Rules:
    - Do not guess
    - Use only provided context
    - Be polite and concise
    User Query: Why was my card declined?
    Context:
    - Card blocked for international usage
    Output: Provide reason and steps to resolve.

The best part? You don't have to build this every time. These structured prompts can be saved as centralized prompt templates. Then, based on the use case, your GenAI application can:

Load the right template
Inject the user's query dynamically
Send the final prompt to the model

This approach ensures your responses are not just intelligent but also consistent, reliable, and aligned with your business needs.

Architecture Design

Once you have your use case, data, and model in place, the next step is to figure out how everything comes together, and that's where architecture plays a key role.

Think of architecture like the blueprint of a house. You may have the best materials, but if they're not put together properly, the structure won't be stable. In the same way, a GenAI application needs a well-thought-out design to ensure all components work smoothly and reliably.

At this stage, you define how different pieces (data sources, prompt templates, orchestration logic, and models) connect and interact with each other.
This is important because even powerful models, such as those from OpenAI, can give poor or inconsistent results if the data flow is not clear, the context is missing, or proper guardrails are not in place.

A strong architecture ensures that your system integrates well with enterprise applications, applies the necessary validations, and includes monitoring for better control. It also helps you design for scalability, performance, and cost optimization, so your solution can handle real-world usage.

In simple terms, good architecture is what turns your GenAI idea into a stable, secure, and production-ready system, rather than just a working prototype.

Figure: High-level architecture

Integration and Development

The integration and development stage is where your GenAI idea truly comes to life as a working application. This is the practical, hands-on phase where you start connecting systems, building logic, and enabling the AI to function inside real business workflows. In enterprise environments, this is usually done using integration platforms along with large language models to orchestrate everything end to end.

At this stage, you build complete workflows that include API integrations, prompt generation logic, data retrieval pipelines, and safety mechanisms like content filtering and prompt validation. This is also where you ensure that your GenAI system is not working in isolation but is properly connected to enterprise systems. LLMs by themselves do not understand your business data; you must fetch and supply it through integrations.

A key part of this phase is building a prompt pipeline, which prepares a structured prompt before sending it to the model.
At a high level, this pipeline works in steps:

Step 1: Load the prompt template
Step 2: Inject the user query
Step 3: Inject relevant context data
Step 4: Add constraints and instructions
Step 5: Send the final prompt to the LLM

Along with this, you also build a RAG (Retrieval-Augmented Generation) pipeline, which ensures your model gets the right knowledge at runtime. RAG typically has two parts: an ingestion process and a retrieval process. In the ingestion phase, documents are converted into embeddings and stored in a vector database. In the retrieval phase, the system searches for relevant information and injects it into the prompt before sending it to the model. At a high level, the RAG pipeline works like this:

- Convert documents into embeddings
- Store them in a vector database
- At runtime:
  - Search for relevant data
  - Inject it into the prompt

Together, these components ensure that your GenAI application is not just intelligent but also context-aware, reliable, and ready for real-world use.

Security Governance and Guardrails

At this stage, you focus on building and integrating security and guardrails into your GenAI application to ensure it behaves safely and responsibly in real-world usage. This can be done either by implementing your own controls or by integrating third-party safety solutions that continuously monitor both inputs and outputs and take action whenever something risky is detected. The main goal here is to make sure the system does not generate harmful, sensitive, or unintended responses while still maintaining usefulness and accuracy.
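As a rough illustration of "implementing your own controls," the sketch below wraps a model call with one input check and one output clean-up step. Everything in it is an assumption made for illustration: the blocked phrases, the e-mail pattern, and the `fake_model` stand-in are hypothetical, and a production system would rely on a maintained policy or a dedicated safety service rather than a hard-coded list.

```python
import re
from typing import Callable

# Simplified e-mail pattern; real PII detection covers far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Hypothetical phrases, used only to show where an input check would sit.
BLOCKED_PHRASES = ("ignore previous instructions", "reveal the system prompt")


def mask_pii(text: str) -> str:
    """Replace e-mail addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def guarded_call(user_input: str, model: Callable[[str], str]) -> str:
    """Check the input, call the model, then clean the output."""
    lowered = user_input.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return "Request blocked by input guardrail."
    raw_output = model(mask_pii(user_input))
    return mask_pii(raw_output)


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; it echoes the prompt and leaks an address.
    return f"Echo: {prompt} (contact admin@example.com)"


print(guarded_call("My mail is jo@bank.com", fake_model))
print(guarded_call("Please ignore previous instructions", fake_model))
```

The point of the wrapper shape is that the model is never called directly: both the text going in and the text coming out pass through the same checkpoint.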
At a high level, this layer takes care of key safety responsibilities such as:

- Content moderation: detecting and filtering unsafe, toxic, or inappropriate content
- PII masking: identifying and hiding sensitive personal information like names, account numbers, or emails
- Output validation: ensuring responses follow the expected format, structure, and business rules
- Prompt injection protection: preventing malicious user inputs from overriding system instructions

In simple terms, this stage acts as a protective shield around your GenAI system, ensuring that every interaction stays secure, compliant, and aligned with business and ethical standards.

Testing and Evaluation

Testing and evaluation in GenAI projects are essential to make sure your system delivers accurate, reliable, safe, and consistent outputs both before and after deployment. Unlike traditional software testing, where outputs are fixed and predictable, GenAI testing deals with probabilistic responses, meaning the same input can produce different outputs. Because of this, the focus is not just on exact results, but on quality, behavior, and robustness of the system. In practice, we perform different types of testing to validate the overall performance of the GenAI application. Some of the key ones are:

Functional Testing

This ensures the system is performing the intended task correctly. For example, if the prompt is “Summarize this document”, the output should be a proper summary and not a detailed explanation or translation. It validates whether the core functionality is working as expected.

Prompt Testing

This focuses on checking whether prompts are guiding the model properly. You verify if instructions are being followed, output formats are correct, and the tone and behavior of the response remain consistent across different inputs.

Output Quality Testing

This testing evaluates how good the response actually is.
It checks whether the output is relevant, accurate, complete, and clearly written, ensuring it meets user expectations. Safety and Compliance Testing This ensures the system behaves in a responsible and secure manner. You check that there is no PII leakage, no biased responses, and no toxic or harmful content being generated. Performance Testing This validates whether the system meets performance requirements such as response time and throughput. It ensures the application is fast and scalable enough for real-world usage. Cost Testing This is a critical aspect in GenAI systems. You ensure that token usage and cost metrics are properly captured for every request and response, so the system remains financially optimized and transparent. Overall, this stage ensures your GenAI application is not only functional but also reliable, safe, and production-ready. Deployment Deployment in a GenAI project is the stage where your solution moves from testing into a live production environment, where real users can interact with it. Unlike traditional software deployment, GenAI deployment is not just about hosting an application — it also involves managing models, prompts, data pipelines, integrations, and guardrails in a coordinated, reliable, and scalable way. In real-world enterprise systems, you will notice that different components are deployed as independent units, and careful cutover planning becomes very important. For example, the model layer deployment is handled separately, the integration or orchestration layer is deployed as another unit, and the consumer application, like a chatbot or web interface, is deployed independently. Similarly, the RAG pipeline, including embedding generation and vector database setup, can also be deployed and scaled as a separate component. 
Beyond just deployment, this stage also includes critical activities such as environment configuration (dev, test, prod), CI/CD automation, versioning of prompts and models, and rollback strategies in case something goes wrong. You also need to ensure that monitoring and logging are enabled from day one so that system behavior, cost, latency, and errors can be tracked in real time. In addition, proper cutover planning is essential to ensure a smooth transition from old systems to GenAI-powered systems. This includes phased rollouts, canary releases, and fallback mechanisms to avoid business disruption. In simple terms, GenAI deployment is not a single release — it is a coordinated rollout of multiple interconnected components, each working together to deliver a stable, scalable, and production-ready AI solution. Monitoring and Observability Monitoring and observability in a GenAI project ensure that your system remains reliable, accurate, cost-efficient, and safe once it is live in production. Since GenAI outputs are probabilistic in nature, unlike traditional APIs that produce fixed responses, you need much deeper visibility into what the model is doing, why it is behaving in a certain way, and how its responses are impacting users and business outcomes. Monitoring focuses on continuously tracking the health and performance of the system by collecting logs, metrics, and alerts. It helps you understand what is happening in the system and notifies you when something goes wrong, such as increased latency, failures, or unusual cost spikes. On the other hand, observability goes a step further by using detailed telemetry data—such as logs, metrics, and traces—to help you understand why a particular issue is happening, especially in complex and distributed GenAI architectures. In simple terms, monitoring tells you that there is a problem, while observability helps you diagnose the root cause. 
In GenAI systems, it is important to continuously log and track both output quality metrics and performance metrics. These include:

- Relevance of responses
- Accuracy of generated output
- Hallucination rate (incorrect or made-up responses)
- Format compliance (e.g., JSON or structured output adherence)
- Latency (response time)
- Throughput (requests per second)
- Timeout and failure rates

To effectively manage this, organizations use external observability and monitoring tools such as Datadog, ELK Stack, New Relic, Dynatrace, and Splunk, which provide powerful capabilities for visualization, analytics, alerting, and real-time system monitoring. Overall, this stage ensures that your GenAI application is not just working, but is continuously measurable, diagnosable, and optimizable in a live production environment.

Continuous Improvement

Continuous improvement in a GenAI project means regularly enhancing the system even after it is deployed so that it stays accurate, reliable, and useful for business needs. Unlike traditional applications, GenAI systems need to keep evolving based on how real users interact with them. Because of this, teams continuously review prompts, responses, and performance data to find problems like incorrect answers, bias, hallucinations, or inconsistent outputs. Improvements are then made step by step: by refining prompt templates, improving or updating data sources (especially in RAG systems), adjusting model settings, or sometimes upgrading to better models from providers like OpenAI. Integration and orchestration platforms also play an important role because they allow changes in workflows, APIs, and data connections without breaking the entire system. In simple terms, continuous improvement is a loop of monitoring, feedback, testing, and fixing, which ensures the GenAI application keeps getting better over time and continues to meet changing user expectations and business goals.
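To make the review loop concrete, a few of the metrics listed in the monitoring stage (failure rate, format compliance, latency) can be computed directly from request logs. The sketch below is a minimal illustration under stated assumptions: the record fields (`latency_ms`, `output`, `failed`) and the sample values are hypothetical, and "format compliance" is reduced to "the output parses as JSON."

```python
import json
from statistics import median

# Hypothetical log records; the field names are assumptions for this sketch.
records = [
    {"latency_ms": 420, "output": '{"answer": "Card blocked abroad"}', "failed": False},
    {"latency_ms": 380, "output": "Sorry, something went wrong", "failed": False},
    {"latency_ms": 2900, "output": "", "failed": True},
    {"latency_ms": 510, "output": '{"answer": "Limit exceeded"}', "failed": False},
]


def is_valid_json(text: str) -> bool:
    """Format-compliance check: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def summarize(rows):
    """Aggregate failure rate, format compliance, and latency over the rows."""
    completed = [r for r in rows if not r["failed"]]
    compliant = sum(is_valid_json(r["output"]) for r in completed)
    latencies = [r["latency_ms"] for r in rows]
    return {
        "failure_rate": (len(rows) - len(completed)) / len(rows),
        "format_compliance": compliant / len(completed) if completed else 0.0,
        "median_latency_ms": median(latencies),
        "max_latency_ms": max(latencies),
    }


print(summarize(records))
```

Quality metrics such as relevance or hallucination rate need human review or a separate evaluation step; they are deliberately left out here.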
Conclusion To wrap it all up, think of a GenAI project as building and growing a living system — not a one-time software build. It all starts with identifying the right problem to solve, then slowly shaping it through data, models, prompts, design, development, testing, and monitoring. Each stage plays its own important role in making sure the system is not just intelligent, but also reliable, safe, and scalable in real-world use. Unlike traditional software projects that are built once and deployed, GenAI solutions are more like a journey. They keep evolving — prompts are refined, data is improved, workflows are optimized, and outputs are continuously evaluated to match user expectations. This ongoing refinement is what keeps the system relevant and effective over time. With the right orchestration platforms and powerful models from providers like OpenAI, organizations can build intelligent solutions that fit smoothly into their enterprise ecosystem and adapt as business needs change. Note to Readers If you found this article helpful, feel free to drop a comment and let me know! I’d be happy to go deeper into each phase and explain it with real-world examples in a simple and practical way.
Every cache miss is a small but persistent cost on your system. Individually, a single miss may seem insignificant. At scale (thousands or millions of requests), these misses accumulate into measurable latency, increased database load, and degraded user experience. Most systems do not slow down because of one expensive query. They degrade over time due to repeated inefficiencies. Cache misses are one of the most common and overlooked contributors to this pattern. In modern distributed systems, where services depend on multiple layers of infrastructure, even small inefficiencies propagate quickly. What starts as a single cache miss can ripple through downstream systems, amplifying latency and increasing resource consumption.

The Hidden Cost of Cache Misses

A cache miss is not simply a delay. It triggers a sequence of additional work:

- A database query or downstream API call
- Network latency and serialization overhead
- Increased CPU and I/O utilization
- Additional pressure on shared infrastructure

When this pattern repeats under load, even a small drop in cache efficiency can lead to:

- Increased response times
- Higher infrastructure cost
- Elevated load on backend systems
- Greater likelihood of cascading failures

These effects compound quickly in distributed systems, where dependencies amplify delays across services. In high-throughput environments, this can result in bottlenecks that are difficult to diagnose because the root cause appears trivial at an individual request level.

What Is Cache Hit Ratio (and Why It Matters)

Cache hit ratio measures how often requests are served from cache instead of reaching the primary data store.

    Cache Hit Ratio (%) = (Cache Hits / Total Lookups) × 100

While it appears to be a simple metric, it reflects how effectively a system avoids unnecessary work.
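The formula translates directly into code. The sketch below computes the ratio and, as a companion, the number of lookups that fall through to the backend at a given ratio. The function names and the sample traffic figures are illustrative assumptions, not measurements from a real system.

```python
def cache_hit_ratio(hits: int, lookups: int) -> float:
    """Cache Hit Ratio (%) = (hits / lookups) * 100; 0.0 when there are no lookups."""
    if lookups == 0:
        return 0.0
    return 100 * hits / lookups


def backend_calls(lookups: int, hit_ratio_pct: float) -> int:
    """Lookups that miss the cache and therefore reach the backend."""
    return round(lookups * (100 - hit_ratio_pct) / 100)


# 9 hits out of 10 lookups
print(cache_hit_ratio(9, 10))  # 90.0

# Extra backend calls caused by a one-point drop (95% -> 94%) over 1,000,000 lookups
print(backend_calls(1_000_000, 94) - backend_calls(1_000_000, 95))  # 10000
```

Viewed this way, the ratio is less interesting than its complement: the miss percentage multiplied by traffic is the work the backend actually has to absorb.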
A higher hit ratio typically results in:

- Lower latency
- Reduced backend load
- Improved scalability

A lower hit ratio indicates that the system is repeatedly performing operations that could have been avoided. Over time, this inefficiency translates into increased operational cost and reduced system reliability.

Architecture Overview

The diagram below compares a cache hit and a cache miss flow.

- The cache hit path (green) represents a short execution path where data is served directly from cache.
- The cache miss path (red) illustrates a longer path involving database queries, increased latency, and additional system load.

This comparison highlights a fundamental principle: not all requests carry the same cost. Some require significantly more resources than others.

Figure: Cache hit vs. cache miss flow, illustrating how cache misses introduce latency, backend load, and system cost at scale.

A Simple Example

Consider a service receiving 10 requests:

- The first request results in a cache miss, queries the database, and stores the result
- The next 9 requests are served from cache

This results in 9 cache hits out of 10, a 90% cache hit ratio. At a small scale, this level of efficiency appears acceptable. However, even in this simple scenario, the first request is significantly more expensive than the rest, demonstrating how cache misses introduce uneven cost distribution.

What Happens at Scale

The impact becomes more visible as traffic increases.

- At a small scale: a cache miss introduces a minor delay
- At a large scale: a 1% drop in cache hit ratio can result in thousands or millions of additional backend calls

This leads to:

- Increased latency across requests
- Higher load on databases and services
- Greater risk of timeouts and failures

In distributed architectures, this can trigger cascading effects, where delays propagate across multiple services and amplify system instability.
Systems that perform well under normal conditions may degrade rapidly during traffic spikes due to inefficient caching strategies.

Trade-Offs: Performance vs Freshness

Caching introduces an inherent trade-off between performance and data consistency. Serving data from cache improves latency and reduces backend load, but it also introduces the possibility of stale data. Key considerations include:

- Strong consistency ensures data accuracy, but increases latency
- Eventual consistency improves performance but requires tolerance for temporary staleness

Techniques such as cache invalidation, write-through caching, and event-driven updates can help manage this balance effectively. The right approach depends on business requirements and tolerance for data freshness.

Implementation Considerations

Effective caching requires more than introducing a cache layer.

Cache warming is essential during deployments or cold starts. Without it, systems experience an initial surge in cache misses that can overwhelm backend dependencies.

Time-to-live (TTL) tuning must be handled carefully:

- Short TTL values lead to frequent expirations and increased misses
- Long TTL values risk serving stale data

Cache key design plays a critical role. Poorly structured or inconsistent keys lead to cache fragmentation, reducing overall effectiveness.

Failure handling must also be considered. Systems should handle cache failures gracefully without triggering retry storms or excessive backend load.

Real-World Impact

In production environments, cache inefficiencies often manifest as:

- Spikes in database CPU usage
- Increased API latency during peak traffic
- Unexpected infrastructure scaling
- Performance degradation after deployments

Organizations often scale infrastructure to address these issues. However, in many cases, the underlying problem is inefficient caching rather than insufficient capacity. Improving cache efficiency is one of the most cost-effective ways to enhance system performance and stability.
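Two of the implementation considerations above, TTL expiry and consistent key design, can be shown in a small cache-aside helper. This is a minimal in-process sketch, not a substitute for Redis or Memcached: the `TTLCache` class, the `user_key` helper, and the injectable clock are all hypothetical names introduced for illustration, and failure handling and cache warming are left out.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny cache-aside helper with per-entry expiry and hit/miss counters."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: do the expensive work once and store the result.
        self.misses += 1
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


def user_key(user_id: int) -> str:
    # One canonical key format, so the same user never maps to two entries.
    return f"user:{int(user_id)}"


cache = TTLCache(ttl_seconds=60)
cache.get_or_load(user_key(7), lambda: {"id": 7})  # miss, loads from the "backend"
cache.get_or_load(user_key(7), lambda: {"id": 7})  # hit, served from cache
print(cache.hits, cache.misses)
```

Keeping the hit and miss counters on the cache itself makes the hit ratio available for monitoring without extra instrumentation.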
What Is a Good Cache Hit Ratio?

There is no universal threshold, but general benchmarks include:

- Database query caches: 85–95%
- API response caches: 95–99%
- Content delivery networks: 99%+

The objective is not to achieve perfection, but to minimize avoidable backend operations.

How to Reduce the Cache Miss Tax

1. Preload Frequently Accessed Data
Warm caches during deployments to reduce cold-start impact.

2. Tune TTL Carefully
Balance expiration timing with data freshness requirements.

3. Use Predictable Cache Keys
Ensure consistency and avoid unnecessary misses.

4. Monitor Continuously
Track cache hit ratio alongside latency, backend load, and error rates.

Conclusion

A high cache hit ratio improves performance, but it should not come at the cost of serving outdated data. The goal is not to cache everything, but to cache strategically based on access patterns and system requirements. Every cache miss represents additional work performed by the system. At scale, these small costs accumulate into measurable performance degradation. Reducing cache misses is not only an optimization; it is a foundational requirement for building scalable, efficient systems.
Modern microservice architectures consist of many independently deployable services, which brings new security challenges. One crucial best practice is to use an API Gateway as a centralized entry point to enforce security policies. In this article, we explore how to implement a secure API gateway in a microservices environment and demonstrate authentication configuration with code examples.

Why Use an API Gateway for Microservices Security

In a microservices architecture, each service exposes its own REST endpoints. Without a gateway, clients would have to authenticate individually with each service, a complex and error-prone approach. An API Gateway acts as a single entry point for all client requests, simplifying communication and centralizing cross-cutting concerns like security. As Chris Richardson notes, the API Gateway is also an ideal place to implement cross-cutting concerns, such as authentication. By routing all external traffic through a gateway, you can offload tasks like authentication, authorization, encryption, and rate limiting to this layer. Some key benefits of using an API gateway for security include:

- Unified access control: The gateway enforces robust access controls in one place. This avoids duplicating auth logic in every microservice and ensures consistent policies.
- Isolation of internal services: Microservices are not exposed directly to clients. The gateway shields internal APIs, preventing unauthorized access and reducing the attack surface. Backend services can trust that requests have passed through security checks.
- Monitoring and logging: As the single entry point, the gateway can log requests and monitor traffic for security analytics.
- Other edge functions: API Gateways often handle routing to the correct service, load balancing, input validation, and rate limiting to mitigate DDoS attacks. Centralizing these functions improves efficiency and maintainability.
In summary, a gateway in front of your microservices allows you to apply consistent security measures across all services and simplify each service’s implementation. The microservices can remain focused on business logic while the gateway manages authentication and other front-line defenses.

Open-Source API Gateway Solutions

There are numerous technologies available to implement the API Gateway pattern. Since we are focusing on open-source solutions, here are a few popular choices:

- Kong Gateway: An open-source, high-performance API gateway built on NGINX. Kong supports flexible routing rules and a rich plugin ecosystem for authentication, rate limiting, transformations, and more.
- Envoy proxy: A modern L7 proxy often used in service meshes. Envoy can serve as an edge gateway with filters for features like JWT verification.
- Traefik: A cloud-native edge router written in Go. Traefik integrates well with orchestrators and provides middleware for things like basic authentication, OAuth/OIDC via forward-auth, and automatic TLS. It’s often used for its easy dynamic configuration and Let’s Encrypt integration.
- Others: There are open-source API management solutions like Tyk, Gravitee, WSO2 API Manager, KrakenD, etc. Each offers varying features, but at their core, they all provide gateway functionality to route and secure microservice APIs.

Each of these solutions can secure your microservices, but their configuration and feature sets differ. In the next section, we’ll focus on implementing authentication using Kong API Gateway as an example. Kong is a popular choice due to its lightweight nature and plugin flexibility, but the concepts will be similar for other gateways.

Implementing Authentication in an API Gateway

To illustrate how to implement a secure API gateway, we’ll walk through setting up Kong Gateway in front of a microservice and enabling authentication. The goal is to require a valid JSON Web Token (JWT) for clients calling the microservice via the gateway.

1. Defining Services and Routes in Kong

First, we need to configure Kong with a Service and a Route for our microservice. In Kong’s terminology, a Service object represents an upstream microservice, and a Route defines how requests from clients map to that service. Kong will listen for client requests on the route and forward them to the specified service. For example, we create a declarative configuration file for Kong in DB-less mode. Below is a snippet of kong.yml defining a service and route, and then attaching the JWT authentication plugin to that service:

    _format_version: "2.1"
    services:
      - name: my-api-service
        url: http://localhost:3000   # upstream microservice URL
    routes:
      - name: api-route
        service: my-api-service
        paths:
          - /api   # clients will call /api on the gateway
    plugins:
      - name: jwt
        service: my-api-service
        enabled: true
        config:
          key_claim_name: kid   # use 'kid' field in JWT header to identify the key
          claims_to_verify:
            - exp   # ensure the 'exp' (expiry) claim is valid

In this config, we map my-api-service to the upstream at localhost:3000, and we route any requests with path /api on the gateway to that service. The jwt plugin is enabled on the service, which means Kong will require a valid JWT on all requests to this service. We configured the plugin to check the token’s exp claim and to expect a kid claim in the JWT header to identify the signing key.

After loading this config and starting Kong, any request to http://<gateway>:8000/api will be intercepted by Kong. Since the JWT plugin is active, Kong will attempt to find a JWT in the request and validate it. If no token is present or the token is invalid, Kong will respond with 401 Unauthorized. At this point, our microservice is protected: only authenticated calls are forwarded.

2. Configuring Credentials (JWT Issuers and Secrets)

Now that Kong is blocking unauthorized requests, we need to configure who is considered an authorized consumer and what credentials (JWTs) are accepted.
In a real system, you would likely integrate with an Identity Provider or authorization server that issues JWTs. Kong uses the concept of Consumers to represent clients or user identities that consume your APIs. Each consumer can have credentials associated with it. We will add a consumer entry and a JWT secret for that consumer in our kong.yml:

    consumers:
      - username: auth-service
    jwt_secrets:
      - consumer: auth-service
        key: my-issuer-key-123   # 'kid' value expected in the JWT header
        secret: my-jwt-signing-secret

In this snippet, we added a consumer named auth-service. We then added a JWT credential for that consumer with a secret and a key. The key is a unique identifier and the secret is the HMAC secret that this consumer will use to sign JWTs. Essentially, we are telling the gateway to accept JWTs signed with my-jwt-signing-secret, as long as they carry kid: my-issuer-key-123 in their header, and treat them as coming from the auth-service consumer.

With this configuration, the JWT plugin knows how to verify incoming tokens: it will look at the kid claim in the JWT header to find the matching consumer’s secret, then verify the token’s signature using that secret. It also checks the exp claim to ensure the token has not expired.

3. Testing the Secured Gateway

Now the secure gateway is configured. Let’s quickly illustrate the behavior with example requests:

    # Attempt to call the service without any token
    $ curl -i http://localhost:8000/api
    HTTP/1.1 401 Unauthorized
    ...

    # Now, obtain or craft a JWT signed with 'my-jwt-signing-secret' and kid 'my-issuer-key-123'.
    # For demonstration, we can use an online tool or a JWT library to create a token:
    # Header: { "alg": "HS256", "typ": "JWT", "kid": "my-issuer-key-123" }
    # Payload: { "sub": "user123", "exp": <some future timestamp>, ... }
    # Sign it with the secret.

    # Call the API with the JWT in the Authorization header
    $ curl -i -H "Authorization: Bearer <your_jwt_token_here>" http://localhost:8000/api
    HTTP/1.1 200 OK
    Hello world!

As shown, without a valid JWT, the gateway returns a 401 Unauthorized response. With a valid JWT, Kong will authenticate the request and route it to the upstream service, which returns the expected data (HTTP 200 OK). The microservice itself did not need to implement any auth checks; the gateway handled it.

Conclusion

Implementing a secure API gateway for microservices involves setting up a robust gateway solution and offloading security concerns to it. We demonstrated how to use Kong to enforce JWT authentication in front of a microservice. The gateway approach streamlines authentication across all services. Once a request is verified at the edge, microservices can rely on that identity; in a zero-trust or defense-in-depth setup, they can additionally re-validate the token themselves. Open-source API gateways like Kong, Envoy, and Traefik provide the building blocks to authenticate and authorize traffic, handle encryption, and apply policies uniformly. By centralizing these concerns, engineering teams can avoid duplicating security code across microservices and instead manage it in one place. As a result, the overall system becomes easier to secure and maintain.

For advanced scenarios, from integrating with enterprise SSO/IdPs to implementing multi-tenant auth or fine-grained access control, the gateway can be extended with plugins or external auth services. The key is to establish the gateway as the trust barrier between clients and your microservices. With a secure API gateway in place, a microservices architecture can achieve both agility and strong security compliance.
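For the test in step 3, the token does not have to come from an online tool. The sketch below builds an HS256 JWT with only the Python standard library, using the key and secret values from the kong.yml example. The helper names (`b64url`, `make_hs256_jwt`) and the five-minute lifetime are assumptions made for illustration; in practice a maintained JWT library or your identity provider would issue the token.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_hs256_jwt(secret: str, kid: str, subject: str, ttl_seconds: int = 300) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    payload = {"sub": subject, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"


# Values taken from the kong.yml consumer example in this article.
token = make_hs256_jwt("my-jwt-signing-secret", "my-issuer-key-123", "user123")
print(token)
```

The printed value can be pasted into the `Authorization: Bearer` header of the curl call shown earlier.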
The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. 
Update the Java version: rebuild and test every service. Change your message broker from RabbitMQ to Kafka: modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool: update dependencies in every microservice. Switch cloud providers: rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level.

This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them.

Framework lock-in worsens this. It isn’t a vendor problem; it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production, not during development. This is an architecture problem. Architectural problems have architectural solutions.

Aether: The Core Idea

What you write: an interface annotated with @Slice, plus business logic implementation.

    @Slice
    public interface OrderService {
        Promise<OrderResult> placeOrder(PlaceOrderRequest request);

        static OrderService orderService(InventoryService inventory, PricingEngine pricing) {
            return request -> inventory.check(request.items())
                                       .flatMap(available -> pricing.calculate(available))
                                       .map(priced -> OrderResult.placed(priced));
        }
    }

What you don’t write: everything else. No HTTP clients: inter-slice calls are direct method invocations via generated proxies.
No service discovery: the runtime tracks where every slice instance lives. No retry logic: built-in retry with exponential backoff and node failover. No circuit breakers: the reliability fabric handles failure automatically. No serialization code: request/response types are serialized transparently. A method call via an imported interface is the only visible contract.

The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation; it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly.

Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level.

The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties (performance, latency, memory consumption), but as long as they’re compatible with the interface, business logic works unchanged. You write essentially pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently.

Under The Hood: What Makes It Work

Five architectural decisions make this possible.

Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm published in 2021.
Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide. Built-in Artifact Repository. DHT-based storage with configurable replication — 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64 KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained. ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices. Declarative Deployment. Blueprints — TOML files — describe the desired state: which slices, how many instances.

    id = "org.example:commerce:1.0.0"

    [[slices]]
    artifact = "org.example:inventory-service:1.0.0"
    instances = 3

    [[slices]]
    artifact = "org.example:order-processor:1.0.0"
    instances = 5

Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically. Infrastructure Independence. Aether nodes are identical — there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java — roll it out across nodes without touching applications. Update the Aether runtime — same.
Update business logic — deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule.
Fault Tolerance: The 50% Rule
The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact — actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires floor(N/2) + 1 nodes — as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically — the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence — from failure detection through state restoration to serving traffic — completes without human intervention. When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required. Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing:

    aether update start org.example:order-processor 2.0.0 -n 3
    aether update routing <id> -r 1:3   # 25% to v2, 75% to v1
    aether update routing <id> -r 1:1   # 50/50
    aether update complete <id>         # 100% to v2, drain v1

Deploy during business hours. Shift traffic gradually — 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step.
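The arithmetic behind the 50% rule and the routing ratios can be checked with a short sketch. This is plain Python, not part of Aether: the function names are hypothetical, and reading `-r 1:3` as new:old weights is an assumption taken from the comments on the commands above.

```python
from fractions import Fraction


def quorum(nodes: int) -> int:
    """Majority quorum: floor(N / 2) + 1 nodes must be alive."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a quorum still remains."""
    return nodes - quorum(nodes)


def traffic_split(ratio: str) -> tuple[Fraction, Fraction]:
    """Turn a 'new:old' weight pair such as '1:3' into traffic fractions.

    Assumption: the first weight belongs to the new version, matching the
    '25% to v2, 75% to v1' comment in the article's example.
    """
    new_weight, old_weight = (int(part) for part in ratio.split(":"))
    total = new_weight + old_weight
    return Fraction(new_weight, total), Fraction(old_weight, total)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, "nodes: quorum", quorum(n), "tolerates", tolerated_failures(n))
    for ratio in ("1:3", "1:1"):
        new, old = traffic_split(ratio)
        print(ratio, "->", f"{float(new):.0%} new,", f"{float(old):.0%} old")
```

Under the same assumption, a 10% canary would correspond to a ratio of 1:9.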
If health degrades — error rate exceeds thresholds, latency spikes — instant rollback with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry. For Every Project: Legacy, Greenfield, And Everything Between Legacy Migration Your legacy Java system doesn’t need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system — something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation: Java private Promise<Report> generateReport(ReportRequest request) { return Promise.lift(() -> legacyReportService.generate(request)); } One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk — the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change — and that's when the 50% rule starts to apply. From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious. No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail — from initial wrapping through gradual peeling to clean JBCT code. 
Greenfield Development
For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method — and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service. You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices — each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default. JBCT patterns — Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects — compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them.
The Spectrum
Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient. The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility.
Scaling: Two Levels, Three Tiers of Intelligence
Two-Level Horizontal Scaling
Aether scales in two dimensions independently:
- Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded, so scaling takes milliseconds, not seconds.
- Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work.
Independent controls, combined effect.
Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. 
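How the first two tiers interact can be sketched in a few lines. This is an illustrative Python model, not Aether's controller: the class and function names are hypothetical, the latency limit is an invented placeholder, and the 20% reduction is interpreted as relative (70% becomes 56%), which the article does not specify. Cooldown periods and the "sustained" window for scale-down are left out to keep the sketch short.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metrics:
    cpu: float             # CPU utilization as a fraction, 0.0 to 1.0
    p95_latency_ms: float  # observed P95 latency
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    instances: int         # currently running instances of the slice


@dataclass(frozen=True)
class Thresholds:
    cpu_up: float = 0.70        # scale up above this CPU level
    cpu_down: float = 0.30      # scale down below this CPU level
    latency_ms: float = 500.0   # hypothetical P95 limit, not from the article
    error_rate: float = 0.01    # scale up above a 1% error rate
    min_instances: int = 1


def adjust_for_forecast(base: Thresholds, load_increase_predicted: bool) -> Thresholds:
    """Tier 2: lower the scale-up CPU threshold by 20% when a load rise is forecast."""
    if not load_increase_predicted:
        return base
    return replace(base, cpu_up=base.cpu_up * 0.8)


def decide(m: Metrics, t: Thresholds) -> int:
    """Tier 1: return +1 to add an instance, -1 to remove one, 0 to hold."""
    if m.cpu > t.cpu_up or m.p95_latency_ms > t.latency_ms or m.error_rate > t.error_rate:
        return 1
    if m.cpu < t.cpu_down and m.instances > t.min_instances:
        return -1
    return 0


if __name__ == "__main__":
    current = Metrics(cpu=0.60, p95_latency_ms=100.0, error_rate=0.0, instances=3)
    print(decide(current, Thresholds()))                             # default thresholds
    print(decide(current, adjust_for_forecast(Thresholds(), True)))  # forecast applied
```

Because the forecast only changes the thresholds passed to `decide`, a failing predictor can simply fall back to the defaults, which mirrors the principle that the cluster always survives on Tier 1 alone.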
You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. 
Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. 
Or start fresh with maximum clarity — lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist — legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought — it's the foundation. Scaling is not your problem — it's the environment’s. Infrastructure is not your code — it's the runtime’s. The heavy winter coat comes off. The application breathes.
Resources
- Pragmatica Aether: project site
- GitHub Repository: source code
Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery. The role of the enterprise developer has become more complex over time as organizations adopt new technologies and tools, often without retiring their old ones. Add high staff turnover and increasing time and cost pressure, and developers are confronted with charting their own path through the SDLC. The purpose of internal developer platforms (IDPs) is to create a win-win scenario that benefits developers and their organizations. In this tutorial, you’ll define one golden path for a backend service that covers service setup, deployment, observability, and guardrails end to end. Step 1: Define the Platform Product and First Golden Path Successful IDP efforts focus on end-to-end developer workflows: building a new interface, deploying an updated microservice, running a regression suite, or standing up an environment. Ideally, the whole workflow can be supported directly from your IDP as self-service. Once you have identified the workflow to support, you need to design the “golden path,” which parts you will standardize and what you expose as configuration. It’s important to get that balance right. Components that have to change often, like service accounts, interfaces, and sizing, should be configurable. Creating templates and patterns helps reduce variability between outputs, making it easier to roll out necessary patching and updates. For the first golden path, pick one high-value workflow that is common, repeatable, and easy to measure. We will use the deployment of our backend service to an integration test environment because it touches build, deployment, validation, and evidence capture in one flow. User adoption is the key to success. 
To measure success, track both user adoption, such as how often a workflow is triggered, and outcome metrics like the number of compliant application instances, percentage of deployment failures, and average deployment duration.
Step 2: Design the Golden Path (Templates and Defaults)
Next, we get to design the golden path. An important factor for the developer experience is to provide documentation with contextual guidance. This can be traditional how-to guides or more advanced features such as AI-enabled chatbots. The documentation should explain how testing, application deployments, and other lifecycle activities happen along the golden path, and provide architectural guidance on embedding any newly developed capability in the existing architecture. Standards and governance are other aspects that should be available for self-service, including naming conventions, common libraries, and reusable services. On the technical side, the golden path should cover at least the following:
- Code repo and standard branching structure
- Skeleton code based on coding standards (e.g., environment config file, logging framework, data layer)
- CI/CD pipeline into an ephemeral cloud environment, or pointed at a standard persistent dev environment
- Skeleton quality gates in the CI/CD pipeline (e.g., unit test, functional regression, security scan)
- Access to common utilities; injection of environment values (e.g., URLs, IP addresses, access and secrets management)
- Ability to spin up the environment (if cloud based)
And lastly, the IDP needs to be designed with intuitive naming, a search function, tagging methods, and a hierarchical browsing structure so users can easily find the appropriate golden path. Supporting multiple ways of discovery provides a more resilient interface and eases the adoption of new golden path templates as they become available. For our backend service, selecting the workflow in the IDP should display the steps it includes.
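The split between what the golden path standardizes and what it exposes as configuration can be made concrete with a small sketch. Everything here is hypothetical: the field names, default values, and the `render_golden_path` function are invented for illustration and do not come from any particular IDP product.

```python
# Standardized by the platform team; service teams cannot override these.
STANDARD = {
    "branching": "trunk-based",
    "quality_gates": ("unit-test", "functional-regression", "security-scan"),
    "logging": "structured-json",
}

# Exposed as configuration because these change from service to service.
CONFIGURABLE_DEFAULTS = {
    "service_account": None,
    "cpu": "500m",
    "memory": "512Mi",
    "target_env": "integration-test",
}


def render_golden_path(service_name: str, **overrides) -> dict:
    """Merge team overrides into the template, rejecting changes to standards."""
    locked = set(overrides) & set(STANDARD)
    if locked:
        raise ValueError(f"standardized fields cannot be overridden: {sorted(locked)}")
    unknown = set(overrides) - set(CONFIGURABLE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return {"service": service_name, **STANDARD, **CONFIGURABLE_DEFAULTS, **overrides}


if __name__ == "__main__":
    # A team sizes its service differently but inherits every standard.
    print(render_golden_path("orders-backend", memory="1Gi"))
```

Keeping the standardized part in one place is what makes central patching practical: updating `STANDARD` changes every service rendered from the template, while sizing and service accounts stay under team control.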
Step 3: Wire Self-Service Workflows (Without Tickets) Besides golden path templates, IDPs should aim to be a one-stop shop for developers, so common requests should be available for self-service. Your existing ticket/ITSM systems can be a good source for creating the backlog. Identify the most common requests and start automating them in priority order. In many cases, a ticket continues to be useful even in the self-service model for tracking and approvals, which can be integrated into the automatic workflow. Approvals should be provided automatically based on defined criteria, and only require human approvals when the request is outside of those parameters, such as access to restricted data, use of expensive resources, and non-standard requests. Over time, developers should be able to request new features through a transparent feature backlog and voting mechanism to engage the community. When creating new features, keep things common wherever possible and provide ways for users to tailor their requests. For example, the standard deployment process might define a step for secrets injection, but some teams will tailor the process to skip it as necessary. This approach has two advantages: It creates a common language and process across teams and reduces the work to build and maintain the IDP. Spending a bit more time up front to create customizability pays off over the medium and long term. For our backend service, the first service we define is deployment to the integrated test environment. Step 4: Standardize Delivery With CI/CD + GitOps + IaC in One Flow The principle of the golden path deployment process remains unchanged: You build a software artifact once, and you deploy it multiple times along the environment path. For our backend service, promotion should happen through a versioned change (think GitOps) to the desired environment state, so application version, infrastructure definition, and deployment evidence remain traceable together. 
In the build stage, code is prepared in any pre-compile steps, then compiled and packaged with all necessary configuration files. In the deployment process, environment variables are injected, and the package is deployed to the target environment, which is scripted as Infrastructure as Code. The validation itself is usually layered: a technical validation to confirm that the deployment was correct, functional regression of core functionality, and testing the new changes. This sequence is based on speed of feedback, which is important in an automated IDP service. When a validation check fails, the golden path needs to have defined failure behavior with clear steps to execute. Pipeline failures like a broken build, failed test, or policy violation will block progression automatically. If the environment is materially impacted, a rollback is automatically initiated. Only in rare cases should a human evaluation be required — for example, when the level of ambiguity is too high and impacts stakeholders who are using the environment. Some policy violations can be treated with time-bound exceptions, such as allowing a new security vulnerability in a non-production environment. This allows functional testing to continue while the team remediates the security vulnerability. Prior to going live, the exception would be removed so the security vulnerability doesn’t progress to production. These types of exceptions should be set to auto-expire to prevent them from being forgotten later. 
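A time-bound exception that expires on its own can be modeled as data the policy gate consults on every run. The sketch below is one possible shape under stated assumptions: the `PolicyException` type, the finding identifier, and the environment names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyException:
    finding_id: str    # hypothetical scanner finding identifier
    environment: str   # the only environment the exception applies to
    expires_on: date   # last day the exception is honored

    def is_active(self, today: date, environment: str) -> bool:
        return environment == self.environment and today <= self.expires_on


def gate_allows(finding_id: str, environment: str, today: date, exceptions) -> bool:
    """Block by default; pass only when an unexpired exception covers this
    finding in this environment."""
    return any(
        e.finding_id == finding_id and e.is_active(today, environment)
        for e in exceptions
    )


if __name__ == "__main__":
    granted = [PolicyException("VULN-1", "integration-test", date(2026, 3, 31))]
    print(gate_allows("VULN-1", "integration-test", date(2026, 3, 15), granted))
    print(gate_allows("VULN-1", "integration-test", date(2026, 4, 1), granted))
    print(gate_allows("VULN-1", "production", date(2026, 3, 15), granted))
```

Because expiry is evaluated at check time, nobody has to remember to remove the exception, and scoping it to one environment keeps the finding from progressing to production.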
Golden Path Steps and Guardrails

| Step | Self-service action | Guardrail | Evidence |
| --- | --- | --- | --- |
| Build | Trigger pipeline via check-in action in source control | Code scan and unit test results | Build log, composition scan result |
| Promote to non-prod environment | Merge to staging branch, promotion request | Technical validation, regression test | Test results |
| Promote to prod | Promotion request | Approval and compliance check | Approval and audit trail |
| Rollback | Automated trigger or manual request | Post-rollback validation and regression test | Test results |

Step 5: Bake in Operability for Observability and Day-2 Readiness
IDPs reduce cognitive load and toil as solutions to common concerns are built in. This is especially true for the operational concerns. Each workflow and self-service feature creates the log files and traces for auditability. All code and configuration are driven from version control, and the metrics recorded provide insights into the outcomes and performance of the IDP. New operational initiatives, like introducing a software bill of materials, can be rolled out across all technologies that use the IDP. When done correctly, templates can be updated centrally, and the log files provide full auditability to identify where old versions are still in use, reducing the overall security exposure. The IDP governance model needs to define the ownership of templates and any inheritance rules. For instance, some teams will tailor the template by adding steps required for their technology. Alongside the IDP instrumentation, standard dashboards and alert definitions ship with the template, pre-wired to the appropriate ownership group. Who responds to what is documented, not assumed. Runbooks and escalation paths are stored in version control alongside the service itself so they evolve with the system rather than rotting in a forgotten wiki page.
Our backend service will include the following with the golden path:
- Logs, metrics, and traces
- Alerts
- Runbook link
- Ownership metadata
The final piece is the feedback loop. Incidents, near-misses, and recurring friction points are resolved, and each one also becomes a backlog item that feeds continuous improvement of the platform.
Step 6: Add Guardrails and Governance Without Slowing Delivery
The IDP should leverage approved templates where possible and embed basic compliance and policy checks in the workflows. Developers on the platform will receive immediate feedback on any problems they need to fix. When issue resolution requires a longer time, time-bound exceptions can be allowed. Along the environment path from development to production, the quality gates should become more restrictive as the software quality improves. For our backend service, we define security scanning prior to deployments, and we don’t accept any deviations from the corporate standard for it. We follow a simple block, warn, escalate paradigm. The goal is to address problems that teams can deal with immediately and provide enough time for more complex work. This balance allows work to flow at pace. It is important to version templates and workflows so you can track what is in use. When significant problems are identified with a version, you can use the IDP logs to find any items in use and replace them quickly. Having the right guardrails in place might feel restrictive but in fact reduces the amount of rework over time as there are fewer incidents. Fast feedback reduces the time it takes to resolve problems.
Step 7: Measure Adoption, DevEx, and Platform ROI
One of the key success factors for IDPs is having the ability to measure adoption (covered earlier), developer experience, and platform ROI (e.g., DORA, SPACE). This allows you to break down and distinguish between adoption measures and outcome metrics. Implementing these criteria in the platform from the beginning captures data systematically.
Good adoption measures to start with: number of executed workflows, number and currency of templates, and number of active users. The following outcome metrics can also be used as part of the business case for IDPs: deployment failure rate, MTTR, incident volumes, number of tickets, and security vulnerabilities. The team managing the IDP should actively use the metrics together with captured feedback from the user base (e.g., feature requests) to prioritize the backlog. Executive dashboards should be implemented to provide accountability and increase support across the organization.
A Minimal IDP You Can Scale
Bringing it together, take the following actions to kick-start your internal developer platform:
- Choose a common and not too complex workflow for your first golden path
- Create the code repository and CI/CD pipeline
- Define a self-service UI for the workflow
- Embed quality gates, metrics, and operational tooling into the workflow
Start with one workflow for one pilot team, prove the path, then extend to the next workflow or team. Don’t forget to engage with the pilot users to receive feedback and support adoption. If you want to dive deeper, explore the CNCF Platforms for Cloud-Native Computing whitepaper and Platform Engineering Maturity Model.
Industrial ERPs often look structured on the surface: item IDs, purchase orders, stock levels. But in many companies, they are overloaded with unintentional duplicates because the most important information is buried inside an unstructured description field. In heavy industry, descriptions are entered manually with little guidance or validation, so the same part shows up many times under different names. A single component might live three separate lives in the system:
- “Bosch Pump”
- “Pump, Bosch, 1500w”
- “Pmp Bsch hydraulic”
A domain expert would usually conclude that, with very high probability, these items refer to the same piece of equipment. To an ERP, these items are three different products. Multiply this by millions of line items, multiple sites, multiple languages, and years of legacy records, and you get a familiar outcome: data exists, but it is often too inconsistent to use efficiently. In practice, this impacts two workflows that most teams care about:
- Search: Teams can’t reliably find what is already in storage, especially when they don’t know the exact naming used in the ERP.
- Deduplication: Procurement keeps purchasing “new” parts that already exist, and dead stock accumulates over time.
A common goal is to make descriptions usable for downstream automation by extracting a homogeneous, decision-ready structure from raw text. This is where LLM-powered deep parsing can be useful. LangChain is a practical framework for implementing this as a repeatable, testable pipeline rather than a collection of ad-hoc prompts.
Traditional Approaches and Where They Tend to Break
Full String Matching
Works when descriptions are clean and strictly validated. In real operations, that level of discipline is not common, so full matching is usually a starting point rather than a long-term solution.
Rules and Regex
Safe, deterministic, explainable. Also fragile. Descriptions differ by plant, vendor, technician, language, and era.
As the scope expands to new item categories, maintaining rules often becomes a major effort, and coverage gaps show up quickly.
Approximate String Matching
Useful for typos. Less useful for meaning. It won’t reliably capture that “hydraulic” and “liquid pressure” are related contexts, and it often struggles with short abbreviations that carry important signals. Selecting and tuning fuzzy matching across many categories is usually a constant trade-off.
Semantic Search
Semantic search converts a query and each item into embeddings and retrieves nearest neighbors by vector distance, much as distance is calculated between two locations on a map. This is helpful in messy industrial text because it improves recall even when wording differs. But industrial data is a place where small textual differences can imply big functional differences. An 11 mm wrench and a 21 mm wrench may sit close in semantic space because they share most context (“wrench”, tool category, usage), yet they are not interchangeable. For this reason, semantic search alone is usually not sufficient for reliable deduplication or for a strict “find exact part” search. It can help find candidates, but you typically still need a precision layer based on extracted attributes.
What Deep Parsing Changes in Practice
Efficient industrial search is not only about typos and semantics. Most of the time, it requires extracting the characteristics that drive interchangeability:
- Manufacturer/brand
- Category (bolt, bearing, motor, pump)
- Key specs (dimensions, voltage, pressure, thread type, strength class)
- Normalized units
- Standards and constraints relevant to the business
LLMs can help because they can follow high-level instructions like “extract manufacturer and relevant specs” while handling inconsistent phrasing, abbreviations, and word order. The catch is that LLMs are not “born” with your domain knowledge.
They may not know internal abbreviations, preferred naming conventions, or which standard matters for a specific category. Fine-tuning a large model is often infeasible and, even when feasible, frequently not the most cost-effective path. A practical approach is to provide domain context at runtime (RAG) and enforce structured outputs with validation.
Parsing and Processing Pipeline
Raw Item Descriptions + Metadata
Input is usually not just a free-text description. Metadata often matters for correct interpretation and downstream decisions:
- Item category (if available)
- Vendor or manufacturer field (often partially filled)
- Unit of measure, procurement group, material group
This metadata is not always reliable, but even imperfect signals can help with routing and validation.
Schema Generation
Industrial catalogs do not fit one schema. A bolt, a bearing, and an electric motor have different attributes. Trying to force everything into one universal schema typically produces either a huge, sparse structure or a generic output that is not very usable. A practical alternative is to generate a category-aware schema:
- Choose or infer the category (via metadata or a lightweight LLM classification)
- Let the LLM generate a category-specific template that lists which fields matter for this category. RAG can help here by supplying instructions and guidelines that make this crucial step more specific.
- Add attribute dictionaries via RAG (allowed values, synonyms, common aliases, unit conventions)
At this point, your output is a Pydantic schema that was generated by the LLM with the help of internal company knowledge.
LLM Parsing
Once it is known what needs to be extracted, the parsing step focuses on turning raw text into a typed record, keeping the output consistent across items in the same category thanks to the schemas generated earlier. Two practical points tend to matter:
- Abbreviations and internal naming conventions are often the hardest part of the description. RAG can help to ease the pain here
- Standards and additional instructions help the model disambiguate what a term means in context
The output should be structured and typed (numbers as numbers, units normalized, enums constrained), not a free-form explanation. LangChain, for example, handles this through its structured output support.
Validation
Parsing alone is not enough for operational use. You usually need validation and a decision layer, especially if the output drives deduplication. In many deployments, the match decision is not “LLM decides duplicates.” It is “LLM extracts structure, rules decide,” with an optional LLM contribution for borderline cases or for generating human-readable rationales for review queues.
Parsed Output
The final output is a normalized JSON-like record you can store alongside the original description. It can also include confidence indicators and validation flags. This parsed layer becomes useful across multiple workflows:
- Structured filters and faceted search (category, manufacturer, normalized specs)
- Better deduplication at ingestion time (flag likely duplicates before creating a new item)
- Reporting and governance (how much of the catalog has missing critical attributes)
- Downstream inventory optimization, where “same part” needs to be defined consistently
Common Failure Points
Hallucinations
LLMs can invent values, especially when inputs are thin or ambiguous. In practice, this is usually handled by enforcing strict schemas, grounding the model with retrieved domain context, running deterministic validators for units and plausible ranges, and routing low-confidence outputs to human review instead of letting the pipeline silently guess.
Data Quality Threshold
If the description is something like “Bearing 10”, there may not be enough signal to extract a meaningful structure. The practical approach here is to treat deep parsing as decision support rather than a guaranteed reconstruction of missing data.
It helps to store uncertainty explicitly (for example, leaving required fields empty with a reason) and route such items into enrichment workflows using supplier documentation, historical purchase orders, or technician prompts.

Summary

For industrial inventory, deep parsing is most effective when treated as a structured extraction pipeline: raw descriptions and metadata are converted into a category-aware schema, parsed by an LLM, validated with rules, and written out as normalized structured data. RAG supports each stage with the right knowledge (templates and dictionaries for schema generation, abbreviations and standards for parsing, tolerances and edge cases for validation). LangChain is a practical way to orchestrate this end-to-end with enforced output contracts, retrieval where it matters, and enough testing and tracing to operate it in production. The result is a structured layer that typically improves search and enables scalable deduplication without an ERP rewrite.
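The "LLM extracts structure, rules decide" split described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ParsedItem`, `normalize_size`, `validate`, `same_part`, the unit table, and the plausibility range are all hypothetical names and values, and stdlib dataclasses stand in for the Pydantic schema the pipeline would generate.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical unit table for this example; in the pipeline it would come
# from the attribute dictionaries retrieved via RAG.
TO_MM = {"mm": 1.0, "cm": 10.0, "in": 25.4}


@dataclass(frozen=True)
class ParsedItem:
    """Typed record standing in for a category-specific schema (assumed fields)."""
    category: str
    manufacturer: Optional[str]
    size_mm: Optional[float]


def normalize_size(value: float, unit: str) -> float:
    """Convert a size to millimetres; unknown units raise instead of guessing."""
    if unit not in TO_MM:
        raise ValueError(f"unknown unit: {unit}")
    return round(value * TO_MM[unit], 3)


def validate(item: ParsedItem) -> list[str]:
    """Deterministic checks; any returned flag routes the item to review."""
    flags = []
    if item.size_mm is None:
        flags.append("missing_size")
    elif not 1 <= item.size_mm <= 200:  # assumed plausible range
        flags.append("implausible_size")
    if item.manufacturer is None:
        flags.append("missing_manufacturer")
    return flags


def same_part(a: ParsedItem, b: ParsedItem) -> bool:
    """Rules decide duplicates: critical attributes must be present and equal."""
    if validate(a) or validate(b):
        return False  # uncertain records are never auto-merged
    return (a.category, a.manufacturer, a.size_mm) == (b.category, b.manufacturer, b.size_mm)


w11 = ParsedItem("wrench", "ACME", normalize_size(11, "mm"))
w21 = ParsedItem("wrench", "ACME", normalize_size(21, "mm"))
w11_cm = ParsedItem("wrench", "ACME", normalize_size(1.1, "cm"))
thin = ParsedItem("bearing", None, None)  # e.g. parsed from "Bearing 10"
```

Here the 11 mm and 21 mm wrenches stay distinct even though an embedding search might return them as neighbors, while the record parsed from a thin description is flagged for enrichment rather than merged.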
After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity.

Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations.

In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas.

The Problem: Why Traditional Integration Fails at Scale

When systems integrate without contracts, you hit three major problems:

1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions.
2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against.
3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth.
Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation.

What Contract-First Actually Means

Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification.

A complete contract includes:

- Operations: Endpoints (REST), topics (Kafka), or tables (database)
- Data shapes: Request/response DTOs, event schemas, column definitions
- Validation rules: Required fields, constraints, data types
- Error model: HTTP status codes, error payloads, dead-letter queues
- Non-functional rules: Idempotency, retries, compatibility policies, SLAs

Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment.

Contract Type 1: REST API Contracts With OpenAPI

OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source.
OpenAPI Contract Example

Here's a production OpenAPI contract for an order management API:

YAML
openapi: 3.2.0
info:
  title: Orders API
  version: 1.0.0
  description: Contract-first REST API for order management
paths:
  /v1/orders:
    post:
      operationId: createOrder
      summary: Create a new order
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateOrderRequest'
      responses:
        '201':
          description: Order created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OrderResponse'
        '400':
          description: Validation error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '409':
          description: Idempotency conflict
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
  /v1/orders/{orderId}:
    get:
      operationId: getOrder
      parameters:
        - name: orderId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Order found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OrderResponse'
        '404':
          description: Order not found
components:
  schemas:
    CreateOrderRequest:
      type: object
      required: [customerId, items]
      properties:
        customerId:
          type: string
          example: CUST-123
        idempotencyKey:
          type: string
          description: Optional key for safe retries
        items:
          type: array
          minItems: 1
          items:
            $ref: '#/components/schemas/OrderItem'
    OrderItem:
      type: object
      required: [sku, quantity]
      properties:
        sku:
          type: string
          example: SKU-001
        quantity:
          type: integer
          minimum: 1
    OrderResponse:
      type: object
      required: [orderId, customerId, status, items, timestamp]
      properties:
        orderId:
          type: string
        customerId:
          type: string
        status:
          type: string
          enum: [CREATED, REJECTED]
        items:
          type: array
          items:
            $ref: '#/components/schemas/OrderItem'
        timestamp:
          type: string
          format: date-time
    ErrorResponse:
      type: object
      required: [code, message, traceId, timestamp]
      properties:
        code:
          type: string
          enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR]
        message:
          type: string
        traceId:
          type: string
        timestamp:
          type: string
          format: date-time

Implementing the Provider Side (Spring Boot)

The contract drives the implementation. Your Spring Boot controller implements what the contract specifies:

Java
@RestController
@RequestMapping("/v1/orders")
@RequiredArgsConstructor
@Slf4j
public class OrderController {

    private final OrderService orderService;

    /**
     * POST /v1/orders
     * Contract: contracts/openapi/orders-api.v1.yaml
     */
    @PostMapping
    public ResponseEntity<OrderResponse> createOrder(
            @Valid @RequestBody CreateOrderRequest request) {
        log.info("Creating order for customer: {}", request.customerId());
        OrderResponse response = orderService.createOrder(request);
        return ResponseEntity
                .status(HttpStatus.CREATED)
                .body(response);
    }

    /**
     * GET /v1/orders/{orderId}
     */
    @GetMapping("/{orderId}")
    public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) {
        OrderResponse response = orderService.getOrder(orderId)
                .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId));
        return ResponseEntity.ok(response);
    }
}

Critical implementation details:

- DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI spec
- HTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflicts
- Validation is enforced: @Valid annotation ensures request validation matches OpenAPI constraints
- Error responses are standardized: All errors return ErrorResponse with consistent structure

Consumer Parallel Development

Here's where contract-first shines. While your team implements the provider, the consumer team can:

- Generate a Java client from the OpenAPI spec using openapi-generator
- Run a mock server that returns valid responses based on the contract
- Write integration tests against the mock
- Switch to the real service when it's ready (no code changes needed)

The consumer doesn't wait for you to finish. They develop in parallel.
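The consumer-side workflow above (mock first, real service later) depends on one check: does a response conform to the contract? A minimal sketch of that check follows, written in Python for brevity rather than the article's Java stack. The hand-rolled `contract_violations` validator and the `mock_create_order` provider are hypothetical stand-ins; a real project would rely on a generated client or a full OpenAPI/JSON Schema validator.

```python
# Subset of the OrderResponse schema from the contract, expressed as a dict.
ORDER_RESPONSE = {
    "required": ["orderId", "customerId", "status", "items", "timestamp"],
    "properties": {
        "orderId": {"type": "string"},
        "customerId": {"type": "string"},
        "status": {"type": "string", "enum": ["CREATED", "REJECTED"]},
        "items": {"type": "array"},
        "timestamp": {"type": "string"},
    },
}

PY_TYPES = {"string": str, "array": list, "integer": int}


def contract_violations(payload: dict, schema: dict) -> list[str]:
    """Return readable violations; an empty list means the payload conforms."""
    problems = [f"missing required field: {name}"
                for name in schema["required"] if name not in payload]
    for name, rules in schema["properties"].items():
        if name not in payload:
            continue
        value = payload[name]
        if not isinstance(value, PY_TYPES[rules["type"]]):
            problems.append(f"wrong type for {name}")
        elif "enum" in rules and value not in rules["enum"]:
            problems.append(f"unexpected value for {name}: {value}")
    return problems


def mock_create_order(request: dict) -> dict:
    """Hypothetical mock provider the consumer team uses before the real one exists."""
    return {
        "orderId": "ORD-1",
        "customerId": request["customerId"],
        "status": "CREATED",
        "items": request["items"],
        "timestamp": "2024-01-01T00:00:00Z",
    }


good = mock_create_order({"customerId": "CUST-123",
                          "items": [{"sku": "SKU-001", "quantity": 1}]})

# Simulate provider drift: a renamed field and a status outside the enum.
drifted = {**good, "status": "PENDING"}
del drifted["customerId"]
drifted["customer_id"] = "CUST-123"

good_report = contract_violations(good, ORDER_RESPONSE)
drift_report = contract_violations(drifted, ORDER_RESPONSE)
```

Running the same check against the mock and later against the real service is what lets the consumer switch over without changing its tests.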
Contract Type 2: Event Contracts With Kafka and Avro

Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated).

Topic Semantics Contract

Document the operational contract for each topic:

Markdown
## Topic: orders.order-created.v1
- **Purpose**: Emitted when an order is created successfully
- **Key**: orderId (partition affinity per order)
- **Delivery**: At-least-once (consumers must be idempotent)
- **Consumer requirement**: Deduplicate by eventId
- **Retry policy**: Consumer retries transient errors
- **DLQ**: orders.order-created.v1.dlq for poison messages
- **Compatibility**: Backward compatible schema evolution required

Avro Schema Contract

The Avro schema is your machine-validated contract:

JSON
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.acme.events",
  "doc": "Event emitted when an order is successfully created",
  "fields": [
    { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" },
    { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" },
    { "name": "orderId", "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." },
    { "name": "items", "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "OrderItem",
          "fields": [
            {"name": "sku", "type": "string"},
            {"name": "quantity", "type": "int"}
          ]
        }
      }
    }
  ]
}

Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events.
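Why a default makes the new field safe in both directions can be shown with a simplified model of reader-side schema resolution. This is a sketch in Python, not Avro itself: `resolve` and the dict-based schemas are hypothetical and only mimic the rule that a reader fills a missing field from its default and ignores fields it does not know.

```python
_MISSING = object()  # sentinel: distinguishes "no default" from a default of None

# Field name -> default, mirroring OrderCreated before and after the optional
# "source" field was added (other fields omitted for brevity).
SCHEMA_V1 = {"eventId": _MISSING, "orderId": _MISSING}
SCHEMA_V2 = {"eventId": _MISSING, "orderId": _MISSING, "source": None}


def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema.

    Fields unknown to the reader are dropped; fields the writer did not send
    are filled from the reader's default, or rejected if there is none.
    """
    resolved = {}
    for name, default in reader_schema.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not _MISSING:
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


old_event = {"eventId": "e-1", "orderId": "o-1"}
new_event = {"eventId": "e-2", "orderId": "o-2", "source": "WEB"}

new_reader_old_event = resolve(old_event, SCHEMA_V2)  # default fills "source"
old_reader_new_event = resolve(new_event, SCHEMA_V1)  # "source" is ignored

# A hypothetical breaking change: a new field with no default.
SCHEMA_BAD = {**SCHEMA_V1, "channel": _MISSING}
```

Resolving `old_event` against `SCHEMA_BAD` raises, which is the situation a registry compatibility check is meant to reject before any consumer sees it.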
Kafka Producer Implementation

Java
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderEventPublisher {

    private final KafkaTemplate<String, Object> kafkaTemplate;

    public void publishOrderCreated(OrderCreated event) {
        String key = event.getOrderId();
        log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId());

        CompletableFuture<SendResult<String, Object>> future =
                kafkaTemplate.send("orders.order-created.v1", key, event);

        future.whenComplete((result, ex) -> {
            if (ex == null) {
                log.info("Published OrderCreated: orderId={}, partition={}, offset={}",
                        key,
                        result.getRecordMetadata().partition(),
                        result.getRecordMetadata().offset());
            } else {
                log.error("Failed to publish OrderCreated: orderId={}", key, ex);
            }
        });
    }
}

Production concerns addressed:

- Key-based partitioning: Using orderId as the key ensures all events for the same order go to the same partition, maintaining ordering
- Async publishing with callbacks: Non-blocking publish with explicit success/failure handling
- Structured logging: Captures partition and offset for troubleshooting

Kafka Consumer With Idempotency

At-least-once delivery means duplicates are possible. Consumers must deduplicate:

Java
@KafkaListener(topics = "orders.order-created.v1", groupId = "billing-service")
public void onOrderCreated(OrderCreated event) {
    // Check if already processed
    if (processedEventsRepository.existsByEventId(event.getEventId())) {
        log.debug("Skipping duplicate event: {}", event.getEventId());
        return;
    }

    // Process event
    billingService.createInvoice(
            event.getOrderId(),
            event.getCustomerId(),
            event.getItems()
    );

    // Mark as processed
    processedEventsRepository.save(
            new ProcessedEvent(event.getEventId(), Instant.now())
    );
}

Idempotency pattern: Check eventId before processing, store it after processing. If the same event arrives twice, the second one is ignored. For this to hold under crashes and concurrent redelivery, the invoice and the processed-event record should be committed in the same transaction, backed by a unique constraint on eventId; otherwise a failure between the two steps lets the same event be processed twice.
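One way to close the gap between "process" and "mark as processed" is to let the database enforce deduplication inside a single transaction. The sketch below uses Python and an in-memory SQLite database purely for illustration; the table names and the `on_order_created` handler are hypothetical, and the article's Java service would do the equivalent with a transactional method and a unique key on the event ID.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key on event_id is what actually rejects duplicates.
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE invoices (order_id TEXT)")
    return conn


def on_order_created(conn: sqlite3.Connection, event: dict) -> bool:
    """Handle one event; returns False when it was a duplicate.

    The dedup marker and the invoice are written in one transaction, so a
    failure between the two statements rolls both back and redelivery is safe.
    """
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                         (event["eventId"],))
            conn.execute("INSERT INTO invoices (order_id) VALUES (?)",
                         (event["orderId"],))
        return True
    except sqlite3.IntegrityError:
        return False  # event_id already recorded: skip the duplicate


conn = make_db()
event = {"eventId": "evt-1", "orderId": "ORD-1"}
first = on_order_created(conn, event)
second = on_order_created(conn, event)  # simulated redelivery
invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

Inserting the marker first and relying on the constraint, rather than checking with a separate query, also covers two consumers handling the same event at the same time.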
Architecture Diagram: Contract-First Flow

Figure 1: Contract-first development flow showing parallel team development

Note: Solid arrows represent generation or implementation flow. Dashed arrows represent validation relationships.

Contract Type 3: Database Contracts With Flyway

Database schemas are contracts, too. Flyway migrations let you version and evolve schemas with the same contract-first approach.
Base Schema Migration

File: V1__create_orders.sql

SQL
CREATE TABLE orders (
    id VARCHAR(32) PRIMARY KEY,
    customer_id VARCHAR(32) NOT NULL,
    status VARCHAR(16) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    order_id VARCHAR(32) NOT NULL REFERENCES orders(id),
    sku VARCHAR(64) NOT NULL,
    quantity INT NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, sku)
);

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Schema Evolution: Expand/Migrate/Contract

When you need to add a field without breaking old code, use the expand/migrate/contract pattern:

Expand (V2__add_order_source.sql):

SQL
ALTER TABLE orders ADD COLUMN source VARCHAR(32);

Migrate (application code):

SQL
-- Backfill for existing rows
UPDATE orders SET source = 'UNKNOWN' WHERE source IS NULL;

Contract (V3__enforce_source_not_null.sql):

SQL
-- After all systems produce source
ALTER TABLE orders ALTER COLUMN source SET NOT NULL;

This three-phase approach lets you deploy schema changes without downtime.

CI/CD Enforcement: Making Contracts Real

Contracts only work if they're enforced.
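Here are the CI gates that prevent breaking changes:

REST API: OpenAPI Breaking Change Detection

Run openapi-diff in CI to catch breaking changes:

YAML
# .github/workflows/api-contract-check.yml
- name: Check for API breaking changes
  run: |
    npx openapi-diff \
      main:contracts/openapi/orders-api.v1.yaml \
      HEAD:contracts/openapi/orders-api.v1.yaml \
      --fail-on-breaking

Treat the ref:path arguments and the --fail-on-breaking flag as illustrative. Diff tools differ in their command-line options, and many expect two spec files on disk (for example, the base version extracted with git show), so check the options of the tool you install.

This fails the build if you:

- Remove required fields
- Change field types
- Remove endpoints
- Change HTTP status codes

Kafka: Schema Registry Compatibility Check

Configure producers to register schemas with Schema Registry, which enforces the compatibility policy:

Properties files
# Schema Registry configuration
confluent.schema.registry.url=http://schema-registry:8081
spring.kafka.properties.schema.registry.url=http://schema-registry:8081
# Serializer behavior: how schemas are registered and looked up
spring.kafka.producer.properties.auto.register.schemas=true
spring.kafka.producer.properties.use.latest.version=true

The compatibility level itself (BACKWARD by default) is configured in Schema Registry, globally or per subject; the client properties above only control how the serializer registers and looks up schemas. When a producer tries to register an incompatible schema, the registry rejects it and serialization fails on the first send. To catch this before production, run the same compatibility check against the registry in CI.

Database: Flyway Validation

Flyway validates migrations on startup:

Properties files
spring.flyway.validate-on-migrate=true
spring.flyway.baseline-on-migrate=false

If an already-applied migration file was edited, or the recorded migration history no longer matches the migration files, Flyway detects the mismatch and fails the deployment. Manual changes made directly to tables are not covered by this check, because validation compares migration checksums and history rather than the live schema.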
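To make the REST gate above more concrete, here is a minimal sketch of what a breaking-change check does with two versions of a response schema. It is written in Python for brevity and is not how any particular diff tool is implemented; `breaking_changes` and the sample schemas are hypothetical, and only three rule types are covered.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two response schemas (old = base branch, new = PR branch)."""
    problems = []
    old_props, new_props = old["properties"], new["properties"]
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed field: {name}")
        elif new_props[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name in old.get("required", []):
        # Consumers may rely on a required response field always being present.
        if name in new_props and name not in new.get("required", []):
            problems.append(f"no longer required: {name}")
    return problems


base = {
    "required": ["orderId", "customerId"],
    "properties": {"orderId": {"type": "string"}, "customerId": {"type": "string"}},
}

# The rename from the article's example: customerId -> customer_id.
renamed = {
    "required": ["orderId", "customer_id"],
    "properties": {"orderId": {"type": "string"}, "customer_id": {"type": "string"}},
}

# An additive change: a new optional field.
additive = {
    "required": ["orderId", "customerId"],
    "properties": {**base["properties"], "source": {"type": "string"}},
}

rename_report = breaking_changes(base, renamed)
additive_report = breaking_changes(base, additive)
exit_code = 1 if rename_report else 0  # a CI step would fail on a non-zero code
```

The rename is reported as a removed field and would fail the build, while the additive change passes, which matches the compatibility rules the gates are meant to enforce.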
Integration Flow Diagram

Figure 2: End-to-end flow showing REST → DB → Kafka integration with contracts enforced at each boundary

Real-World Benefits: Production Implementation Experience

After implementing contract-first across multiple services (orders, billing, inventory), teams consistently observe significant improvements:

Integration quality:

- Substantial reduction in integration bugs reaching production
- Most remaining bugs are business logic issues, not contract mismatches
- Clear contracts eliminate ambiguity in integration expectations

Development velocity:

- Integration cycles measured in weeks rather than months
- Consumer teams start integration work immediately instead of waiting for provider completion
- Mock servers enable realistic integration testing without coordination delays

Operational reliability:

- CI-enforced contract validation prevents breaking changes from reaching production
- Breaking changes caught during PR review, not after deployment
- Automated validation ensures compatibility before merge

Documentation accuracy:

- Documentation generated from OpenAPI specs stays synchronized with implementation by design
- Swagger UI always reflects actual API behavior as both derive from the same contract
- No manual documentation maintenance required

Common Pitfalls and How to Avoid Them

Pitfall 1: Treating Contracts as Documentation

- Wrong approach: Write code first, generate OpenAPI from annotations later.
- Right approach: Write OpenAPI first, then generate server stubs or validate the implementation against it.

Pitfall 2: Skipping CI Validation

- Wrong approach: Trust developers to manually check compatibility.
- Right approach: Automate breaking change detection in CI. Make it a required check.

Pitfall 3: No Schema Evolution Strategy

- Wrong approach: Add fields without defaults, breaking old consumers.
- Right approach: All new fields must be optional or have defaults. Test with multiple schema versions.
Pitfall 4: Ignoring Idempotency

- Wrong approach: Assume exactly-once delivery in Kafka.
- Right approach: Design consumers to deduplicate by eventId. Assume at-least-once.

Pitfall 5: Coupling Contracts to Implementation

- Wrong approach: Expose internal database schema directly in API contracts.
- Right approach: Design contracts for consumers, not internal convenience. Use DTOs that map between the external contract and the internal model.

When Contract-First Makes Sense

Contract-first isn't always the answer. Use it when:

- ✅ Multiple teams integrate: The coordination cost justifies the upfront contract design effort
- ✅ Public APIs or partner integrations: External consumers need stability and clear documentation
- ✅ Microservices architecture: Services must evolve independently without breaking dependents
- ✅ High change frequency: CI validation catches breaking changes early when changes are frequent

Skip contract-first when:

- ❌ Single small team: Coordination overhead is low, so formal contracts add friction
- ❌ Prototyping: You're exploring the problem space and expect major pivots
- ❌ Internal tools with one consumer: The provider and consumer are maintained by the same person

Key Takeaways

- Contracts enable parallel development: Provider and consumer teams work simultaneously instead of serially
- CI validation prevents breaking changes: Automate OpenAPI diffs, schema compatibility checks, and migration validation
- Idempotency is not optional: At-least-once delivery in Kafka requires consumers to deduplicate by event ID
- Schema evolution requires backward compatibility: Use nullable fields with defaults to support gradual rollouts
- Contracts are design specifications, not afterthoughts: Define contracts first, generate code from them

What's Next?
If you're implementing contract-first integration:

- Start with one integration: pick your most painful cross-team dependency
- Write the OpenAPI spec or Avro schema first, before any implementation
- Set up CI validation for that contract (openapi-diff or Schema Registry)
- Measure reduction in integration bugs and coordination overhead
- Expand to other integrations once you've proven the pattern

Contract-first isn't a silver bullet, but it transforms integration from a coordination nightmare into a governance problem that CI can solve. When you're coordinating three teams across two time zones, that shift makes the difference between shipping in weeks versus months.

Full source code: github.com/wallaceespindola/contract-first-integrations

Related reading:

- OpenAPI 3.0 Specification: spec.openapis.org
- Confluent Schema Registry: docs.confluent.io/platform/current/schema-registry
- Flyway Documentation: flywaydb.org/documentation
- Mastering Contract-First API Development: moesif.com/blog
- Research on Microservices Issues: arxiv.org
Your chaos experiments passed. Your RAG pipeline is lying to you anyway.

I've watched this play out more times than I'd like to admit. A team runs a thorough chaos suite, including pod failures, network partitions, and database failovers. Everything recovers cleanly. Dashboards stay green. The team ships with confidence.

Three weeks later, a support ticket surfaces. Then ten more. The AI is producing answers that are fluent, confident, and factually wrong. No alert fired. No SLO breached. The infrastructure never blinked.

This isn't a monitoring gap you close with a better dashboard. It's a category error in how we've defined resilience for AI systems, and until you see that distinction clearly, every chaos experiment you run is measuring the wrong thing.

The Assumption That's Been Quietly Wrong

For fifteen years, chaos engineering has operated on one core premise: the system's meaningful state is its operational state. Is it up? Does it recover? Can it handle a node failure at 2 AM? For systems built around databases, queues, and network hops, these are exactly the right questions. The entire discipline of Chaos Monkey, Gremlin, LitmusChaos, and AWS FIS was built to answer them.

Agentic AI systems break this premise at the foundation. They're not distributed systems in the traditional sense. They're reasoning systems. And reasoning systems have two states you need to care about simultaneously:

| State dimension | Traditional distributed system | Agentic AI system |
| --- | --- | --- |
| What "healthy" means | Service is up, latency within SLA | Outputs remain grounded in source truth |
| How failure manifests | 5xx errors, timeouts, crashes | Silent drift, confident wrong answers |
| Time to detect | Seconds to minutes | Days to weeks, if ever |
| Failure unit | Request or service | Behavior over time |
| Circuit breaker analogy | Trips on error rate | No native equivalent |
| What chaos tests | Infrastructure recovery | ✗ Cannot test behavioral integrity |

That last row is the entire problem.
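The "grounded in source truth" column can be turned into something measurable. As a minimal sketch, assuming a small hand-labelled golden set and a retriever exposed as a plain function, a behavioral probe could compute retrieval precision alongside the usual health checks. Every name here (`precision_at_k`, `behavioral_check`, the document IDs, the 0.6 floor) is hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are in the relevant set."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant for doc_id in top_k) / len(top_k)


def behavioral_check(golden: dict, retrieve, k: int = 3, floor: float = 0.6) -> dict:
    """Run each golden query through `retrieve` and flag those below the floor."""
    scores = {query: precision_at_k(retrieve(query), relevant, k)
              for query, relevant in golden.items()}
    failing = sorted(query for query, score in scores.items() if score < floor)
    return {"scores": scores, "failing": failing}


# Hypothetical golden set: query -> IDs of documents that should ground the answer.
GOLDEN = {"return policy": {"doc-returns-1", "doc-returns-2", "doc-returns-3"}}


def healthy_retriever(query: str) -> list[str]:
    return ["doc-returns-1", "doc-returns-2", "doc-returns-3"]


def drifted_retriever(query: str) -> list[str]:
    # Still fast and still returns three documents, but mostly the wrong ones.
    return ["doc-returns-1", "doc-shipping-9", "doc-promo-4"]


healthy_report = behavioral_check(GOLDEN, healthy_retriever)
drifted_report = behavioral_check(GOLDEN, drifted_retriever)
```

Both retrievers would pass an uptime or latency check; only the precision score separates them, so a probe of this kind would run before and after each chaos experiment.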
As Marc Bishop, Director of Business Growth at Wytlabs, put it after his team's retrieval embeddings drifted silently under catalog updates: "Resilience for AI means validating behavior under stress, not merely surviving it." I hold U.S. Patent 12242370B2 for intent-based chaos engineering, a framework that treats intent preservation, not just infrastructure recovery, as the core testable property of a resilient system. When I developed that framework, the failure mode I was targeting was a multi-domain infrastructure losing semantic coherence under adversarial conditions. I didn't fully anticipate how precisely that same problem would show up in production LLM pipelines and how fast. What's Actually Breaking: Five Failure Modes Nobody Has Named Yet You can't test for something you haven't named. The existing chaos engineering literature has no vocabulary for AI behavioral failure. Here's a working taxonomy from production accounts across 25+ engineering teams: 1. Retrieval Drift The vector retrieval layer silently shifts toward faster, lower-precision matches after a failure event. Outputs remain structurally valid but are grounded in the wrong documents. Rafael Sarim Oezdemir, Head of Growth at EZContacts, ran chaos injection on their RAG-based customer support chatbot. His infrastructure numbers post-chaos looked perfect: 99.99% uptime, clean latency recovery, green across the board. Three days later, the chatbot was answering return policy questions incorrectly in 7% of cases. Root cause: "Our chatbot started answering return policy questions incorrectly. We diagnosed the root cause as a subtle shift in retrieval precision; our pipeline was favoring quicker, less precise vector matches post-chaos. Infrastructure recovered. The behavior of the model didn't." No existing chaos tool measures retrieval precision. That's the gap. 2. 
Context Amnesia Each individual component in a multi-agent pipeline appears healthy, but the end-to-end reasoning chain becomes incoherent across hops. Luis Haberlin at CallSetter AI watched this unfold in a voice agent for an insurance brokerage: "The infrastructure was bulletproof... but often into production, agents started hearing 'I already told the robot about my home and auto' from confused callers." The agent correctly retrieved policy details early in a conversation, then lost context at the 90-second mark and restarted the needs assessment from scratch. Nothing crashed. The reasoning rotted at the handoff boundary. Jacob Kalvo, CEO of Live Proxies, hit the same wall in a market analytics pipeline: "While each summary was technically provided on schedule, there were small errors beginning to creep into the output, specific market signals being under-represented, inconsistencies developing in the logic chain, and some outputs making confident assertions regarding incorrect or misleading information." Every infrastructure check passed. The reasoning chain had silently decohered. 3. Confidence-Accuracy Decoupling The model produces high-confidence, well-formatted outputs even as accuracy degrades. The system sounds more certain as it becomes less reliable. Jayanand Sagar, COO at Hyperbola Network, saw this after a partial node recovery rebuilt the retrieval index from a stale snapshot. Output quality deteriorated over 11 days, undetected: "The model never complained. The closer the degraded output was to the original, the more convincingly it generated confident-sounding responses based on outdated context." Confidence scores are not accuracy proxies. A model grounded in a degraded context will confidently state incorrect information. No infrastructure metric tells you this is happening. 4. Intent Drift Outputs gradually decohere from the original business intent without any single triggering event. 
Behavior changes incrementally, across dozens of interactions, with no failure timestamp to anchor an investigation. Tyler Denk, CEO of beehiiv, described a system that passed every load and failure scenario correctly in testing, then shifted over longer production cycles: "The structure of responses remained intact, but subtle inconsistencies in reasoning and formatting started appearing across different workflows. Without a defined behavioral baseline, it became impossible to determine when the system had actually started drifting." 5. Epistemic Failure The model's picture of the world becomes stale or wrong, but all reasoning over that picture continues to function correctly. The system is reasoning well, about incorrect premises. Nicolas, founder of Reddinbox, runs a production AI pipeline classifying Reddit posts in real time across thousands of threads daily. "A few months back, everything looked fine. No downtime, no errors, latency normal. But output quality had quietly decayed." Reddit's content distribution had shifted, flooded with AI-generated posts that were structurally coherent but semantically hollow, and his classifier kept returning high-confidence scores on them. His diagnosis is the sharpest framing I've seen for why infrastructure chaos is blind to this failure class: "No chaos experiment would have caught that because the failure wasn't infrastructure, it was epistemic. We had zero observability on input distribution drift. We were watching the system, not what the system was consuming." Why Agentic Pipelines Make Every One of These Worse A single degraded LLM component is a tractable problem. A multi-agent pipeline turns it into something that actively resists detection. In a traditional microservice, a degraded component returns an error, trips a circuit breaker, and gets isolated. In a multi-agent pipeline, a degraded reasoning component returns a confident output that propagates forward, amplifying the failure rather than surfacing it. 
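A back-of-the-envelope calculation shows why hop count matters. The sketch below assumes each agent hop independently preserves a correct result with the same probability, which is a simplifying assumption for illustration and not a measurement from any of the teams quoted; `end_to_end_accuracy` is a hypothetical helper.

```python
def end_to_end_accuracy(per_hop_accuracy: float, hops: int) -> float:
    """Probability that a result survives every hop uncorrupted,
    assuming hops fail independently (a simplification)."""
    if not 0.0 <= per_hop_accuracy <= 1.0:
        raise ValueError("per_hop_accuracy must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_accuracy ** hops


# A component that is right 97% of the time looks healthy in isolation,
# but the chance of a clean end-to-end result shrinks with each hop.
for hops in (1, 3, 5, 8):
    print(f"{hops} hops: {end_to_end_accuracy(0.97, hops):.3f}")
```

Under these assumptions, a chain of eight such hops delivers a fully correct result only about 78% of the time, even though no individual component ever reports an error.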
Dario Ferrari, co-founder of OpenClawVPS, watched this play out firsthand when a client's RAG-based customer support system passed all infrastructure tests but then silently shifted retrieval behavior after a network partition: "AI infrastructure that survives every test but provides incorrect answers is still resilient but fails its job badly." The blast radius of an undetected reasoning failure grows with every agent hop. By the time users notice, it has compounded through multiple layers of stored state. The Missing Layer: Behavioral Assertions Brandy Hastings, SEO Strategist at SmartSites, described the realization her team came to after AI-assisted workflows passed every infrastructure check but degraded in production: "We realized our testing didn't account for output quality over time. We were validating uptime, not alignment." That gap between uptime and alignment is where every one of the five failure modes above lives. Most teams have three layers of observability, and only two of them are working: Layer 2 is where all the interesting failures live, and it's completely absent from most production stacks. Building it requires three things your current chaos practice almost certainly lacks: Behavioral contracts – not "returns a 200 response" but "returns a response with retrieval precision above threshold X when operating on a degraded index." These are the AI equivalent of SLOs, except the metric is semantic rather than operational. Intent-preserving chaos experiments – injecting failures at the data layer, retrieval layer, and reasoning layer, not just infrastructure. Each experiment needs an exit criterion that includes behavioral scoring against a fixed ground-truth set, not just recovery metrics. Post-chaos behavioral scoring – sampling outputs after every chaos run and scoring them against a baseline. 
Jayanand Sagar put a concrete benchmark on the minimum viable version: "An exponential run of chaos should pass behavioral standards to be within 3 to 5 percent of baseline scores of at least 50 sampled outputs before a system is declared stable." Jake Waldrop, Co-Founder of Recademics – a regulated outdoor safety certification platform, independently arrived at this same framing: "Semantic monitoring fills the gap between AI health and user safety by verifying what the AI is saying. My most significant change was to run adversarial prompts on standard stress tests to understand whether the model logic would collapse. Chaos engineering will have a colossal safety advantage when behavioral checks are integrated into any company operating within highly regulated industries." Oksana Fando, CDO at Truck1.eu, reached the same conclusion after equipment descriptions on their European vehicle marketplace gradually became less accurate following a data source degradation and a failure invisible to every standard metric: "We began testing the system's intent, checking whether business logic remains correct even with partial data loss." Testing system intent. That's exactly the property my patent formalizes. The fact that teams in healthcare, fintech, edtech, and European e-commerce are all independently converging on this is no coincidence. It's a structural gap making itself known. A Behavioral Observer You Can Drop In This Week The pattern is a sampling observer sitting in your serving layer. Replace _score() with RAGAS faithfulness, embedding cosine similarity, or an LLM-as-judge evaluator, depending on your quality rubric. The heuristic below is a working default: groundedness (how much of the response is anchored in retrieved docs) minus a penalty for hedging language that signals confidence erosion. 
Python
import random


class BehavioralObserver:
    def __init__(self, sample_rate=0.05, drift_threshold=0.15, baseline_size=50):
        self.sample_rate = sample_rate
        self.drift_threshold = drift_threshold
        self.baseline_size = baseline_size
        self.scores = []
        self.baseline = None

    def observe(self, prompt, response, context):
        # Score only a random sample of traffic
        if random.random() > self.sample_rate:
            return
        score = self._score(response, context)
        if self.baseline is None:  # Phase 1: build baseline
            self.scores.append(score)
            if len(self.scores) >= self.baseline_size:
                self.baseline = sum(self.scores) / len(self.scores)
            return
        drift = self.baseline - score  # Phase 2: detect drift
        if drift > self.drift_threshold:
            print(f"[DRIFT ALERT] score={score:.3f} baseline={self.baseline:.3f} drift={drift:.3f}")
            # pagerduty.trigger(...) or datadog.metric("ai.behavioral.drift", drift)

    def _score(self, response, context):
        # Groundedness: share of response terms that appear in the retrieved docs
        doc_words = set(" ".join(context.get("retrieved_docs", [])).lower().split())
        terms = response.lower().split()
        groundedness = len([t for t in terms if t in doc_words]) / max(len(terms), 1)
        # Subtract 0.05 for each hedging phrase found in the response
        hedges = ["i think", "not sure", "might be", "possibly"]
        return max(groundedness - sum(0.05 for h in hedges if h in response.lower()), 0.0)


# Drop in:
observer = BehavioralObserver()


def serve(prompt, context):
    response = your_llm_call(prompt, context)
    observer.observe(prompt, response, context)
    return response

Two things worth knowing. The 5% sample rate catches degradation while adding scoring overhead to only a small fraction of requests; at high traffic, even 1% gives you a statistically robust signal. The baseline lock after 50 samples is deliberate: running behavioral chaos against an unlocked baseline is like running load tests before you've measured normal traffic.

5 Behavioral Chaos Experiments to Run After Your Next Infrastructure Suite

These aren't replacements for your existing chaos experiments. They're additive: run them after your infrastructure suite, with behavioral scoring as the exit criterion rather than uptime recovery.
| Experiment | What you inject | What it tests | Exit criterion |
| --- | --- | --- | --- |
| Stale embedding injection | Replace embeddings with a 14-day-old snapshot | Retrieval precision under stale index | Score within 5% of baseline across 50 sampled prompts |
| Partial index degradation | Remove 30% of documents from the vector store | Graceful degradation in retrieval recall | Hallucination rate stays flat vs. baseline |
| Context window truncation | Truncate retrieved context to 40% of normal | Reasoning quality under a constrained context | Groundedness score stays above threshold |
| Agent handoff latency injection | Add 800ms delay between agent hops | Multi-agent coherence under degraded comms | End-to-end intent preserved across all hops |
| Memory poisoning simulation | Inject one factually wrong document into the retrieval store | RAG faithfulness under adversarial data | The system identifies or flags the conflicting document |

Define the exit criterion before you inject the failure. That's the same discipline your infrastructure chaos practice demands for SLO-based rollback conditions; it applies here too.

What the Field Is Actually Saying

Vitaly Yago, CEO of PhotoGov, described the shift his team made after hitting this wall in production: "We began implementing chaos for behavior, not just for infrastructure. Instead of testing whether the system will recover, we test whether the quality of decisions is maintained under noise, data changes, and successive updates."

John Russo, VP of Healthcare Technology Solutions at OSP Labs, came to the same realization after behavioral degradation appeared in a clinical AI workflow that had passed every infrastructure check: "It is no longer just about systems staying up, it is about systems staying correct under stress."

Two engineers, two completely different industries, same conclusion. The field has moved on from the question of whether AI systems survive failure. The question it's now wrestling with, without a good answer yet at scale, is whether they reason correctly after failure.
The chaos engineering discipline has fifteen years of hard-won tooling for testing the first question. It has almost nothing for the second. That's not a criticism of the existing tools. It's a signal that the discipline needs to grow a second layer.

The practitioners whose experiences shaped this article are already building it in production, because the failures forced them to. The only question for your team is whether you discover your agentic system's behavioral limits through a chaos experiment you designed, or a production incident you didn't see coming.

The Short Version: Three Things to Add Before Your Next Chaos Run

- Lock a behavioral baseline first. Sample 50–100 representative inputs and store expected outputs before injecting any failure. Your chaos experiments now have a behavioral exit criterion, not just infrastructure recovery metrics.
- Make retrieval precision a first-class signal. The most common failure vector across the teams I spoke with was RAG degradation invisible to standard monitoring. Retrieval precision scoring belongs alongside latency and error rate on your dashboards.
- Log reasoning chains, not just outputs. For multi-agent pipelines, log the reasoning path each agent used to produce its output. When that structure changes without a deployment event triggering it, that's your behavioral alert: the equivalent of a latency spike, but for the quality of reasoning.
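As a rough illustration of the first item above, the exit check below compares post-chaos scores against a locked baseline. It is a minimal sketch, not a standard tool: `passes_behavioral_exit` is a hypothetical helper, the 5% tolerance and 50-sample minimum are borrowed from the figures quoted earlier, and the scores are assumed to come from whatever evaluator you already use.

```python
from statistics import mean


def passes_behavioral_exit(baseline_scores, post_chaos_scores,
                           tolerance=0.05, min_samples=50):
    """Return True if the post-chaos mean score stays within `tolerance`
    (relative) of the baseline mean, given enough samples."""
    if len(post_chaos_scores) < min_samples:
        raise ValueError(f"need at least {min_samples} post-chaos samples")
    baseline = mean(baseline_scores)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative drop is undefined")
    relative_drop = (baseline - mean(post_chaos_scores)) / baseline
    return relative_drop <= tolerance


# Hypothetical scores in [0, 1]; real values would come from your evaluator.
baseline = [0.80] * 60
stable = [0.78] * 50      # 2.5% below baseline
degraded = [0.70] * 50    # 12.5% below baseline

print(passes_behavioral_exit(baseline, stable))    # True
print(passes_behavioral_exit(baseline, degraded))  # False
```

Raising on too few samples, rather than returning False, keeps an under-sampled run from being mistaken for a genuine behavioral failure.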
When AWS announced Lambda Durable Functions at re:Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test.

I decided to build a simple document processing workflow (an ETL pipeline with human-in-the-loop approval) using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system.

What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet.

In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging.

The Problem: Approval Workflows Are Expensive

If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours.

Traditional approaches to handling this waiting are:

- Polling: Your code keeps checking every 30 seconds ("Is it approved yet? How about now?"), making those calls the entire time.
- Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.
- External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers. It works fine, but it requires you to maintain a state machine you built yourself.

What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off.
That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern.

What We Are Building

Here is the workflow we will implement:

Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive

A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file.

The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute.

Architecture Overview

The project uses three separate SAM stacks:

Markdown
shared-resources/     # Approval API, DynamoDB, SNS (shared by both systems)
durable-functions/    # Lambda Durable Functions ETL pipeline
step-functions/       # Step Functions ETL pipeline (for comparison)

The shared approval handler serves both workflow types through a single API. When a job comes in for approval, it checks the workflowType field: if it is durable-functions, it calls send_durable_execution_callback_success; if it is step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood.

Prerequisites

Before we begin, make sure you have the following:

- AWS SAM CLI (latest version recommended)
- Python 3.14 runtime
- AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway access
- Docker for local Lambda testing

Check your SAM CLI version:

Markdown
sam --version

Step 1: Deploy Shared Resources First

Before the ETL pipeline, we need the shared infrastructure: the approval API, DynamoDB table for pending approvals, and SNS topic for notifications.
Here is the shared-resources/ SAM template:

YAML
# shared-resources/template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Shared resources for ETL approval workflow

Parameters:
  ApproverEmail:
    Type: String
    Description: Email address to receive approval notifications
    Default: [email protected]

Resources:
  PendingApprovalsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: etl-pending-approvals
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: jobId
          AttributeType: S
      KeySchema:
        - AttributeName: jobId
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true

  ApprovalNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: etl-approval-notifications
      Subscription:
        - Endpoint: !Ref ApproverEmail
          Protocol: email

  ApprovalApi:
    Type: AWS::Serverless::Api
    Properties:
      Name: ETL-Approval-API
      StageName: prod

  ApprovalHandlerFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: ETL-Approval-Handler
      CodeUri: ./src
      Handler: approval_handler.handler
      Runtime: python3.14
      MemorySize: 256
      Timeout: 30
      Environment:
        Variables:
          APPROVALS_TABLE: !Ref PendingApprovalsTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref PendingApprovalsTable
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - states:SendTaskSuccess
                - states:SendTaskFailure
                - lambda:SendDurableExecutionCallbackSuccess
                - lambda:SendDurableExecutionCallbackFailure
              Resource: '*'
      Events:
        ApproveJob:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /approve/{jobId}
            Method: POST
        RejectJob:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /reject/{jobId}
            Method: POST
        GetJobStatus:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /status/{jobId}
            Method: GET

Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach: one API that works with both workflow types.
Deploy it:

Markdown
cd shared-resources
sam build
sam deploy --guided

Step 2: The Durable Functions SAM Template

Now for the Durable Functions ETL pipeline itself. The key addition is the DurableConfig property, which tells Lambda to enable durable execution for your function.

YAML
# durable-functions/template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Lambda Durable Functions ETL Pipeline

Globals:
  Function:
    Runtime: python3.14
    Architectures:
      - arm64
    MemorySize: 512
    Timeout: 900

Resources:
  ETLOrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: ETLDurableOrchestrator
      CodeUri: ./src
      Handler: handlers/etl_handler.lambda_handler
      MemorySize: 1024
      Timeout: 900
      DurableConfig:
        ExecutionTimeout: 86400      # 24 hours for human approval
        RetentionPeriodInDays: 14    # Keep execution history for debugging
      AutoPublishAlias: live
      Policies:
        - AWSLambdaBasicExecutionRole
        - S3CrudPolicy:
            BucketName: !Sub "${RawBucketName}-${AWS::AccountId}"
        - S3CrudPolicy:
            BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
        - DynamoDBCrudPolicy:
            TableName: !Ref ETLMetadataTable
        - DynamoDBCrudPolicy:
            TableName: etl-pending-approvals
        - SNSPublishMessagePolicy:
            TopicName: etl-approval-notifications
      Events:
        S3Upload:
          Type: S3
          Properties:
            Bucket: !Ref RawDataBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: uploads/
                  - Name: suffix
                    Value: .csv
      Environment:
        Variables:
          PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
          METADATA_TABLE: !Ref ETLMetadataTable
          APPROVALS_TABLE: etl-pending-approvals
          APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn
          APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl

A few things to notice here:

- MemorySize: 1024 on the orchestrator (overrides the 512 MB global default). Since this single function does all the work, it needs more memory.
- ExecutionTimeout: 86400 – This is the total workflow duration across all invocations (24 hours).
The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.
- AutoPublishAlias: live – AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.
- S3 filter with prefix: uploads/ and suffix: .csv – Only CSV files under the uploads/ directory trigger the workflow.
- The stack imports shared resources via !ImportValue: the approval table, SNS topic, and API URL from the shared stack.

Step 3: Writing the Durable Function

This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code.

First, the individual ETL steps. Each one is a regular Python function in a separate file:

Extract

Python
import csv
import io
import boto3
import logging

logger = logging.getLogger()
s3_client = boto3.client("s3")


def extract_data(source_bucket, source_key, step_context=None):
    logger.info(f"Extracting from s3://{source_bucket}/{source_key}")
    response = s3_client.get_object(Bucket=source_bucket, Key=source_key)
    content = response["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(content))
    records = list(reader)
    schema = {
        # fieldnames is None for an empty file, so fall back to an empty list
        "columns": reader.fieldnames or [],
        "source_file": source_key,
        "file_size_bytes": response["ContentLength"]
    }
    logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns")
    return {"data": records, "record_count": len(records), "schema": schema}

Transform

Python
import logging
from datetime import datetime

logger = logging.getLogger()


def transform_data(raw_data, schema_config, step_context=None):
    logger.info(f"Transforming {len(raw_data)} records")
    valid_records, rejected_records = [], []
    for i, record in enumerate(raw_data):
        try:
            # Trim whitespace from every string field
            cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
            if not cleaned.get("id") or not cleaned.get("name"):
                rejected_records.append({"index": i, "reason": "Missing required field"})
                continue
            if "date" in cleaned:
                cleaned["date"] = normalize_date(cleaned["date"])
            cleaned["_processed_at"] = datetime.utcnow().isoformat()
            # Coerce numeric fields; unparseable values become None
            for key in ["amount", "quantity", "price"]:
                if key in cleaned and cleaned[key]:
                    try:
                        cleaned[key] = float(cleaned[key])
                    except ValueError:
                        cleaned[key] = None
            valid_records.append(cleaned)
        except Exception as e:
            rejected_records.append({"index": i, "reason": str(e)})
    return {
        "data": valid_records,
        "valid_records": len(valid_records),
        "rejected_records": len(rejected_records),
        "rejection_details": rejected_records[:100]
    }


def normalize_date(date_str):
    for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return date_str

Load

Python
import json
import boto3
import logging

logger = logging.getLogger()
s3_client = boto3.client("s3")


def load_data(transformed_data, target_bucket, target_key, step_context=None):
    logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}")
    # One JSON object per line (JSON Lines)
    output_lines = "\n".join(json.dumps(r) for r in transformed_data)
    s3_client.put_object(
        Bucket=target_bucket,
        Key=target_key,
        Body=output_lines.encode("utf-8"),
        ContentType="application/jsonl",
        Metadata={"record_count": str(len(transformed_data))}
    )
    summary = {
        "record_count": len(transformed_data),
        "columns": list(transformed_data[0].keys()) if transformed_data else [],
        "sample_records": transformed_data[:3]
    }
    return {"target_path": f"s3://{target_bucket}/{target_key}",
            "record_count": len(transformed_data),
            "summary": summary}

Notice the steps are plain Python functions: no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context.
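Because the steps are ordinary functions, they can be exercised with plain assertions and no durable runtime. The sketch below copies normalize_date from the transform step above and checks it directly; the sample inputs are illustrative and not taken from the repo's test suite.

```python
from datetime import datetime


def normalize_date(date_str):
    # Same logic as the transform step: try each known format in order.
    for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return date_str  # unrecognized formats pass through unchanged


# Each supported input format normalizes to ISO 8601 (YYYY-MM-DD).
assert normalize_date("2025-01-31") == "2025-01-31"
assert normalize_date("01/31/2025") == "2025-01-31"
assert normalize_date("31-01-2025") == "2025-01-31"
assert normalize_date("2025/01/31") == "2025-01-31"

# Anything else is returned as-is rather than raising.
assert normalize_date("not a date") == "not a date"
```

The same approach should work for transform_data and the other steps: call them with in-memory inputs and leave step_context at its default.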
Now the main ETL orchestrator that ties it all together:

Python
import json
import os
import logging
from datetime import datetime

from aws_durable_execution_sdk_python import durable_execution, DurableContext

from steps.extract import extract_data
from steps.transform import transform_data
from steps.load import load_data
from steps.finalize import finalize_job

logger = logging.getLogger()
logger.setLevel(logging.INFO)

PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET")
METADATA_TABLE = os.environ.get("METADATA_TABLE")


@durable_execution
def lambda_handler(event, context: DurableContext):
    # Handle both S3 event format and direct invocation
    if "Records" in event:
        s3_event = event["Records"][0]["s3"]
        source_bucket = s3_event["bucket"]["name"]
        source_key = s3_event["object"]["key"]
    else:
        source_bucket = event.get("bucket")
        source_key = event.get("key")

    # Generate job_id deterministically using context.step()
    job_id = context.step(
        lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-"
                  f"{source_key.split('/')[-1]}",
        name="generate-job-id"
    )
    context.logger.info(f"Starting ETL job: {job_id}")

    # Step 1: Extract
    extracted = context.step(
        lambda _: extract_data(source_bucket, source_key, None),
        name="extract-data"
    )
    context.logger.info(f"Extracted {extracted['record_count']} records")

    # Step 2: Transform
    transformed = context.step(
        lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None),
        name="transform-data"
    )

    # Step 3: Load
    loaded = context.step(
        lambda _: load_data(transformed["data"], PROCESSED_BUCKET,
                            f"processed/{job_id}/output.jsonl", None),
        name="load-data"
    )

    # --- EXECUTION PAUSES HERE ---
    # The submitter function stores the callback_id in DynamoDB
    # and sends an SNS notification to the reviewer.
    # No compute charges while waiting for approval.
    def submit_for_approval(callback_id: str, ctx):
        return notify_reviewer(job_id, callback_id, loaded["summary"])

    approval = context.wait_for_callback(
        submitter=submit_for_approval,
        name="quality-check-approval"
    )

    # Parse approval result
    if isinstance(approval, str):
        approval = json.loads(approval)

    if not approval or not approval.get("approved"):
        # approval may be None or empty here, so guard the lookup
        return {"status": "REJECTED", "job_id": job_id,
                "reason": (approval or {}).get("reason", "No reason")}

    # Step 4: Finalize (only runs after approval)
    final = context.step(
        lambda _: finalize_job(job_id, source_bucket, source_key, loaded,
                               approval, METADATA_TABLE, None),
        name="finalize-job"
    )

    return {
        "status": "COMPLETED",
        "job_id": job_id,
        "records_processed": transformed["valid_records"],
        "output_path": loaded["target_path"],
        "approved_by": approval.get("reviewer"),
        "completed_at": final["completed_at"]
    }

Let me break down the important parts:

- @durable_execution – This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.
- context.step(lambda _: ..., name="...") – Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.
- context.wait_for_callback(submitter=..., name="...") – This is the zero-cost waiting magic. The submitter function receives a callback_id, which gets stored in DynamoDB. Execution then pauses completely: Lambda saves the state, shuts down, and you stop paying.
- Determinism matters – Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently.
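To see why the timestamp has to live inside a step, here is a toy model of checkpoint and replay. ToyContext is a hypothetical stand-in written for this illustration, not the AWS SDK: it caches each step result by name, and calling the workflow twice against the same context mimics a replay after resume. A counter stands in for the clock so the result is reproducible.

```python
import itertools


class ToyContext:
    """Hypothetical stand-in for a durable context (not the AWS SDK)."""

    def __init__(self):
        self.checkpoints = {}

    def step(self, fn, name):
        # First run: execute and checkpoint. Replay: return the cached result.
        if name not in self.checkpoints:
            self.checkpoints[name] = fn(None)
        return self.checkpoints[name]


# Each next(clock) call returns a new value, like reading the current time.
clock = itertools.count(1)


def workflow(ctx):
    unsafe_id = f"job-{next(clock)}"  # re-evaluated on every replay
    safe_id = ctx.step(lambda _: f"job-{next(clock)}", name="generate-job-id")
    return unsafe_id, safe_id


ctx = ToyContext()
first = workflow(ctx)    # initial invocation
second = workflow(ctx)   # replay after resume

print(first)   # ('job-1', 'job-2')
print(second)  # ('job-3', 'job-2')
```

The value computed outside a step changes between the two runs, while the checkpointed one stays fixed, which is the property the job_id relies on.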
The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. 
Step 4: The Shared Approval Handler

When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution:

Python
# shared-resources/src/approval_handler.py (key excerpt)
if workflow_type == 'durable-functions':
    # Resume (or fail) the paused durable execution via its callback ID
    callback_id = approval_record.get('callbackId')
    if approved:
        lambda_client.send_durable_execution_callback_success(
            CallbackId=callback_id,
            Result=json.dumps(approval_response)
        )
    else:
        lambda_client.send_durable_execution_callback_failure(
            CallbackId=callback_id,
            Error='JobRejected',
            Cause=reason or 'Job rejected by reviewer'
        )
elif workflow_type == 'step-functions':
    # Step Functions resumes through the task token instead
    task_token = approval_record.get('taskToken')
    if approved:
        stepfunctions.send_task_success(
            taskToken=task_token,
            output=json.dumps(approval_response)
        )

Same API, same reviewer experience. The underlying callback mechanism is the only thing that differs.

Step 5: Deploy and Test

Deploy in order (shared resources first, since the other stacks import from it):

Shell
# 1. Deploy shared resources
cd shared-resources
sam build && sam deploy --guided

# 2. Deploy Durable Functions
cd ../durable-functions
sam build && sam deploy --guided

Generate test data:

Shell
python scripts/generate_test_data.py --count 10 --output test-data/

Upload files to trigger the workflow (note the uploads/ prefix, which the S3 filter requires):

Shell
aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive

Check approval status and approve:

Shell
# Check status
curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id>

# Approve
curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \
  -H "Content-Type: application/json" \
  -d '{"reviewer": "harpreet", "reason": "Data looks good"}'

For bulk approvals during testing, the repo includes a handy script:

Shell
./scripts/approve_all_jobs.sh

For local testing, the testing SDK supports pytest:

Shell
pip install aws-lambda-durable-execution-sdk-testing
pytest durable-functions/tests/

Step 6 (Optional): Deploy Step Functions for Comparison

If you want to reproduce my full comparison, deploy the Step Functions stack too:

Shell
cd step-functions
sam build && sam deploy --guided

Here is what the same workflow looks like in Amazon States Language:

JSON
{
  "StartAt": "ExtractData",
  "States": {
    "ExtractData": {
      "Type": "Task",
      "Resource": "${ExtractFunctionArn}",
      "ResultPath": "$.extractResult",
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "${TransformFunctionArn}",
      "ResultPath": "$.transformResult",
      "Next": "LoadData"
    },
    "LoadData": {
      "Type": "Task",
      "Resource": "${LoadFunctionArn}",
      "ResultPath": "$.loadResult",
      "Next": "WaitForApproval"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "${ApprovalFunctionArn}",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "jobId.$": "$.loadResult.job_id",
          "summary.$": "$.loadResult.summary"
        }
      },
      "TimeoutSeconds": 86400,
      "ResultPath": "$.approvalResult",
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.approvalResult.approved",
        "BooleanEquals": true,
        "Next": "FinalizeJob"
      }],
      "Default": "JobRejected"
    },
    "JobRejected": {
      "Type": "Pass",
      "Result": { "status": "REJECTED" },
      "End": true
    },
    "FinalizeJob": {
      "Type": "Task",
      "Resource": "${FinalizeFunctionArn}",
      "End": true
    }
  }
}

Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing.

The Real Cost Numbers

Now, here is the part that made me rebuild my mental model of serverless orchestration costs.

I ran 1,000 CSV files through this exact workflow, once with Durable Functions and once with the Step Functions implementation. The approval wait averaged about 20 minutes per document.

| Cost Component | Durable Functions | Step Functions | Difference |
|---|---|---|---|
| Lambda invocations | $0.000358 | $0.001 | -64% |
| Lambda duration | $0.0308 | $0.0179 | +72% |
| State transitions | $0.000 | $0.175 | -100% |
| DynamoDB | $0.003 | $0.003 | 0% |
| S3 operations | $0.010 | $0.010 | 0% |
| TOTAL | $0.044 | $0.207 | -79% |

Source: AWS CloudWatch Metrics

The 79% lower total is driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. The ASL workflow defines 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected, FinalizeJob). At 7 transitions per workflow, 1,000 workflows produce 7,000 transitions, which costs $0.175. That single line item (state transitions) is 84% of the total Step Functions cost.

Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018), because the durable function runs with 1,024 MB of memory (a single function handling all the work), while the Step Functions implementation uses 512 MB per function across five smaller functions.
At scale, the difference adds up quickly:

| Daily Volume | Durable Functions/year | Step Functions/year | Annual Savings |
|---|---|---|---|
| 1,000/day | $16.06 | $75.56 | $59.50 |
| 10,000/day | $160.60 | $755.60 | $595 |
| 100,000/day | $1,606 | $7,556 | $5,950 |

And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers.

Understanding the Replay Model

One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why.

The checkpoint/replay model means each workflow requires a minimum of 2 invocations:

1. Initial invocation: the S3 trigger fires and the function runs generate-job-id → extract → transform → load → submit-for-approval → pause.
2. Resume invocation: the callback is received, the function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step.

So a fully completed run of 1,000 workflows needs at least 2,000 invocations. The actual number was 1,788 because some workflows were still pending approval, and so had not yet triggered their resume invocation, when I collected the metrics over the 24-hour measurement window.

The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed.

Python
job_id = context.step(
    lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-"
              f"{source_key.split('/')[-1]}",
    name="generate-job-id"
)

That is exactly why the job_id generation in our code uses context.step(). Without it, the timestamp would change on every replay.
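To make the replay behavior concrete, here is a toy simulation of checkpointing. It is not the AWS SDK: ToyContext and its step method are hypothetical stand-ins that only mimic the idea described above, where a named step runs once, its result is stored, and every later replay gets the stored value back. A counter plays the role of a clock so the output is predictable.

```python
import itertools


class ToyContext:
    """Hypothetical stand-in for a durable execution context (not the AWS SDK)."""

    def __init__(self, checkpoints):
        # The checkpoint store outlives a single invocation, like a durable log.
        self.checkpoints = checkpoints

    def step(self, func, name):
        # Run the step only if no checkpoint exists; otherwise return the cached result.
        if name not in self.checkpoints:
            self.checkpoints[name] = func(None)
        return self.checkpoints[name]


_clock = itertools.count(1)  # fake clock: yields 1, 2, 3, ... on successive reads


def handler(context):
    # Outside a step: evaluated again on every replay, so the value drifts.
    unsafe_id = f"job-{next(_clock)}"
    # Inside a step: evaluated once, then served from the checkpoint.
    safe_id = context.step(lambda _: f"job-{next(_clock)}", name="generate-job-id")
    return unsafe_id, safe_id


checkpoints = {}
first = handler(ToyContext(checkpoints))   # initial invocation
second = handler(ToyContext(checkpoints))  # resume invocation replays from the top

print(first)   # ('job-1', 'job-2')
print(second)  # ('job-3', 'job-2')
```

The unwrapped ID differs between the two invocations, while the ID produced inside the step is identical on replay, which is the property the real job_id generation relies on.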
Here are other common sources of non-determinism and how to handle them:

| Determinism Issue | Why It Breaks | Solution |
|---|---|---|
| Math.random() | Different value on every replay | Wrap in context.step() |
| Date.now() | Time keeps moving forward | Use context.timestamp or wrap in a step |
| Global variables | Might change between replays | Pass state through function arguments |
| External API calls | Network is a lie | Always wrap in context.step() |
| Iterating over Map or Set | Iteration order can vary by runtime | Use arrays or ensure stable ordering |

When Not to Use Durable Functions

I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story.

Choose Step Functions when:

- Visual debugging matters. The Step Functions execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. AWS also provides visual analysis, monitoring, and debugging for Durable Functions, but that tooling is geared more toward developers.
- Multi-service orchestration. Step Functions has 220+ native AWS service integrations, so you can call DynamoDB, SQS, SNS, ECS, and Glue without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.
- Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive.

Choose Durable Functions when:

- Cost optimization is the priority (79% savings at scale).
- Workflows are Lambda-centric (your logic lives in Lambda code anyway).
- You prefer writing orchestration in Python or TypeScript over Amazon States Language. AWS has also recently released Lambda Durable Functions for Java in developer preview.
- Your logic is complex, and your developers prefer a general-purpose programming language over declarative ASL.

AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services.

Concurrency Planning: A Quick Note

One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand.

In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services; the difference is just where the concurrency demand concentrates.

Wrapping Up

Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting.

The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real, driven by the elimination of state transition charges.

That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here; it depends on what your team values more.
The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.
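For a rough feel of how the savings move with state count before deploying anything, the sketch below plugs the per-1,000-workflow figures from the cost table into a small calculator. The function names are illustrative and not from the repo, and it makes one loud assumption: every cost other than state transitions stays fixed as the number of transitions changes, which will not hold exactly for a real workflow.

```python
# Per-1,000-workflow figures (USD) taken from the cost table in this article.
DURABLE_TOTAL = 0.000358 + 0.0308 + 0.003 + 0.010       # no state-transition charge
STEP_FUNCTIONS_BASE = 0.001 + 0.0179 + 0.003 + 0.010    # everything except transitions
PRICE_PER_1000_TRANSITIONS = 0.025                      # Step Functions rate quoted above
WORKFLOWS = 1000


def transition_cost(transitions_per_workflow: int) -> float:
    """State-transition charge for WORKFLOWS executions."""
    return WORKFLOWS * transitions_per_workflow * PRICE_PER_1000_TRANSITIONS / 1000


def savings_percent(transitions_per_workflow: int) -> float:
    """Percent saved by Durable Functions relative to Step Functions.

    Assumption: all non-transition costs are held constant.
    """
    step_functions_total = STEP_FUNCTIONS_BASE + transition_cost(transitions_per_workflow)
    return 100 * (1 - DURABLE_TOTAL / step_functions_total)


if __name__ == "__main__":
    for transitions in (3, 7, 15):
        print(
            f"{transitions:>2} transitions/workflow: "
            f"transition cost ${transition_cost(transitions):.3f}, "
            f"savings {savings_percent(transitions):.0f}%"
        )
```

With 7 transitions per workflow this reproduces the $0.175 transition charge and the roughly 79% saving reported above; fewer transitions shrink the gap and more transitions widen it.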
May 28, 2026