DevOps and CI/CD Resources

DZone's Featured DevOps and CI/CD Resources

When Snowflake Lies to You: Understanding False Failures in dbt Pipelines

By Janani Annur Thiruvengadam

CORE

The Problem Most Teams Get Wrong Every data engineer has lived this moment. A dbt model fails at 3 AM. You pull up the logs, see a type conversion error, and start digging through SQL. You check recent commits. Nothing changed. You inspect the upstream data. Nothing looks off. You rerun the job. It passes. You shrug, label it a transient issue, and go back to sleep. Then it happens again two weeks later. I want to talk about a specific category of pipeline failure that burns more engineering hours than almost anything else I've seen. It looks real. It carries a real error message, a real stack trace, a failed model, and a timestamp you can point to in your incident log. But no matter how long you stare at the SQL, you will not find the bug. Because there isn't one. I call these false failures: jobs that break not because your logic is wrong, but because your query contains an implicit assumption that the execution engine has been quietly honoring until the moment it decides not to. What This Actually Looks Like in Practice The pattern becomes obvious once someone points it out to you. A model fails with an error referencing a specific data value. A cast that didn't work. A type mismatch. You investigate and find that the offending value has existed in the table for months. It is not new. It did not arrive in last night's load. You rerun the job without changing anything. It passes. This is not a flaky test. It is not Snowflake having a bad day. It is a determinism problem, and it has a precise mechanical cause. Here's how to spot it: The failure is intermittent. It does not reproduce consistently, even in the same environment with the same data. The error references a value that has been present in the source table for a long time. A retry with zero intervention passes cleanly. And if you bother to pull up the Snowflake query profile, you'll notice the execution plan differs between the failing run and the passing one. That last detail is the key to everything. Why Snowflake Makes This Possible Here is something fundamental about Snowflake that most people working with it never fully internalize: it does not guarantee a consistent execution plan between runs of the same query. Snowflake's optimizer is adaptive. It reassesses strategy at runtime based on conditions that shift constantly. Several things influence which plan it picks: Micro-partition metadata gets updated asynchronously after data loads. The same query issued before and after a stats refresh can follow a meaningfully different path through the data. Warehouse size and concurrency affect parallelism thresholds. What gets broadcast-joined on an XS warehouse may be hash-joined on a Medium. The plan changes because the available compute changed. Data volume growth pushes the optimizer across execution thresholds over time. A strategy that worked at 10 million rows may get abandoned entirely at 500 million. Implicit type coercion is where things get dangerous. When two columns of different types meet in a join condition, Snowflake resolves the mismatch at runtime. Which side gets cast, and at what point during execution, can vary by plan. That last one is where most false failures are born. A Real Example: The Join That Only Breaks on Tuesdays Here's a model I've seen variations of in at least a dozen production pipelines: MySQL SELECT o.*, e.event_type FROM orders o LEFT JOIN events e ON o.order_id = e.event_key Looks harmless. But there's a mismatch hiding in plain sight. orders.order_id is typed as NUMBER. events.event_key is typed as VARCHAR. Snowflake allows this. It resolves the mismatch by casting the VARCHAR side to a number at join time. Since the vast majority of rows in the events table contain numeric-looking keys, this works fine. Almost all the time. But buried somewhere in that events table is a single row where event_key = 'INVALID_VAL'. It has been there for months. Nobody noticed because it never caused a problem. Here's why: on most runs, Snowflake's optimizer prunes away the micro-partition containing that row before the cast is ever attempted. The query completes without incident. The problematic value is never touched. Then one day, the optimizer picks a different plan. Maybe the warehouse was busier. Maybe a stats refresh shifted the pruning boundaries. Maybe the table crossed a size threshold. Whatever the cause, that partition gets scanned first this time. The cast is attempted. And the job dies with: Plain Text Numeric value 'INVALID_VAL' is not recognized Same query. Same data. Same code. The error is real. The bug is not. The Diagnostic Shift That Actually Helps The standard debugging instinct here is exactly wrong. You pull up the SQL. You check git blame. You inspect recent loads. You are anchored to the assumption that a code defect exists, and when it doesn't, you waste hours proving a negative. A better question to ask is: what is this query assuming about how the engine will execute it, and is that assumption guaranteed? When you approach a failing run through that lens, the investigation changes completely. Instead of reviewing business logic line by line, you open the query profile and compare the execution plan between the failing run and the last passing one. You look for differences in operator ordering, join strategies, partition pruning behavior, and the point at which type resolution happens. This reframes the diagnosis from "what broke in my code" to "what changed in how the engine chose to run this." That is a different investigation entirely. And it leads to fixes that actually hold. The Fix: Make the Contract Explicit The instinctive response to an intermittent failure is to add a retry. That solves the alert. It does not solve the problem. Worse, it hides the problem by reducing the frequency of visible failures while the underlying fragility quietly grows. The real fix is eliminating the implicit assumption. In the join example above, that means one line: MySQL -- Before: implicit cast, optimizer decides how and when ON o.order_id = e.event_key -- After: explicit cast, behavior is identical across all plans ON o.order_id::VARCHAR = e.event_key This is a small change with an outsized effect. The query no longer depends on the optimizer choosing a pruning strategy that avoids the bad row. Type resolution is now the query's responsibility, not the engine's. Behavior is consistent regardless of warehouse size, concurrency, or data volume. And here's the part that might feel counterintuitive: this fix will surface the data quality issue consistently. Every run will now encounter that INVALID_VAL row and handle it predictably. If it is genuinely bad data, you want to know about it on every run, not discover it randomly once a quarter when the optimizer happens to scan the wrong partition first. Building Pipelines That Don't Depend on Luck Type coercion in joins is the most common source of false failures, but the principle extends further. Anywhere your SQL relies on implicit behavior, behavior the engine provides by convention rather than by contract, you have a latent failure waiting for the right conditions. A few practices that materially reduce this risk in dbt and Snowflake environments: Cast early and cast explicitly. Use your dbt staging models to lock down column types at the source layer. A staging model that casts event_key::VARCHAR explicitly means every downstream model inherits that contract. No one has to guess. No one has to re-cast. Test join columns at the boundary. Add not_null, accepted_values, or custom schema tests on columns that participate in join conditions. These run before your models execute. They catch data quality problems at the source, not when they surface as cryptic execution-layer errors three models downstream. Treat intermittent failures as debt, not noise. Any job that fails occasionally without a corresponding code change is carrying hidden technical debt. Do not normalize it with retries. Schedule a real investigation. The failure rate will increase over time as data grows and execution plans shift more aggressively. Use the query profile before you use git blame. When a failure cannot be explained by code review, the Snowflake query profile is your next stop. Compare failing and passing runs side by side. If the plans diverge meaningfully, you are almost certainly looking at a false failure. Why This Gets Worse Over Time There is a scaling dimension to this problem that makes it urgent rather than merely interesting. At low data volumes, Snowflake's optimizer tends to be more consistent in its plan selection. The search space is smaller. The pruning decisions are more predictable. As tables grow into the hundreds of millions or billions of rows, execution plans shift more aggressively and more frequently. Thresholds get crossed. Statistics change faster. The optimizer explores more alternatives. Every implicit assumption that has been quietly tolerated at small scale becomes increasingly likely to be exposed at large scale. This means the pipeline that "works fine" today with a 2% intermittent failure rate will not stay at 2%. It will drift upward as your data grows, and by the time it becomes a serious operational problem, you will have dozens of models carrying the same class of hidden assumption. The Mental Model Worth Adopting Write your SQL as if the optimizer will always find the most inconvenient execution path. Assume it will scan the partition you hoped it would skip. Assume it will cast the side you didn't expect. Assume the plan will change tomorrow. If your query breaks under those assumptions, the query needs to be more explicit. Not the optimizer more predictable. A query that works because the optimizer happens to avoid a problematic data path is not a correct query. It is a lucky one. And luck is not an engineering strategy. The Takeaway False failures in dbt and Snowflake pipelines are not random. They are not gremlins. They are the predictable result of implicit assumptions meeting a dynamic execution engine that satisfies those assumptions by coincidence rather than by obligation. Recognizing this pattern, and separating it from genuine code bugs, is one of the most valuable diagnostic skills you can develop working in modern cloud data environments. Next time your pipeline fails and the code looks clean: stop auditing the logic. Start auditing the assumptions. Find what your query relies on implicitly. Make it explicit. Build tests that enforce it at the data layer, before the execution engine ever gets the chance to surprise you. Your code was fine. Your contract with the engine wasn't. Now you know the difference. More

Optimizing Databricks Spark Pipelines Using Declarative Patterns

By Seshendranath Balla Venkata

If you've ever inherited a Spark job that runs in 35 minutes and someone asks you to make it faster, you know the routine. You start by checking partition counts, then file sizes, then shuffle stages, then broadcast hints. You find a handwritten OPTIMIZE schedule from 2022, a Z-ORDER on the wrong column, and a cluster sized for last year's data volume. By the time you've made the job fast, you've absorbed three new things to maintain. The next person to inherit it will absorb four. This pattern — call it the hand-tuning treadmill — is what the declarative optimization story on Databricks is trying to break. It's not a single feature; it's a cluster of capabilities that collectively let teams describe what a table should look like and let the engine handle the physical optimizations. What follows is the practical view of those patterns: where they fit, what they replace, and how to migrate without a rewrite weekend. 1. The Hand-Tuning Treadmill: Why Imperative Optimization Doesn't Scale Before getting into the declarative side, it's worth being concrete about what "imperative Spark optimization" actually means in production. The shape is consistent across teams I've audited: Layout decisions frozen on day one. Somebody picks a partition column when the table is created. The data shape changes a year later. Nobody re-partitions because the migration is scary. Query plans drift toward full scans.Maintenance jobs that nobody owns. An OPTIMIZE / Z-ORDER / VACUUM script lives in a notebook scheduled at 3 AM. It runs on a cluster that's slightly mis-sized. When data volume grows, the job runs into the morning workload, and people complain about latency.Cluster sizing as a guess. Worker count is a heuristic from a senior engineer's memory of last year's spike. Half the time it's too big, half the time it's too small, and the cost discussion gets emotional.Hint-driven plans. Broadcast hints, repartition hints, coalesce (N) — sprinkled through pipelines to fix yesterday's problem, kept indefinitely because removing them feels risky. None of these are bugs. They're symptoms of the imperative model: the team owns the layout, the maintenance, the sizing, and the plan tuning. In small pipelines, ownership is fine. At scale, it becomes the bottleneck that the team can't outsource. 2. What "Declarative" Means in the Spark Optimization Context Declarative is a word that gets used in two different ways here, and it's worth pulling them apart. Within Lakeflow pipelines (formerly DLT), it means "describe the tables, not the steps" — the engine builds the DAG and runs it. But in the broader optimization story, declarative also means "describe the desired property of the table or workload, not the operations to maintain it": Layout: I want this table clustered by these columns; figure out when and how to re-cluster.Maintenance: I want this table optimized and vacuumed; figure out the schedule.Ingestion: I want all new files in this path picked up exactly once; figure out checkpointing and listing.Quality: These rows must satisfy these expectations; enforce them and report what gets dropped.Compute: I want this query fast and not wasteful; size and scale appropriately. Each one of those bullets corresponds to a piece of the declarative stack. Used together, they replace a remarkable amount of the boilerplate that has historically lived in Spark pipelines. The mental shift: You stop writing operations against the table and start writing properties of the table. The engine becomes the actor; you become the editor. 3. The Declarative Optimization Stack on Databricks The chart below maps each thing the team declares to the engine capability that handles it, ending at the physical Delta table. It's the picture I draw on whiteboards when teams ask, "What's the order to adopt these in?" Figure 1. The declarative optimization stack: each user-facing intent at the top maps to a continuous engine behavior, which keeps the underlying Delta tables well-clustered, compacted, and statistically up-to-date — without human intervention. Two things are worth highlighting in this picture. First, every box in the engine row is something that runs continuously, not on a cron — there is no daily "optimization window" anymore. Second, the bottom layer is identical to what you'd get from any well-tuned imperative pipeline: 256 MB Parquet files with current statistics. The declarative path doesn't change what good looks like; it changes who does the work to keep things looking good. 4. Layout: Liquid Clustering Replaces Hand-Maintained Z-ORDER Liquid Clustering is the change with the largest practical impact, because partition-key choices are where most lakehouse pipelines accumulate the most technical debt. The declarative version: you specify the columns the data is most often filtered or joined by, and the engine maintains a layout that supports those access patterns — incrementally, as new data arrives, without a full rewrite. When access patterns change, you change the cluster columns, and the engine re-clusters in the background. Defining Liquid-Clustered Tables SQL -- New table, clustered by the columns most commonly filtered on. -- No more PARTITIONED BY, no more guessing at partition cardinality. CREATE TABLE prod.gold.daily_totals ( account_id STRING, region STRING, ingest_date DATE, daily_total DECIMAL(18,2), txn_count BIGINT ) USING DELTA CLUSTER BY (region, ingest_date, account_id); -- Even better: let the engine pick the clustering columns by -- observing real query patterns over time. CREATE TABLE prod.gold.events_clustered USING DELTA CLUSTER BY AUTO AS SELECT * FROM prod.silver.events; Migrating an Existing Partitioned/Z-ORDER Table SQL -- Convert a legacy partitioned table to liquid clustering. -- Existing data files are not rewritten immediately; the engine -- rebalances incrementally on subsequent writes + maintenance. ALTER TABLE prod.silver.transactions CLUSTER BY (account_id, ingest_date); -- Force the first clustering pass for a freshly converted table OPTIMIZE prod.silver.transactions FULL; Why this matters: the recurring 2 AM Slack thread of "can we re-partition this table?" goes away. Layout becomes a property you change with one DDL statement, not a multi-week rewrite project. 5. Maintenance: Predictive Optimization Replaces Cron-Driven OPTIMIZE/VACUUM Predictive optimization is the part that retired the most legacy code in the pipelines I've migrated. Once enabled at the catalog or schema level, the engine monitors each table's read and write patterns and decides on its own when to compact files, re-cluster, vacuum, and refresh statistics. The big win isn't the operations themselves — the imperative pipeline could already run those — it's that the timing is observed-driven, not schedule-driven. Tables that get heavy ingestion get more frequent maintenance. Cold tables get left alone. SQL -- Turn it on at the catalog level once; new tables inherit. ALTER CATALOG prod SET PREDICTIVE OPTIMIZATION = ENABLED; -- Or at the schema level for a phased rollout ALTER SCHEMA prod.gold SET PREDICTIVE OPTIMIZATION = ENABLED; -- Inspect what the engine has been doing on a given table SELECT operation, operation_metrics.numFilesAdded AS files_added, operation_metrics.numFilesRemoved AS files_removed, operation_metrics.numOutputBytes AS output_bytes, timestamp FROM (DESCRIBE HISTORY prod.gold.daily_totals) WHERE userMetadata IS NULL -- engine-driven, not user AND operation IN ('OPTIMIZE', 'VACUUM') AND timestamp >= current_timestamp() - INTERVAL 7 DAYS ORDER BY timestamp DESC; What you should delete after enabling this: the nightly notebook that runs OPTIMIZE on every table in a schema, the VACUUM cron job, the ANALYZE TABLE wrapper, and the alerting that wakes someone up when those jobs run long. None of them are needed anymore, and leaving them on creates duplicate work that the engine and the cron will fight over. 6. Ingestion: Auto Loader Replaces Listing-Based File Detection Auto Loader is the declarative answer to the perennial "which files have we processed already?" problem. Instead of listing a directory, comparing it to a state file, and figuring out the new bits, you describe the source location and the format and let the engine maintain its own incremental state. It uses cloud-native event notifications (S3 events, ADLS notifications, or efficient directory listing as a fallback), and the checkpoint is just another piece of state the engine owns. Python from pyspark.sql.functions import current_timestamp # Streaming ingest from S3 with schema inference + evolution. # Replaces hand-maintained checkpointing, listing logic, and # whatever file-tracking table the team built two years ago. (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaLocation", "s3://acme-checkpoints/txns_schema") .option("cloudFiles.schemaEvolutionMode", "addNewColumns") .load("s3://landing/txns/") .withColumn("_ingest_ts", current_timestamp()) .writeStream .format("delta") .option("checkpointLocation", "s3://acme-checkpoints/txns_writer") .trigger(availableNow=True) # batch-style; runs to completion .toTable("prod.bronze.txns")) Two notes from production. First, schemaEvolutionMode is the option that prevents the silent-data-loss class of bugs when partner schemas change; pick the policy explicitly rather than letting it default. Second, trigger(availableNow=True) gives you batch ergonomics on a streaming source — the job runs until it has consumed everything and exits, which is what most teams actually want for daily ingestion. 7. Transforms and Quality: Declarative Pipelines Replace Bare Spark + External DQ The final piece is the transformation layer. Lakeflow pipelines (the rebrand of Delta Live Tables) let you declare each table as a Python or SQL definition, and add expectations as a first-class concept. The engine derives the DAG from the dependencies and enforces the expectations on every write — the data quality framework, the lineage layer, and the orchestration glue collapse into a single artifact. Python import dlt from pyspark.sql.functions import sum as _sum, col @dlt.table( name="silver_txns", table_properties={ "delta.enableChangeDataFeed": "true", "delta.tuneFileSizesForRewrites": "true", }, cluster_by=["account_id", "ingest_date"], ) @dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL") @dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP')") @dlt.expect("unique_txn", "txn_id IS NOT NULL") def silver_txns(): return (dlt.read_stream("bronze_txns") .dropDuplicates(["txn_id"])) @dlt.table(name="gold_daily_totals") def gold_daily_totals(): return (dlt.read("silver_txns") .groupBy("ingest_date", "account_id", "region") .agg(_sum("amount").alias("daily_total"))) The decorators do four things at once: define the table, declare its layout (cluster_by), declare its quality rules, and let the engine infer that gold_daily_totals depends on silver_txns from the dlt.read call. There is no DAG file. There is no separate Great Expectations suite. Lineage is generated for free in Unity Catalog, including column-level edges. If you want to query how the expectations have been performing — useful for SLO dashboards or alerting — the event log surfaces it directly: SQL -- Pass / fail / drop counts per expectation, last 24 hours SELECT flow_name, details:flow_progress.data_quality.expectations[0].name AS exp_name, details:flow_progress.data_quality.expectations[0].passed_records AS passed, details:flow_progress.data_quality.expectations[0].failed_records AS failed, details:flow_progress.data_quality.expectations[0].dropped_records AS dropped, timestamp FROM event_log("<pipeline-id>") WHERE event_type = 'flow_progress' AND timestamp >= current_timestamp() - INTERVAL 1 DAY ORDER BY timestamp DESC; 8. Putting It Together: Where to Start, What to Measure Adopting all of this at once is a recipe for pain. The order I've seen work, and a small set of metrics to verify the change is paying off: Step Adopt Retire Verify with 1 Predictive optimization at schema level Nightly OPTIMIZE / VACUUM jobs Reduction in maintenance-cluster cost 2 Liquid clustering on top 5 tables Static partitioning + Z-ORDER p95 query latency on the same workloads 3 Auto loader for 1-2 ingestion pipelines Custom file-tracking + listing logic End-to-end data freshness 4 Lakeflow pipelines for new pipelines only External DQ + DAG glue (for new work) Lines of pipeline code per table 5 Serverless compute for SQL warehouses + DLT Hand-sized job clusters Cost-per-query, scale-up time What you do not need to migrate: imperative pipelines that already work and aren't growing. Declarative patterns are about new work and high-pain hot spots, not a heroic rewrite of every notebook ever shipped. 9. Honest Limitations and Where Imperative Still Wins Three places where the declarative model still bites — worth knowing before you commit: Procedural logic still belongs in Jobs. If your pipeline is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table. Don't try to bend dlt around it.Predictive optimization needs observation time. On a table that's a week old, the engine hasn't seen enough patterns to make great decisions. For tables under heavy initial load, an explicit OPTIMIZE FULL after the first big ingest still helps.Cluster-by-column choice still matters. CLUSTER BY AUTO is great for stable workloads with predictable filters. For tables whose access pattern is genuinely heterogeneous across teams, an explicit cluster-by based on the dominant query is usually faster.Hint-driven escapes are still allowed. If a particular query benefits from a /*+ BROADCAST(t) */ hint and AQE isn't catching it, the hint is fine. Just keep them rare and document why. Conclusion The declarative optimization story isn't a single feature you toggle — it's a quiet shift in who owns the boring parts of a Spark pipeline. Layout, maintenance, ingestion bookkeeping, plan tuning, cluster sizing, data quality enforcement: every one of those was traditionally a thing the team owned and paid for in toil. The current Databricks stack lets you express each as an intent and let the engine handle the operations underneath. Adopt them in order, retire what they replace, and the optimization treadmill slows from a daily concern to a quarterly review. That's the actual win, and it's the reason the declarative paradigm has gone from a Lakeflow detail to the default mental model for new pipelines on Databricks. More

Offline-First Patch Management for 10,000 Edge Nodes: A Practical Architecture That Scales

By srinivas thotakura

Event-Driven Pipelines With Apache Pulsar and Go

By Shivi Kashyap

Zero-Downtime Deployments for Java Apps on Kubernetes

By Ramya vani Rayala

Pragmatica Aether: Let Java Be Java

The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. Update the Java version — rebuild and test every service. Change your message broker from RabbitMQ to Kafka — modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool and update dependencies in every microservice. Switch cloud providers — rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level. This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them. Framework lock-in worsens this. It isn’t a vendor problem — it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production — not during development. This is an architecture problem. Architectural problems have architectural solutions. Aether: The Core Idea What you write: an interface annotated with @Slice, plus business logic implementation. Java @Slice public interface OrderService { Promise<OrderResult> placeOrder(PlaceOrderRequest request); static OrderService orderService(InventoryService inventory, PricingEngine pricing) { return request -> inventory.check(request.items()) .flatMap(available -> pricing.calculate(available)) .map(priced -> OrderResult.placed(priced)); } } What you don’t write: everything else. No HTTP clients — inter-slice calls are direct method invocations via generated proxies. No service discovery — the runtime tracks where every slice instance lives. No retry logic — built-in retry with exponential backoff and node failover. No circuit breakers — the reliability fabric handles failure automatically. No serialization code — request/response types are serialized transparently. A method call via an imported interface is the only visible contract. The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation — it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly. Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level. The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties — performance, latency, memory consumption — but as long as they’re compatible with the interface, business logic works unchanged. You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently. Under The Hood: What Makes It Work Five architectural decisions make this possible. Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm was published in 2021. Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide. Built-in Artifact Repository. DHT-based storage with configurable replication — 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained. ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices. Declarative Deployment. Blueprints — TOML files — describe the desired state: which slices, how many instances. TOML id = "org.example:commerce:1.0.0" [[slices]] artifact = "org.example:inventory-service:1.0.0" instances = 3 [[slices]] artifact = "org.example:order-processor:1.0.0" instances = 5 Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically. Infrastructure Independence. Aether nodes are identical — there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java — roll it out across nodes without touching applications. Update the Aether runtime — same. Update business logic — deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule. Fault Tolerance: The 50% Rule The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact — actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires (N/2) + 1 nodes — as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically — the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence — from failure detection through state restoration to serving traffic — completes without human intervention. When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required. Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing: SQL aether update start org.example:order-processor 2.0.0 -n 3 aether update routing <id> -r 1:3 # 25% to v2, 75% to v1 aether update routing <id> -r 1:1 # 50/50 aether update complete <id> # 100% to v2, drain v1 Deploy during business hours. Shift traffic gradually — 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step. If health degrades — error rate exceeds thresholds, latency spikes — instant rollback with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry. For Every Project: Legacy, Greenfield, And Everything Between Legacy Migration Your legacy Java system doesn’t need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system — something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation: Java private Promise<Report> generateReport(ReportRequest request) { return Promise.lift(() -> legacyReportService.generate(request)); } One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk — the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change — and that's when the 50% rule starts to apply. From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious. No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail — from initial wrapping through gradual peeling to clean JBCT code. Greenfield Development For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method — and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service. You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices — each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default. JBCT patterns — Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects — compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them. The Spectrum Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient. The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility. Scaling: Two Levels, Three Tiers of Intelligence Two-Level Horizontal Scaling Aether scales in two dimensions independently: Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded—scaling takes milliseconds, not seconds.Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work. Independent controls, combined effect. Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. Or start fresh with maximum clarity — lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist — legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought — it's the foundation. Scaling is not your problem — it's the environment’s. Infrastructure is not your code — it's the runtime’s. The heavy winter coat comes off. The application breathes. Resources Pragmatica Aether—project siteGitHub Repository—source code

By Sergiy Yevtushenko

Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka

After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: Java { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Production concerns addressed: Key-based partitioning: Using orderId as the key ensures all events for the same order go to the same partition, maintaining orderingAsync publishing with callbacks: Non-blocking publish with explicit success/failure handlingStructured logging: Captures partition and offset for troubleshooting Kafka Consumer With Idempotency At-least-once delivery means duplicates are possible. Consumers must deduplicate: Java @KafkaListener(topics = "orders.order-created.v1", groupId = "billing-service") public void onOrderCreated(OrderCreated event) { // Check if already processed if (processedEventsRepository.existsByEventId(event.getEventId())) { log.debug("Skipping duplicate event: {}", event.getEventId()); return; } // Process event billingService.createInvoice( event.getOrderId(), event.getCustomerId(), event.getItems() ); // Mark as processed processedEventsRepository.save( new ProcessedEvent(event.getEventId(), Instant.now()) ); } Idempotency pattern: Check eventId before processing, store it after processing. If the same event arrives twice, the second one is ignored. Architecture Diagram: Contract-First Flow After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: JSON { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Architecture Diagram: Contract-First Flow Figure 1: Contract-first development flow showing parallel team development Note: Solid arrows represent generation or implementation flow. Dashed arrows represent validation relationships. Contract Type 3: Database Contracts With Flyway Contract Type 3: Database Contracts With Flyway Database schemas are contracts, too. Flyway migrations let you version and evolve schemas with the same contract-first approach. Base Schema Migration File: V1__create_orders.sql SQL CREATE TABLE orders ( id VARCHAR(32) PRIMARY KEY, customer_id VARCHAR(32) NOT NULL, status VARCHAR(16) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now() ); CREATE TABLE order_items ( order_id VARCHAR(32) NOT NULL REFERENCES orders(id), sku VARCHAR(64) NOT NULL, quantity INT NOT NULL CHECK (quantity > 0), PRIMARY KEY (order_id, sku) ); CREATE INDEX idx_orders_customer_id ON orders(customer_id); Schema Evolution: Expand/Migrate/Contract When you need to add a field without breaking old code, use the expand/migrate/contract pattern: Expand (V2__add_order_source.sql): SQL ALTER TABLE orders ADD COLUMN source VARCHAR(32); Migrate (application code): SQL -- Backfill for existing rows UPDATE orders SET source = 'UNKNOWN' WHERE source IS NULL; Contract (V3__enforce_source_not_null.sql): SQL -- After all systems produce source ALTER TABLE orders ALTER COLUMN source SET NOT NULL; This three-phase approach lets you deploy schema changes without downtime. CI/CD Enforcement: Making Contracts Real Contracts only work if they're enforced. Here are the CI gates that prevent breaking changes: REST API: OpenAPI Breaking Change Detection Run openapi-diff in CI to catch breaking changes: YAML # .github/workflows/api-contract-check.yml - name: Check for API breaking changes run: | npx openapi-diff \ main:contracts/openapi/orders-api.v1.yaml \ HEAD:contracts/openapi/orders-api.v1.yaml \ --fail-on-breaking This fails the build if you: Remove required fieldsChange field typesRemove endpointsChange HTTP status codes Kafka: Schema Registry Compatibility Check Configure Schema Registry to enforce backward compatibility: YAML # Schema Registry configuration confluent.schema.registry.url=http://schema-registry:8081 spring.kafka.properties.schema.registry.url=http://schema-registry:8081 # Enforce backward compatibility spring.kafka.producer.properties.auto.register.schemas=true spring.kafka.producer.properties.use.latest.version=true When you try to publish an incompatible schema, the producer fails at startup, before reaching production. Database: Flyway Validation Flyway validates migrations on startup: Properties files spring.flyway.validate-on-migrate=true spring.flyway.baseline-on-migrate=false If someone manually modified the database schema, Flyway detects the mismatch and fails the deployment. Integration Flow Diagram Figure 2: End-to-end flow showing REST → DB → Kafka integration with contracts enforced at each boundary Real-World Benefits: Production Implementation Experience After implementing contract-first across multiple services (orders, billing, inventory), teams consistently observe significant improvements: Integration quality: Substantial reduction in integration bugs reaching productionMost remaining bugs are business logic issues, not contract mismatchesClear contracts eliminate ambiguity in integration expectations Development velocity: Integration cycles measured in weeks rather than monthsConsumer teams start integration work immediately instead of waiting for provider completionMock servers enable realistic integration testing without coordination delays Operational reliability: CI-enforced contract validation prevents breaking changes from reaching productionBreaking changes caught during PR review, not after deploymentAutomated validation ensures compatibility before merge Documentation accuracy: Documentation generated from OpenAPI specs stays synchronized with implementation by designSwagger UI always reflects actual API behavior as both derive from the same contractNo manual documentation maintenance required Common Pitfalls and How to Avoid Them Pitfall 1: Treating Contracts as Documentation Wrong approach: Write code first, generate OpenAPI from annotations later.Right approach: Write OpenAPI first, generate server stubs, or validate implementation against it. Pitfall 2: Skipping CI Validation Wrong approach: Trust developers to manually check compatibility.Right approach: Automate breaking change detection in CI. Make it a required check. Pitfall 3: No Schema Evolution Strategy Wrong approach: Add fields without defaults, breaking old consumers.Right approach: All new fields must be optional or have defaults. Test with multiple schema versions. Pitfall 4: Ignoring Idempotency Wrong approach: Assume exactly-once delivery in Kafka.Right approach: Design consumers to deduplicate by eventId. Assume at-least-once. Pitfall 5: Coupling Contracts to Implementation Wrong approach: Expose internal database schema directly in API contracts.Right approach: Design contracts for consumers, not internal convenience. Use DTOs that map between the external contract and the internal model. When Contract-First Makes Sense Contract-first isn't always the answer. Use it when: ✅ Multiple teams integrate: The coordination cost justifies the upfront contract design effort✅ Public APIs or partner integrations: External consumers need stability and clear documentation✅ Microservices architecture: Services must evolve independently without breaking dependents✅ High change frequency: CI validation catches breaking changes early when changes are frequent Skip contract-first when: ❌ Single small team: Coordination overhead is low, so formal contracts add friction❌ Prototyping: You're exploring the problem space and expect major pivots❌ Internal tools with one consumer: The provider and consumer are maintained by the same person Key Takeaways Contracts enable parallel development: Provider and consumer teams work simultaneously instead of seriallyCI validation prevents breaking changes: Automate OpenAPI diffs, schema compatibility checks, and migration validationIdempotency is not optional: At-least-once delivery in Kafka requires consumers to deduplicate by event IDSchema evolution requires backward compatibility: Use nullable fields with defaults to support gradual rolloutsContracts are design specifications, not afterthoughts: Define contracts first, generate code from them What's Next? If you're implementing contract-first integration: Start with one integration, pick your most painful cross-team dependencyWrite the OpenAPI spec or Avro schema first, before any implementationSet up CI validation for that contract (openapi-diff or Schema Registry)Measure reduction in integration bugs and coordination overheadExpand to other integrations once you've proven the pattern Contract-first isn't a silver bullet, but it transforms integration from a coordination nightmare into a governance problem that CI can solve. When you're coordinating three teams across two time zones, that shift makes the difference between shipping in weeks versus months. Full source code: github.com/wallaceespindola/contract-first-integrations Related reading: OpenAPI 3.0 Specification: spec.openapis.orgConfluent Schema Registry: docs.confluent.io/platform/current/schema-registryFlyway Documentation: flywaydb.org/documentationMastering Contract-First API Development: moesif.com/blogResearch on Microservices Issues: arxiv.org Need more tech insights? Check out my GitHub, LinkedIn, and Speaker Deck. Happy coding!

By Wallace Espindola

Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions

When AWS announced Lambda Durable Functions at re: Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test. I decided to build a simple document processing workflow, an ETL pipeline with human-in-the-loop approval using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system. What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet. In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging. The Problem: Approval Workflows Are Expensive If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours. Traditional approaches to handling this waiting are: Polling: Your code keeps checking every 30 seconds — "Is it approved yet? How about now?" making those calls the entire time.Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers — works fine, but it requires you to maintain a state machine you built yourself. What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off. That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern. What We Are Building Here is the workflow we will implement: Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file. The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute. Architecture Overview The project uses three separate SAM stacks: Markdown shared-resources/ # Approval API, DynamoDB, SNS (shared by both systems) durable-functions/ # Lambda Durable Functions ETL pipeline step-functions/ # Step Functions ETL pipeline (for comparison) The shared approval handler serves for both workflow types using a single API. When a job comes in for approval, it checks the workflowType field, and if it is durable-functions, it calls send_durable_execution_callback_success.If step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood. Prerequisites Before we begin, make sure you have the following: AWS SAM CLI (latest version recommended)Python 3.14 runtime AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway accessDocker for local Lambda testing Check your SAM CLI version: Markdown sam --version Step 1: Deploy Shared Resources First Before the ETL pipeline, we need the shared infrastructure — the approval API, DynamoDB table for pending approvals, and SNS topic for notifications. Here is the shared-resources/ SAM template: YAML # shared-resources/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Shared resources for ETL approval workflow Parameters: ApproverEmail: Type: String Description: Email address to receive approval notifications Default: [email protected] Resources: PendingApprovalsTable: Type: AWS::DynamoDB::Table Properties: TableName: etl-pending-approvals BillingMode: PAY_PER_REQUEST AttributeDefinitions: - AttributeName: jobId AttributeType: S KeySchema: - AttributeName: jobId KeyType: HASH TimeToLiveSpecification: AttributeName: ttl Enabled: true ApprovalNotificationTopic: Type: AWS::SNS::Topic Properties: TopicName: etl-approval-notifications Subscription: - Endpoint: !Ref ApproverEmail Protocol: email ApprovalApi: Type: AWS::Serverless::Api Properties: Name: ETL-Approval-API StageName: prod ApprovalHandlerFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETL-Approval-Handler CodeUri: ./src Handler: approval_handler.handler Runtime: python3.14 MemorySize: 256 Timeout: 30 Environment: Variables: APPROVALS_TABLE: !Ref PendingApprovalsTable Policies: - DynamoDBCrudPolicy: TableName: !Ref PendingApprovalsTable - Version: '2012-10-17' Statement: - Effect: Allow Action: - states:SendTaskSuccess - states:SendTaskFailure - lambda:SendDurableExecutionCallbackSuccess - lambda:SendDurableExecutionCallbackFailure Resource: '*' Events: ApproveJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /approve/{jobId} Method: POST RejectJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /reject/{jobId} Method: POST GetJobStatus: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /status/{jobId} Method: GET Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach, one API that works with both workflow types. Deploy it: Markdown cd shared-resources sam build sam deploy --guided Step 2: The Durable Functions SAM Template Now the ETL pipeline itself for the Duration Functions. The key addition is the DurableConfig property. The DurableConfig property tells Lambda to enable durable execution for your function. YAML # durable-functions/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Lambda Durable Functions ETL Pipeline Globals: Function: Runtime: python3.14 Architectures: - arm64 MemorySize: 512 Timeout: 900 Resources: ETLOrchestratorFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETLDurableOrchestrator CodeUri: ./src Handler: handlers/etl_handler.lambda_handler MemorySize: 1024 Timeout: 900 DurableConfig: ExecutionTimeout: 86400 # 24 hours for human approval RetentionPeriodInDays: 14 # Keep execution history for debugging AutoPublishAlias: live Policies: - AWSLambdaBasicExecutionRole - S3CrudPolicy: BucketName: !Sub "${RawBucketName}-${AWS::AccountId}" - S3CrudPolicy: BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}" - DynamoDBCrudPolicy: TableName: !Ref ETLMetadataTable - DynamoDBCrudPolicy: TableName: etl-pending-approvals - SNSPublishMessagePolicy: TopicName: etl-approval-notifications Events: S3Upload: Type: S3 Properties: Bucket: !Ref RawDataBucket Events: s3:ObjectCreated:* Filter: S3Key: Rules: - Name: prefix Value: uploads/ - Name: suffix Value: .csv Environment: Variables: PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}" METADATA_TABLE: !Ref ETLMetadataTable APPROVALS_TABLE: etl-pending-approvals APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl A few things to notice here: MemorySize: 1024 on the orchestrator (overrides the 512 MB global default). Since this single function does all the work, it needs more memory.ExecutionTimeout: 86400 – This is the total workflow duration across all invocations (24 hours). The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.AutoPublishAlias: live – AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.S3 filter with prefix: uploads/ and suffix: .csv – Only CSV files under the uploads/ directory trigger the workflow.The stack imports shared resources via !ImportValue the approval table, SNS topic, and API URL from the shared stack. Step 3: Writing the Durable Function This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code. First, the individual ETL steps. Each one is a regular Python function in a separate file: Extract Python import csv import io import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def extract_data(source_bucket, source_key, step_context=None): logger.info(f"Extracting from s3://{source_bucket}/{source_key}") response = s3_client.get_object(Bucket=source_bucket, Key=source_key) content = response["Body"].read().decode("utf-8") reader = csv.DictReader(io.StringIO(content)) records = list(reader) schema = { "columns": reader.fieldnames, "source_file": source_key, "file_size_bytes": response["ContentLength"] } logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns") return {"data": records, "record_count": len(records), "schema": schema} Transform Python import logging from datetime import datetime logger = logging.getLogger() def transform_data(raw_data, schema_config, step_context=None): logger.info(f"Transforming {len(raw_data)} records") valid_records, rejected_records = [], [] for i, record in enumerate(raw_data): try: cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()} if not cleaned.get("id") or not cleaned.get("name"): rejected_records.append({"index": i, "reason": "Missing required field"}) continue if "date" in cleaned: cleaned["date"] = normalize_date(cleaned["date"]) cleaned["_processed_at"] = datetime.utcnow().isoformat() for key in ["amount", "quantity", "price"]: if key in cleaned and cleaned[key]: try: cleaned[key] = float(cleaned[key]) except ValueError: cleaned[key] = None valid_records.append(cleaned) except Exception as e: rejected_records.append({"index": i, "reason": str(e)}) return { "data": valid_records, "valid_records": len(valid_records), "rejected_records": len(rejected_records), "rejection_details": rejected_records[:100] } def normalize_date(date_str): for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]: try: return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d") except ValueError: continue return date_str Load Python import json import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def load_data(transformed_data, target_bucket, target_key, step_context=None): logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}") output_lines = "\n".join(json.dumps(r) for r in transformed_data) s3_client.put_object( Bucket=target_bucket, Key=target_key, Body=output_lines.encode("utf-8"), ContentType="application/jsonl", Metadata={"record_count": str(len(transformed_data))} ) summary = { "record_count": len(transformed_data), "columns": list(transformed_data[0].keys()) if transformed_data else [], "sample_records": transformed_data[:3] } return {"target_path": f"s3://{target_bucket}/{target_key}", "record_count": len(transformed_data), "summary": summary} Notice the steps are plain Python functions — no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context. Now the main ETL orchestrator that ties it all together: Python import json import os import logging from datetime import datetime from aws_durable_execution_sdk_python import durable_execution, DurableContext from steps.extract import extract_data from steps.transform import transform_data from steps.load import load_data from steps.finalize import finalize_job logger = logging.getLogger() logger.setLevel(logging.INFO) PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET") METADATA_TABLE = os.environ.get("METADATA_TABLE") @durable_execution def lambda_handler(event, context: DurableContext): # Handle both S3 event format and direct invocation if "Records" in event: s3_event = event["Records"][0]["s3"] source_bucket = s3_event["bucket"]["name"] source_key = s3_event["object"]["key"] else: source_bucket = event.get("bucket") source_key = event.get("key") # Generate job_id deterministically using context.step() job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) context.logger.info(f"Starting ETL job: {job_id}") # Step 1: Extract extracted = context.step( lambda _: extract_data(source_bucket, source_key, None), name="extract-data" ) context.logger.info(f"Extracted {extracted['record_count']} records") # Step 2: Transform transformed = context.step( lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None), name="transform-data" ) # Step 3: Load loaded = context.step( lambda _: load_data(transformed["data"], PROCESSED_BUCKET, f"processed/{job_id}/output.jsonl", None), name="load-data" ) # --- EXECUTION PAUSES HERE --- # The submitter function stores the callback_id in DynamoDB # and sends an SNS notification to the reviewer. # No compute charges while waiting for approval. def submit_for_approval(callback_id: str, ctx): return notify_reviewer(job_id, callback_id, loaded["summary"]) approval = context.wait_for_callback( submitter=submit_for_approval, name="quality-check-approval" ) # Parse approval result if isinstance(approval, str): approval = json.loads(approval) if not approval or not approval.get("approved"): return {"status": "REJECTED", "job_id": job_id, "reason": approval.get("reason", "No reason")} # Step 4: Finalize (only runs after approval) final = context.step( lambda _: finalize_job(job_id, source_bucket, source_key, loaded, approval, METADATA_TABLE, None), name="finalize-job" ) return { "status": "COMPLETED", "job_id": job_id, "records_processed": transformed["valid_records"], "output_path": loaded["target_path"], "approved_by": approval.get("reviewer"), "completed_at": final["completed_at"] } Let me break down the important parts: @durable_execution – This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.context.step(lambda _: ..., name="...") – Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.context.wait_for_callback(submitter=..., name="...") – This is the zero-cost waiting magic. The submitter function receives a callback_id which gets stored in DynamoDB. Execution then pauses completely — Lambda saves the state, shuts down, and you stop paying.Determinism matters – Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently. The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. Step 4: The Shared Approval Handler When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution: Python # shared-resources/src/approval_handler.py (key excerpt) if workflow_type == 'durable-functions': callback_id = approval_record.get('callbackId') if approved: lambda_client.send_durable_execution_callback_success( CallbackId=callback_id, Result=json.dumps(approval_response) ) else: lambda_client.send_durable_execution_callback_failure( CallbackId=callback_id, Error='JobRejected', Cause=reason or 'Job rejected by reviewer' ) elif workflow_type == 'step-functions': task_token = approval_record.get('taskToken') if approved: stepfunctions.send_task_success( taskToken=task_token, output=json.dumps(approval_response) ) Same API, same reviewer experience — the underlying callback mechanism is the only thing that differs. Step 5: Deploy and Test Deploy in order (shared resources first, since the other stacks import from it): Markdown # 1. Deploy shared resources cd shared-resources sam build && sam deploy --guided # 2. Deploy Durable Functions cd ../durable-functions sam build && sam deploy --guided Generate test data: Markdown python scripts/generate_test_data.py --count 10 --output test-data/ Upload files to trigger the workflow (note the uploads/ prefix — the S3 filter requires it): Markdown aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive Check approval status and approve: Markdown # Check status curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id> # Approve curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \ -H "Content-Type: application/json" \ -d '{"reviewer": "harpreet", "reason": "Data looks good"}' For bulk approvals during testing, the repo includes a handy script: Markdown ./scripts/approve_all_jobs.sh For local testing, the testing SDK supports pytest: Markdown pip install aws-lambda-durable-execution-sdk-testing pytest durable-functions/tests/ Step 6 (Optional): Deploy Step Functions for Comparison If you want to reproduce my full comparison, deploy the Step Functions stack too: Markdown cd step-functions sam build && sam deploy --guided Here is what the same workflow looks like in Amazon States Language: JSON { "StartAt": "ExtractData", "States": { "ExtractData": { "Type": "Task", "Resource": "${ExtractFunctionArn}", "ResultPath": "$.extractResult", "Next": "TransformData" }, "TransformData": { "Type": "Task", "Resource": "${TransformFunctionArn}", "ResultPath": "$.transformResult", "Next": "LoadData" }, "LoadData": { "Type": "Task", "Resource": "${LoadFunctionArn}", "ResultPath": "$.loadResult", "Next": "WaitForApproval" }, "WaitForApproval": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "${ApprovalFunctionArn}", "Payload": { "taskToken.$": "$$.Task.Token", "jobId.$": "$.loadResult.job_id", "summary.$": "$.loadResult.summary" } }, "TimeoutSeconds": 86400, "ResultPath": "$.approvalResult", "Next": "CheckApproval" }, "CheckApproval": { "Type": "Choice", "Choices": [{ "Variable": "$.approvalResult.approved", "BooleanEquals": true, "Next": "FinalizeJob" }], "Default": "JobRejected" }, "JobRejected": { "Type": "Pass", "Result": { "status": "REJECTED" }, "End": true }, "FinalizeJob": { "Type": "Task", "Resource": "${FinalizeFunctionArn}", "End": true } } } Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing. The Real Cost Numbers Now, here is the part that made me rebuild a mental model I had about serverless orchestration costs. I ran 1,000 CSV files through this exact workflow — both with Durable Functions and with the Step Functions implementation. The approval wait averaged about 20 minutes per document. Cost ComponentDurable FunctionsStep FunctionsDifferenceLambda invocations$0.000358$0.001-64%Lambda duration$0.0308$0.0179+72%State transitions$0.000$0.175-100%DynamoDB$0.003$0.0030%S3 operations$0.010$0.0100%TOTAL$0.044$0.207-79% Source: AWS CloudWatch Metrics The total cost, which is 79% cheaper, is mainly driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. ASL workflow has 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected/FinalizeJob). For 1,000 workflows, that is 7,000 transitions, which costs $0.175. That single line (state transition) item is 84% of the total Step Functions cost. Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018) because the durable function runs with 1,024 MB memory (single function handling all work) compared to Step Functions using 512 MB per function across five smaller functions. At scale, the difference adds up quickly: Daily VolumeDurable Functions/yearStep Functions/yearAnnual Savings1,000/day$16.06$75.56$59.5010,000/day$160.60$755.60$595100,000/day$1,606$7,556$5,950 And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers. Understanding the Replay Model One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why. The checkpoint/replay model means each workflow requires a minimum of 2 invocations: Initial invocation — S3 trigger fires, function runs generate-job-id → extract → transform → load → submit-for-approval → pauseResume invocation — Callback received, function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step So the theoretical minimum is 2,000 invocations for 1,000 workflows. The actual number was 1,788 because some workflows were still pending approval when I collected the metrics over the 24-hour measurement window. The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed. Python job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) That is exactly why the job_id generation in our code uses context.step().Without it, the timestamp would change on every replay. Here are some other examples where your code must be deterministic and how to avoid that: Deterministic IsssuesWhy It BreaksSolutionMath.random()Different value on every replayWrap in context.step()Date.now()Time keeps moving forwardUse context.timestamp or wrap in a stepGlobal variablesMight change between replaysPass state through function argumentsExternal API callsNetwork is a lieAlways wrap in context.step()Iterating over Map or SetIteration order can vary by runtimeUse arrays or ensure stable ordering When Not to Use Durable Functions I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story. Choose Step Functions when: Visual debugging matters. The step function state machine execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. With Durable functions, AWS did provide visual analysis, monitoring, and debugging as well, but its little more developer-friendly. Multi-service orchestration. Step Functions has 220+ native AWS service integrations. DynamoDB, SQS, SNS, ECS, and Glue without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive. Choose Durable Functions when: Cost optimization is the priority (79% savings at scale)Workflows are Lambda-centric (your logic lives in Lambda code anyway)You prefer writing orchestration in Python/TypeScript/over Amazon States Language. AWS just now released Lambda Duration functions with Java in developer preview.Your logic is complex, and dynamic programming language is preferred by developers over the declarative ASL. AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services. Concurrency Planning — A Quick Note One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand. In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services — the difference is just where the concurrency demand concentrates. Wrapping Up Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting. The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real, which is driven entirely by the elimination of state transition charges. That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here, and it depends on what your team values more. The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.

By Harpreet Siddhu

Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery. The role of the enterprise developer has become more complex over time as organizations adopt new technologies and tools, often without retiring their old ones. Add high staff turnover and increasing time and cost pressure, and developers are confronted with charting their own path through the SDLC. The purpose of internal developer platforms (IDPs) is to create a win-win scenario that benefits developers and their organizations. In this tutorial, you’ll define one golden path for a backend service that covers service setup, deployment, observability, and guardrails end to end. Step 1: Define the Platform Product and First Golden Path Successful IDP efforts focus on end-to-end developer workflows: building a new interface, deploying an updated microservice, running a regression suite, or standing up an environment. Ideally, the whole workflow can be supported directly from your IDP as self-service. Once you have identified the workflow to support, you need to design the “golden path,” which parts you will standardize and what you expose as configuration. It’s important to get that balance right. Components that have to change often, like service accounts, interfaces, and sizing, should be configurable. Creating templates and patterns helps reduce variability between outputs, making it easier to roll out necessary patching and updates. For the first golden path, pick one high-value workflow that is common, repeatable, and easy to measure. We will use the deployment of our backend service to an integration test environment because it touches build, deployment, validation, and evidence capture in one flow. User adoption is the key to success. To measure, it’s important to track both user adoption, such as how often a workflow is triggered, and outcome metrics like the number of compliant application instances, percentage of deployment failures, and average deployment duration. Step 2: Design the Golden Path (Templates and Defaults) Next, we get to design the golden path. An important factor for the developer experience is to provide documentation with contextual guidance. This can be traditional how-to guides or more advanced features such as AI-enabled chatbots. The documentation should explain how testing, application deployments, and other lifecycle activities happen along the golden path, and provide architectural guidance on embedding any newly developed capability in the existing architecture. Standards and governance are other aspects that should be available for self-service, including naming conventions, common libraries, and reusable services. On the technical side, the golden path should cover at least the following: Code repo and standard branching structureSkeleton code based on coding standards (e.g., environment config file, logging framework, data layer)CI/CD pipeline into an ephemeral cloud environment, or pointed at a standard persistent dev environmentSkeleton quality gates in the CI/CD pipeline (e.g., unit test, functional regression, security scan)Access to common utilities; injection of environment values (e.g., URLs, IP addresses, access and secrets management)Ability to spin up the environment (if cloud based) And lastly, the IDP needs to be designed with intuitive naming, a search function, tagging methods, and a hierarchical browsing structure so users can easily find the appropriate golden path. Supporting multiple ways of discovery provides a more resilient interface and eases the adoption of new golden path templates as they become available. For our backend service, choosing the workflow will show a representation of the steps included. Step 3: Wire Self-Service Workflows (Without Tickets) Besides golden path templates, IDPs should aim to be a one-stop shop for developers, so common requests should be available for self-service. Your existing ticket/ITSM systems can be a good source for creating the backlog. Identify the most common requests and start automating them in priority order. In many cases, a ticket continues to be useful even in the self-service model for tracking and approvals, which can be integrated into the automatic workflow. Approvals should be provided automatically based on defined criteria, and only require human approvals when the request is outside of those parameters, such as access to restricted data, use of expensive resources, and non-standard requests. Over time, developers should be able to request new features through a transparent feature backlog and voting mechanism to engage the community. When creating new features, keep things common wherever possible and provide ways for users to tailor their requests. For example, the standard deployment process might define a step for secrets injection, but some teams will tailor the process to skip it as necessary. This approach has two advantages: It creates a common language and process across teams and reduces the work to build and maintain the IDP. Spending a bit more time up front to create customizability pays off over the medium and long term. For our backend service, the first service we define is deployment to the integrated test environment. Step 4: Standardize Delivery With CI/CD + GitOps + IaC in One Flow The principle of the golden path deployment process remains unchanged: You build a software artifact once, and you deploy it multiple times along the environment path. For our backend service, promotion should happen through a versioned change (think GitOps) to the desired environment state, so application version, infrastructure definition, and deployment evidence remain traceable together. In the build stage, code is prepared in any pre-compile steps, then compiled and packaged with all necessary configuration files. In the deployment process, environment variables are injected, and the package is deployed to the target environment, which is scripted as Infrastructure as Code. The validation itself is usually layered: a technical validation to confirm that the deployment was correct, functional regression of core functionality, and testing the new changes. This sequence is based on speed of feedback, which is important in an automated IDP service. When a validation check fails, the golden path needs to have defined failure behavior with clear steps to execute. Pipeline failures like a broken build, failed test, or policy violation will block progression automatically. If the environment is materially impacted, a rollback is automatically initiated. Only in rare cases should a human evaluation be required — for example, when the level of ambiguity is too high and impacts stakeholders who are using the environment. Some policy violations can be treated with time-bound exceptions, such as allowing a new security vulnerability in a non-production environment. This allows functional testing to continue while the team remediates the security vulnerability. Prior to going live, the exception would be removed so the security vulnerability doesn’t progress to production. These types of exceptions should be set to auto-expire to prevent them from being forgotten later. Golden Path Steps and Guardrails stepself-service actionguardrailevidence Build Trigger pipeline via check-in action in source control Code scan and unit test results Build log, composition scan result Promote to non-prod environment Merge to staging branch, promotion request Technical validation, regression test Test results Promote to prod Promotion request Approval and compliance check Approval and audit trail Rollback Automated trigger or manual request Post-rollback validation and regression test Test results Step 5: Bake in Operability for Observability and Day-2 Readiness IDPs reduce cognitive load and toil as solutions to common concerns are built in. This is especially true for the operational concerns. Each workflow and self-service feature creates the log files and traces for auditability. All code and configuration are driven from version control, and the metrics recorded provide insights into the outcomes and performance of the IDP. New operational initiatives, like introducing a software bill of materials, can be rolled out across all technologies that use the IDP. When done correctly, templates can be updated centrally, and the log files provide full auditability to identify where old versions are still in use, reducing the overall security exposure. The IDP governance model needs to define the ownership of templates and any inheritance rules. For instance, some teams will tailor the template by adding additional steps required for their technology. Alongside the IDP instrumentation, standard dashboards and alert definitions ship with the template, pre-wired to the appropriate ownership group. Who responds to what is documented, not assumed. Runbooks and escalation paths are stored in version control alongside the service itself so they evolve with the system rather than rotting in a forgotten wiki page. Our backend service will include the following with the golden path: Logs, metrics, and tracesAlertsRunbook linkOwnership metadata The final piece is the feedback loop. Incidents, near-misses, and recurring friction points are resolved and also used to help continuously improve the platform, first becoming a backlog item. Step 6: Add Guardrails and Governance Without Slowing Delivery The IDP should leverage approved templates where possible and embed basic compliance and policy checks in the workflows. Platform developers will receive immediate feedback on any problems they need to fix. When issue resolution requires a longer time, time-bound exceptions can be allowed. Along the environment path from development to production, the quality gates should become more restrictive as the software quality improves. For our backend service, we define security scanning prior to deployments, and we don’t accept any deviations from the corporate standard for it. We follow a simple block, warn, escalate paradigm. The goal is to address problems that teams can deal with immediately and provide enough time for more complex work. This balance allows work to flow at pace. It is important to version templates and workflows so you can track what is in use. When significant problems are identified with a version, you can use the IDP logs to find any items in use and replace them quickly. Having the right guardrails in place might feel restrictive but in fact reduces the amount of rework over time as there are fewer incidents. Fast feedback reduces the time it takes to resolve problems. Step 7: Measure Adoption, DevEx, and Platform ROI One of the key success factors for IDPs is having the ability to measure adoption (covered earlier), developer experience, and platform ROI (e.g., DORA, SPACE). This allows you to break down and distinguish between adoption measures and outcome metrics. Implementing these criteria in the platform from the beginning captures data systematically. Good adoption measures to start with: number of executed workflows, number and currency of templates, and number of active users. The following outcome metrics can also be used as part of the business case for IDPs: deployment failure rate, MTTR, incident volumes, number of tickets, and security vulnerabilities. The team managing the IDP should actively use the metrics together with captured feedback from the user base (e.g., feature requests) to prioritize the backlog. Executive dashboards should be implemented to provide accountability and increase support across the organization. A Minimal IDP You Can Scale Bringing it together, take the following actions to kick-start your internal developer platform: Choose a common and not too complex workflow for your first golden pathCreate the code repository and CI/CD pipelineDefine a self-service UI for the workflowEmbed quality gates, metrics, and operational tooling into the workflow Start with one workflow for one pilot team, prove the path, then extend to the next workflow or team. Don’t forget to engage with the pilot users to receive feedback and support adoption. If you want to dive deeper, explore the CNCF Platforms for Cloud-Native Computing whitepaper and Platform Engineering Maturity Model. This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.Read the Free Report

By Mirco Hering

CORE

Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel

In this article, I will discuss a highly available solution developed using Spring Boot 3 and Spring Security 6 to address the "centralized authentication method" problem frequently seen in modern microservice ecosystems. We are not simply moving to an "authorization service"; we are examining the cache-first pattern, which minimizes DB usage, and the Redis Sentinel enhancement, which guarantees system persistence. Why a Separate Authentication Service? While embedding security into each service is an option in microservices, I have always found it more logical to proceed with a centralized Auth service and API Gateway combination. DRY (Don't Repeat Yourself): Using token authentication logic in many services increases extra maintenance costs.Isolation: Business services focus only on business logic; they don't deal with "is this token valid?" questions.Performance: Thanks to the Redis connection, instead of going to the database with every request, we can resolve the validation via the cache in milliseconds. Plain Text [Client] ──► [API Gateway] ──► [Auth Service: validate token] │ (valid) ▼ [Backend Microservices] Cache-Focused Approach: Reducing Database Load In the classic workflow, every login request puts a load on the DB. With the cache-first approach, the process proceeds like this with a POST /auth/signin request: First, Redis is checked. If there is a valid and unexpired token for the user, it is replicated directly. In case of cache deficiency, AuthManager.authenticate() is activated, a DB query is sent, and a BCrypt check is performed. After a successful login, a token is generated with JJWT (HS256). This token is given to Redis with our changes and TTL (e.g., 24 minutes), and personal responses are converted. In this way, it protects our main database, especially in brute-force or high-intensity login password attacks. Plain Text POST /auth/signin │ ▼ ┌──────────────────────────────┐ │ Token exists in Redis? │──── YES ──► Return token (0 DB queries) └──────────────────────────────┘ │ NO ▼ ┌──────────────────────────────┐ │ AuthManager.authenticate() │ (DB query + BCrypt verification) └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Generate JWT (JJWT HS256) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Write to Redis (TTL: 24 min)│ └──────────────────────────────┘ │ ▼ Return token Implementation Details User Entity and UserDetails Integration In most projects, unnecessary mappings are performed between the User asset and the UserDetails objects expected by Spring Security. To reduce complexity, the User Entity is directly derived from the UserDetails interface. This makes the code cleaner and makes it "native," as outlined by Spring Security. Java @Data @Builder @NoArgsConstructor @AllArgsConstructor @Entity @Table(name = "T_APP_USER") public class User implements UserDetails { @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seq_user_gen") @SequenceGenerator(name = "seq_user_gen", sequenceName = "SEQ_APP_USER", allocationSize = 1) @Column(name = "idx") private Long idx; @Column(name = "firstname") private String firstName; @Column(name = "lastname") private String lastName; @Column(unique = true, name = "email") private String email; @Column(name = "accesskey") private String accessKey; // BCrypt-hashed @Column(name = "role") @Enumerated(EnumType.STRING) private Role role; @Override public Collection<? extends GrantedAuthority> getAuthorities() { return List.of(new SimpleGrantedAuthority(role.name())); } @Override public String getUsername() { return email; } @Override public String getPassword() { return accessKey; } @Override public boolean isAccountNonExpired() { return true; } @Override public boolean isAccountNonLocked() { return true; } @Override public boolean isCredentialsNonExpired() { return true; } @Override public boolean isEnabled() { return true; } } JWT Filter: The Gateway to Security The request to the system passes through the OncePerRequestFilter. Here, using JwtAuthenticationFilter, we parse the token in each request and populate the SecurityContext. By using the new SecurityFilterChain bean introduced with Spring Security 6, we have disabled CSRF and made session management completely stateless. Token Generation and Validation Java public interface JwtService { String extractUserName(String token); String generateToken(UserDetails userDetails); boolean isTokenValid(String token, UserDetails userDetails); } @Service public class JwtServiceImpl implements JwtService { @Value("${token.signing.key}") private String jwtSigningKey; // Base64-encoded secret key @Override public String extractUserName(String token) { return extractClaim(token, Claims::getSubject); } @Override public String generateToken(UserDetails userDetails) { return Jwts.builder() .setClaims(new HashMap<>()) .setSubject(userDetails.getUsername()) .setIssuedAt(new Date(System.currentTimeMillis())) .setExpiration(new Date(System.currentTimeMillis() + 1000 * 60 * 24)) .signWith(getSigningKey(), SignatureAlgorithm.HS256) .compact(); } @Override public boolean isTokenValid(String token, UserDetails userDetails) { final String userName = extractUserName(token); return userName.equals(userDetails.getUsername()) && !isTokenExpired(token); } private <T> T extractClaim(String token, Function<Claims, T> claimsResolver) { return claimsResolver.apply( Jwts.parserBuilder() .setSigningKey(getSigningKey()) .build() .parseClaimsJws(token) .getBody() ); } private boolean isTokenExpired(String token) { return extractClaim(token, Claims::getExpiration).before(new Date()); } private Key getSigningKey() { return Keys.hmacShaKeyFor(Decoders.BASE64.decode(jwtSigningKey)); } } High Availability: Redis Sentinel Using a single Redis instance means that the Auth service has a "Single Point of Failure." If Redis crashes, no one can access the system. This risk mitigation was achieved using Redis Sentinel. Thanks to the Sentinel structure: If the master node crashes, the dependent node is automatically promoted to master via failover. On the application side, we continuously manage these transitions using the Lettuce driver. Technical Stack and Requirements Redis Sentinel configuration: Java @Configuration public class RedisConfig { @Value("${spring.redis.sentinel.master}") private String master; @Value("${spring.redis.sentinel.nodes}") private String sentinelNodes; @Value("${spring.redis.password}") private String password; @Bean public RedisConnectionFactory redisConnectionFactory() { RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration() .master(master); for (String node : sentinelNodes.split(",")) { String[] hostPort = node.split(":"); sentinelConfig.sentinel(hostPort[0], Integer.parseInt(hostPort[1])); } sentinelConfig.setPassword(RedisPassword.of(password)); return new LettuceConnectionFactory(sentinelConfig); } } Plain Text yaml env: - name: spring.redis.sentinel.master valueFrom: secretKeyRef: name: redis-user-secret key: username - name: spring.redis.password valueFrom: secretKeyRef: name: redis-user-secret key: password Token cache service: Java @Service public class TokenCacheServiceImpl { private final RedisTemplate<String, String> redisTemplate; public TokenCacheServiceImpl(RedisTemplate<String, String> redisTemplate) { this.redisTemplate = redisTemplate; } public void cacheToken(String username, String token, long duration, TimeUnit unit) { redisTemplate.opsForValue().set(username, token, duration, unit); } @Cacheable(value = "tokens", key = "#username") public String getToken(String username) { return redisTemplate.opsForValue().get(username); } } Authentication service: signup and signin: Java @Service @RequiredArgsConstructor public class AuthenticationServiceImpl implements AuthenticationService { private final UserRepository userRepository; private final PasswordEncoder passwordEncoder; private final JwtService jwtService; private final AuthenticationManager authenticationManager; private final TokenCacheServiceImpl tokenCacheService; @Override public JwtAuthenticationResponse signup(SignUpRequest request) { var user = User.builder() .firstName(request.getFirstName()) .lastName(request.getLastName()) .email(request.getEmail()) .accessKey(passwordEncoder.encode(request.getAccessKey())) // BCrypt .role(Role.USER) .build(); userRepository.save(user); var jwt = jwtService.generateToken(user); return JwtAuthenticationResponse.builder().token(jwt).build(); } @Override public JwtAuthenticationResponse signin(SigninRequest request) { // 1. Check Redis cache first String cachedToken = tokenCacheService.getToken(request.getEmail()); if (cachedToken != null) { return JwtAuthenticationResponse.builder().token(cachedToken).build(); } // 2. If not cached, authenticate (DB + BCrypt) authenticationManager.authenticate( new UsernamePasswordAuthenticationToken(request.getEmail(), request.getAccessKey()) ); var user = userRepository.findByEmail(request.getEmail()) .orElseThrow(() -> new IllegalArgumentException("Invalid credentials.")); // 3. Generate token and write to Redis (24 min TTL) var jwt = jwtService.generateToken(user); tokenCacheService.cacheToken(request.getEmail(), jwt, 24, TimeUnit.MINUTES); return JwtAuthenticationResponse.builder().token(jwt).build(); } } JWT authentication filter: Java @Component @RequiredArgsConstructor public class JwtAuthenticationFilter extends OncePerRequestFilter { private final JwtService jwtService; private final UserService userService; @Override protected void doFilterInternal( @NonNull HttpServletRequest request, @NonNull HttpServletResponse response, @NonNull FilterChain filterChain ) throws ServletException, IOException { final String authHeader = request.getHeader("Authorization"); // Pass through if no Authorization header or doesn't start with Bearer if (StringUtils.isEmpty(authHeader) || !StringUtils.startsWith(authHeader, "Bearer ")) { filterChain.doFilter(request, response); return; } final String jwt = authHeader.substring(7); final String userEmail = jwtService.extractUserName(jwt); // Process only if SecurityContext has no authentication yet if (StringUtils.isNotEmpty(userEmail) && SecurityContextHolder.getContext().getAuthentication() == null) { UserDetails userDetails = userService.userDetailsService() .loadUserByUsername(userEmail); if (jwtService.isTokenValid(jwt, userDetails)) { SecurityContext context = SecurityContextHolder.createEmptyContext(); UsernamePasswordAuthenticationToken authToken = new UsernamePasswordAuthenticationToken( userDetails, null, userDetails.getAuthorities() ); authToken.setDetails(new WebAuthenticationDetailsSource().buildDetails(request)); context.setAuthentication(authToken); SecurityContextHolder.setContext(context); } } filterChain.doFilter(request, response); } } Spring Security 6 configuration: Java @Configuration @EnableWebSecurity @RequiredArgsConstructor public class SecurityConfiguration { private final JwtAuthenticationFilter jwtAuthenticationFilter; private final UserService userService; @Bean public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { http .csrf(AbstractHttpConfigurer::disable) // Stateless → no CSRF needed .authorizeHttpRequests(request -> request .requestMatchers("/auth/**").permitAll() // Auth endpoints open to all .anyRequest().authenticated() ) .sessionManagement(manager -> manager.sessionCreationPolicy(STATELESS) // No server-side session ) .authenticationProvider(authenticationProvider()) .addFilterBefore(jwtAuthenticationFilter, // JWT filter runs first UsernamePasswordAuthenticationFilter.class); return http.build(); } @Bean public PasswordEncoder passwordEncoder() { return new BCryptPasswordEncoder(); } @Bean public AuthenticationProvider authenticationProvider() { DaoAuthenticationProvider authProvider = new DaoAuthenticationProvider(); authProvider.setUserDetailsService(userService.userDetailsService()); authProvider.setPasswordEncoder(passwordEncoder()); return authProvider; } @Bean public AuthenticationManager authenticationManager(AuthenticationConfiguration config) throws Exception { return config.getAuthenticationManager(); } } Unit tests: Java @Test @DisplayName("Signin: if token is cached, should not query the DB") void testSignInWithCachedToken() { when(tokenCacheService.getToken(TEST_EMAIL)).thenReturn(TEST_TOKEN); JwtAuthenticationResponse response = authenticationService.signin( SigninRequest.builder().email(TEST_EMAIL).accessKey(TEST_PASSWORD).build() ); assertEquals(TEST_TOKEN, response.getToken()); verifyNoInteractions(authenticationManager); // No DB + BCrypt call should happen verifyNoInteractions(userRepository); } // Invalid token test — SecurityContext should remain empty @Test @DisplayName("With an invalid token, SecurityContext should remain empty") void testDoFilterInternalInvalidToken() throws Exception { when(request.getHeader("Authorization")).thenReturn("Bearer " + INVALID_TOKEN); when(jwtService.extractUserName(INVALID_TOKEN)).thenReturn(TEST_EMAIL); when(userService.userDetailsService()).thenReturn(userDetailsService); when(userDetailsService.loadUserByUsername(TEST_EMAIL)).thenReturn(userDetails); when(jwtService.isTokenValid(INVALID_TOKEN, userDetails)).thenReturn(false); jwtAuthenticationFilter.doFilterInternal(request, response, filterChain); verify(filterChain).doFilter(request, response); assertNull(SecurityContextHolder.getContext().getAuthentication()); } Summary and Conclusion With the purchasing architecture, not only a secure login screen; It has built an architecture that is extremely scalable, overcomes database bottlenecks with caching, and meets high availability (HA) standards. In particular, the modern architecture offered by Spring Boot 3 has made the security layer much more flexible. If you are starting a large-scale microservice project, you can design token management from the outset in this "stateless" and "cached" manner.

By Erkin Karanlık

Feature Flag Debt: Performance Impact in Enterprise Applications

Feature flags have become standard practice in enterprise applications, enabling teams to release code into production environments without exposing new features to users. As teams leverage feature flags to increase delivery velocity, technical debt accumulates. Left unchecked, this debt will slowly and silently impact application performance, maintainability, and developer productivity. What Is Feature Flag Debt? Feature flag debt occurs when feature flags are left in the codebase after they’ve served their purpose. The most common symptoms of feature flag debt include: Dead code Context switching for developers Feature flag debt can go unnoticed because it typically doesn’t cause broken features. As a result, developers are often reluctant to clean up flags so they can focus on developing new features. Impact on Performance Feature flag debt can have serious consequences for application performance. In front-end applications, this is often overlooked. Once a feature flag has been introduced into a codebase, it incurs a long-term cost every time the application is loaded in the browser. Larger JS bundles: Each feature flag adds logic to the application. When feature flags are not cleaned up, the associated code is typically not removed from the final bundled app. This means more code for users to download and more memory used on the client.Reduced execution speed in client-side rendering: The browser must download, parse, and evaluate the entire bundle, even if certain code paths are never executed. This leads to slower parsing, longer load times, and slower interaction time. Impact on Developer Productivity Feature flag debt also negatively impacts developer productivity. Imagine having to read through an if/else statement that checks a feature flag that will never be true. Developers frequently encounter this scenario when working with feature flags. New engineers, in particular, often struggle to know which feature flags are safe to ignore. Should they be commenting out this code? What if they need it later? Why Aren’t Feature Flags Cleaned Up? It should be standard practice to remove feature flags from the codebase once they’re no longer needed. However, they often become a long-term liability for the application for several reasons: Nobody takes responsibility for cleaning up flags.People are afraid to remove code.There are no tools to help automate the process.There’s always something more pressing to work on. We often don’t see a defined feature flag lifecycle, which leads to indefinite accumulation. Example of Feature Flag Debt For example, let’s take a look at how a feature would typically look when wrapped in a feature flag: JavaScript const isAIAgentsFeatureFlagEnabled = isFeatureEnabled('ai-agents'); if (isAIAgentsFeatureFlagEnabled) { // lines of code // Code to run when the feature flag is enabled } else { // lines of code // Code to run when the feature flag is disabled } When first implemented, this doesn’t look too bad. When this feature is rolled out to production, there’s still the safety net of keeping the original functionality should something go wrong. However, after the feature flag is turned on for everyone and the feature reaches general availability (GA), there is no reason to keep both pathways in the application. The application still ships both pieces of code in the bundle, but only one will ever execute at runtime. The else block now represents dead code that will not get executed, but still takes up space in the bundle and adds to code complexity. Manage and Eliminate Feature Flag Debt Organizations need to take measures to prevent feature flag debt from slowing down their applications. Defining a feature flag life cycle is a great place to start. By enforcing that each feature flag has a description, owner, status, and expiration date, the team can ensure flags aren’t left to become debt. Treat feature flags as temporary and not part of the application's core architecture. When the feature is in GA, remove the flag and delete any code paths that are no longer needed. This results in a cleaner, more maintainable, and performant codebase. JSON [ { "feature_flag_name": "ai-agents", "description": "Feature flag that will allow AI agents to assist users with workflows and provide suggestions", "owner": "architecture crew", "status": "GA", "expiration_date": "2026-12-31" }, { "feature_flag_name": "smart-checkout", "description": "Feature flag that will allow smart checkout features, including dynamic pricing, custom offers", "owner": "architecture crew", "status": "Dev", "expiration_date": "2026-12-31" }, { "feature_flag_name": "ai-agents-eval", "description": "Feature flag to allow the evaluation framework to execute tests against AI agents to determine how accurate they are", "owner": "agent evaluation crew", "status": "QA", "expiration_date": "2026-10-12" }, { "feature_flag_name": "experiment-recommendation-v2", "description": "Feature flag for experimenting v2 recommendation version", "owner": "agent evaluation crew", "status": "GA", "expiration_date": "2026-12-31" } ] Having the feature flags stored in a format similar to the above can help identify who to contact to clean up old flags. Performance Gains From Cleanup Removing unused feature flags reduces bundle size and eliminates unnecessary code execution, resulting in faster load times, improved rendering performance, and a cleaner codebase. Conclusion For most enterprise applications, feature flags aren’t the problem; it’s forgetting to take them down. As the application grows over time, old feature flags accumulate, which will silently bloat the bundle size, degrade performance, and clutter the code.

By Poornakumar Rasiraju

Docker Hardened Images Are Free Now — Here's What You Still Need to Build

The Problem Isn't the Image Hardened container images are no longer niche. Docker open-sourced major portions of the tooling behind Docker Hardened Images under Apache 2.0 in late 2025. Chainguard and Google's distroless variants sit in the same space. The pitch across all three: fewer packages, smaller attack surface, dramatically lower CVE counts. The pitch is accurate. It is also incomplete. Most container security failures are not image failures. They are governance failures: A team pushes a debug build to production. Admission control doesn't block it because the policy is in Audit mode, not Enforce.A six-month-old deployment keeps running an ancient image digest while the team patches newer builds. Nobody detects the drift.The platform team rotates signing keys. Old pipelines keep producing images signed with the revoked key. Admission still accepts them. Nobody notices for ninety days.A vendor pushes an updated base image under the same tag. CI rebuilds against the new digest. The new digest is unsigned. Production takes it. No alert fires. None of these are CVE failures. They are governance failures — gaps in how images are produced, attested, verified, and monitored. Swapping the base image to a hardened variant changes none of them. A signed-and-attested hardened image in a cluster that doesn't verify signatures is operationally equivalent to a signed Ubuntu image in that cluster: the signature is decorative. I recently worked on migrating a regulated production workload onto a hardened-image baseline. Lab 12 of my docker-security-practical-guide repository is a sanitized, reproducible distillation of what that work taught me. The short version: the value is in the control plane around the image, not the image itself. The Trust Control Plane in 60 Seconds In practice, the hardest part is not enabling hardened images. It is operating trustworthy deployments at scale without slowing engineers down. The operating model has three layers, joined by a feedback loop: Supply Chain layer – images are signed (cosign keyless against Fulcio), attested with an SBOM (syft + CycloneDX), and scanned for vulnerabilities (grype). The output: an image whose origin and contents are independently verifiable by anyone.Trust layer – an admission controller (Kyverno) verifies signatures and attestations before any pod is scheduled. The admission policy is the unit of governance: it encodes which signers, which attestations, and which constraints are required for a workload to start.Enforcement layer – continuous drift detection answers the question: admission can't: has the digest drifted since we admitted it? Has the signing key been revoked? Has a new unsigned workload landed via a controller that bypasses admission?Feedback loop – drift findings feed back into the supply chain: a drift event produces a rebuild; an admission rejection produces a ticket. Without the loop, the enforcement layer becomes an alerting backwater that engineers mute. FIGURE 1 — Trust control plane for cloud-native software supply chain security.The architecture separates supply chain generation, admission-time trust verification, and continuous runtime enforcement into independent layers connected through a feedback loop. The pattern is vendor-agnostic: any compatible signing, admission, and drift-detection components can fulfill these roles. The bottom line: a hardened image is one input to the supply chain layer. Without trust verification, it's indistinguishable from a regular image at deploy time. Without enforcement, untrusted images coexist with hardened images in the same cluster. Without the feedback loop, trust state drifts silently. Admission Control: Where Governance Gets Teeth The trust layer is where the control plane becomes operationally real. In the lab, Kyverno's verifyImages rule asserts that every image carries a cosign signature from an approved identity. Here's the core of the policy: YAML apiVersion: kyverno.io/v1 kind: ClusterPolicy metadata: name: require-signed-images spec: validationFailureAction: Enforce rules: - name: verify-cosign-keyless match: any: - resources: kinds: [Pod] verifyImages: - imageReferences: ["ghcr.io/opscart/*"] attestors: - entries: - keyless: subject: "https://github.com/opscart/*" issuer: "https://token.actions.githubusercontent.com" required: true The subject and issuer together define who is trusted. For DHI images, these values point to Docker's signing identity. For Chainguard, Chainguard's. The shape of the policy is identical in all cases — only the identity matcher changes. When someone deploys an unsigned image, the rejection is immediate and actionable: Shell $ kubectl run test --image=nginx:latest --restart=Never Error from server: admission webhook "validate.kyverno.svc-fail" denied the request: resource Pod/default/test was blocked due to the following policies require-trusted-registry: trusted-registries-only: 'validation error: Image must come from a trusted registry. Allowed: dhi.io/*.' FIGURE 2 — Kyverno admission webhook rejecting an nginx pod from an untrusted registry. Capture from terminal: kubectl run rejected-test --image=nginx:latest --restart=Never (with cluster up and policies applied). Catching an unsigned image at admission costs one re-run of kubectl apply. Catching the same workload running in production a week later costs a security ticket, an incident response, and possibly a regulatory disclosure conversation. Moving rejection earlier is the highest-leverage decision in the entire model. Phased Rollout: Audit Before Enforce In production, you don't flip everything to Enforce on day one. The lab uses a phased approach: the trusted-registry policy runs in Enforce mode (hard gate on image origin), while signature and SBOM verification policies run in Audit mode (log violations, don't block). This gives teams a migration runway: they can see which workloads would fail and fix them before the policies graduate to Enforce. The shift from Audit to Enforce is a single-field YAML change. Signing Your Supply Chain: Keyless Cosign The supply chain layer produces the artifacts that admission verifies. A common modern approach uses cosign with GitHub Actions OIDC for keyless signing — no private keys to manage, rotate, or leak. The mechanism: GitHub Actions mints a short-lived OIDC token at workflow time. Cosign exchanges it for an ephemeral certificate from Sigstore Fulcio, signs the image, and destroys the key immediately. The certificate records which workflow, on which repository, at which commit, produced the signature. The signature is logged in Sigstore Rekor's public transparency log. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The complete workflow and run history is public. The important property is that anyone can independently verify the signed artifact. Shell cosign verify \ --certificate-identity-regexp \ "^https://github\.com/opscart/docker-security-practical-guide/ \.github/workflows/supply-chain-gate\.yml@.+$" \ --certificate-oidc-issuer \ "https://token.actions.githubusercontent.com" \ ghcr.io/opscart/docker-security-practical-guide/dhi-sample-app:latest Verification for ghcr.io/opscart/.../dhi-sample-app:latest -- The following checks were performed on each of these signatures: - The cosign claims were validated - Existence of the claims in the transparency log was verified offline FIGURE 3 — cosign verify succeeds for any reader, without shared secrets. Capture from terminal: run the cosign verify command above against the published image at ghcr.io. This is what "supply chain security" means in practice: not "we sign our images," but "our trust assertions are independently verifiable by anyone, against neutral infrastructure, without prior trust setup." The published image can be verified directly against the public artifact. Fleet Drift: The Problem Nobody Watches Admission is point-in-time. Production is continuous. The enforcement layer's job is to answer the questions that admission can't: has the digest drifted since we admitted it? Has a new unsigned workload landed via a controller that bypasses admission? The lab's E1 experiment runs a drift audit against a synthetic 12-service fleet mixing DHI, Docker Hub, internally-built, and abandoned images. The fleet is intentionally constructed with an explicit variation matrix — the numbers below describe the synthetic fleet's structure, not measurements from a deployed environment. In this synthetic fleet, unsigned services averaged 13.0 critical CVEs while signed-and-verified services averaged 0.0. The exact ratio will vary by environment, but the audit makes the trust gap continuously visible. FIGURE 4a — Fleet drift audit: signing state vs CVE correlation across the synthetic fleet. Capture from terminal: run ./experiments/E1-drift-observation/analyze-drift.py. Screenshot Sections 1–3 (Fleet Summary + Origin×Signing Correlation + Signing State → CVE Accumulation) FIGURE 4b — Remediation order: compliance-scope risk concentration and prioritized action queue. Same script output, Sections 4 + 7 (Compliance Scope Risk Concentration + Recommended Remediation Order) The ratio isn't the point — your fleet will produce different numbers. What the control plane provides is the continuous, attributable surfacing of whatever the ratio actually is, including cases where the supposed benefit of hardening is harder to defend. That honest feedback loop is what turns the audit from a compliance checkbox into a supply chain prioritization tool. The Substitution Test A useful test for whether you've found an architectural pattern or a vendor recipe: can you swap a major component and have everything else continue to work? For this architecture, the test is straightforward. The lab demonstrates three configurations: Docker Hardened Images (dhi.io), Chainguard Images (cgr.dev/chainguard), and a self-built Alpine base signed against a project-owned GitHub Actions OIDC identity. In all three, the Kyverno policy structure is identical. The drift audit runs unchanged. The SBOM verification runs unchanged. Edits are confined to the identity matcher and the image references. The implication: "Should we standardize on DHI or Chainguard?" is a commercial decision (pricing, catalog coverage, support), not an architectural one. The architectural decision is whether to operate the trust control plane at all. A team that has invested in the control plane has built portable institutional capability. A team that has invested in "we use DHI" has bought a product, and a future migration off DHI is a structural rewrite rather than a configuration update. Production Friction: What Actually Goes Wrong The model works. It is also not free. Here are the operational costs my team hit, documented in detail in the companion repo's TROUBLESHOOTING.md: No shell. Distroless hardened images don't include /bin/sh, curl, wget, cat, or ls. When an engineer pages at 2 AM and runs kubectl exec -it pod -- /bin/sh, the command fails. The remediation is kubectl debug with an ephemeral debug container attached to the pod's process namespace. Train your on-call rotation on kubectl debug before migration, not after. The lab's E5 experiment documents three debug patterns (ephemeral containers, dev-variant images in dev namespaces only, pre-built debug sidecars) with runbook scenarios for unreachable services, crashloops, and OOM kills. Migration is not a FROM line change. The default user is nonroot (UID 65532), not root. Library paths differ. pip install --user installs to /home/nonroot/.local, not /root/.local. Required system packages (ca-certificates, timezone data) that come for free in stock bases must be explicitly carried over. The lab's Dockerfile required three iterations before the build succeeded locally: shell-form RUN failed (no /bin/sh), then pip --user installed to the wrong path, then requirements.txt pinned package versions that didn't exist on PyPI. Each of these is a 30-second local fix — and a 5-minute GitHub Actions round-trip if you don't test locally first. Signature paths vary by vendor. DHI signatures resolve via registry.scout.docker.com, not at the image's own registry path. Kyverno handles this through the policy's repository field, but any custom verification tooling needs to know. Plan to audit verification code before migration. Kyverno has schema gotchas. rekor and ctlog blocks must be inside keys, not siblings. webhookTimeoutSeconds is capped at 30. mutateDigest: true is incompatible with validationFailureAction: Audit. PolicyException requires an explicit feature flag. Each of these cost me 30–60 minutes of debugging — they're in TROUBLESHOOTING.md, so they don't cost you the same. None of these are deal-breakers individually. All of them together are why migrations slip from "next quarter" to "abandoned after two months." Budget for friction. When This Is Overkill The investment's value scales with three factors: regulatory pressure (HIPAA, PCI-DSS, SOC 2 Type II, FDA 21 CFR Part 11), fleet size and heterogeneity (8+ clusters, dozens of teams pushing images), and blast radius (pharmaceutical patient data vs. internal dashboard). Concretely: pre-production tools, side projects, prototypes, and developer sandboxes do not need this. They benefit from a hardened base image (free) and should not be put behind the full trust control plane. The overhead of policy maintenance, key rotation, and drift remediation outstrips the risk reduction. For most workloads outside regulated production, the supply chain layer alone — sign and SBOM your builds — captures most of the available value at a fraction of the cost. Conclusion: Architecture Over Image Choice Hardened images are useful. The point of this article is that they are one component of a broader architectural pattern, and the security outcomes regulated teams want are properties of the pattern, not the component. A team that adopts hardened images without the surrounding pattern has made a real but limited improvement. A team that adopts the pattern with any reasonable image vendor — DHI, Chainguard, or a self-built base — has built portable institutional capability. The substitution test is the diagnostic: ask whether a future migration away from your current image vendor is a configuration edit or a structural rewrite. If it's the former, you have the pattern. If it's the latter, you have a product dependency. The companion repository at github.com/opscart/docker-security-practical-guide (tag v1.12.0) contains everything in this article: working Kyverno policies, a keyless-signed sample image you can pull and verify right now, fleet drift audits, and five hypothesis-driven experiments. The cosign verify command above works against the published artifact today. Spend the design effort on the pattern. The image will be replaceable. The governance is what survives vendor replacement. This article is adapted from a longer write-up on OpsCart, which includes the complete threat model, substitution-test configurations, and an extended troubleshooting log.

By Shamsher Khan

CORE

DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery. High-performing engineering organizations don’t scale through heroics. They scale through repeatable platform capabilities backed by evidence. This checklist reflects the shift from tool‑centric DevOps to product‑oriented platform engineering, focused on scale, reliability, and developer outcomes. It is intended for platform teams, cloud architects, and engineering leaders building internal developer platforms (IDPs) that deliver consistency, velocity, and control. Architecture and Platform Foundations Establishing standardized, versioned platform foundations makes workloads deployable, observable, and scalable by default while preventing drift and reducing risk. Core platform primitives are standardized: identity, networking, compute, storage, and secretsStandard blueprints exist and are version-controlled for common workloads with clear evolution pathsInfrastructure is provisioned via reusable IaC modules with policy validationEnvironments and clusters follow consistent topology and access modelsNetworking and service communication follow secure, consistent patternsSecrets and configurations are centrally managed and injected securelyArchitectures define scalability mechanisms and fault boundariesResilience is built in through redundancy and failoverShared services are centrally managed with defined ownership and SLAsPlatform capabilities are versioned for backward compatibility Platform Ownership and Operating Model A product‑oriented operating model enables scale without slowing teams. Define clear ownership, interfaces, and governance so the platform evolves without becoming a delivery bottleneck. A dedicated platform team owns roadmap, usability, reliability, and adoptionOwnership boundaries are defined (platform standardizes; app teams own service logic)Platform capabilities are easy to discover and use (e.g., templates, workflows, golden paths)A structured intake and support model exists (e.g., requests, issues, exceptions)Standards are enforced with governed exceptionsPlatform success is measured through adoption and delivery outcomesUsage data and feedback drive continuous improvementCapabilities are versioned and evolved predictably Environments and Golden Paths Translate platform architecture into opinionated, self-service workflows driven by organizational standards that reduce complexity and enforce best practices by default. Golden paths are effective only when they are widely adopted. Environment conventions are standardized across naming, configuration, and accessEnvironment state is enforced through IaC/GitOps to prevent driftGolden paths provide curated, reusable templates for common workloadsSecurity, observability, and policy defaults are built into golden pathsGolden paths balance strong defaults with controlled flexibilitySelf-service workflows enable scaffolding, provisioning, and deploymentEnvironment lifecycle is automated across provisioning, promotion, and teardownDocumentation and onboarding are well integrated into workflowsAdoption is measured through usage and coverageFeedback and production learnings drive continuous evolution Pipelines and Release Reliability Standardize delivery pipelines so every change is validated, traceable, and safely releasable, making delivery more predictable and recoverable, not just faster. Pipelines follow a standardized flow: build, test, package, deploy, and promoteQuality, security, and policy checks are embeddedArtifact promotion across environments is controlled and consistentEach release produces traceable, auditable evidenceRollback and recovery paths are implemented and testedFailures provide fast, actionable diagnosticsReliability metrics are tracked (e.g., success rate, change failure, rollbacks)Release ownership and escalation paths are clearly defined Toolchain and Self-Service Automation Provide consistent self‑service automation through curated tools and embedded guardrails that reduce fragmentation, risk, and operational complexity. A unified developer point of entry exists through an IDP or developer portalStandard workflows exist for deployment, environment setup, and accessReusable modules and templates prevent copy-paste sprawl and reduce cognitive loadProvisioning and deployments are automated with guardrailsRBAC and approvals are embedded into automationHigh-risk actions require audited approvalsWorkflow reliability, usage, and failures are measuredAutomation evolves continuously based on usage and feedback Observability and Operability Embed observability and operational guardrails into self-service automation so systems are consistent, measurable, diagnosable, and operable by default. Logs, metrics, and traces are included by default through templates and golden pathsMinimum observability standards are enforced for promotionDashboards and alerts are preconfigured and actionableTelemetry supports debugging, capacity planning, and optimizationService health targets (e.g., SLOs) guide operationsOperational ownership is defined across on-call, escalation, and boundariesRunbooks guide incident response and recoveryIncident learnings feed platform and template improvements Reliability, Resilience, and Recovery Design for failure up front so systems fail safely, degrade gracefully, and recover predictably, proving resilience through recovery, not uptime alone. Architectures isolate failures to limit blast radiusDependencies are evaluated for availability and fallback strategiesResilience patterns are built in by default (e.g., retries, timeouts, circuit breakers, degradation)Non-critical features degrade without impacting core functionalityRecovery objectives are defined and validatedBackup and recovery mechanisms are implemented and testedRecovery is automated to minimize manual interventionGame days, chaos experiments, or failure drills are conducted to validate system behavior under stressReliability metrics are tracked and optimized (e.g., recovery time, failure rate) Security Guardrails and Governance Enforce security and compliance through codified guardrails embedded in delivery workflows, with continuous monitoring to improve security posture over time. Access follows least-privilege principlesSecrets are centrally managed and securely injectedPolicies are codified and enforced consistently through Policy as CodeSecurity controls are embedded in pipelines, including scanning and config checksHigh-risk actions require controlled approvalsExceptions are time-bound, tracked, and reviewedAll changes are auditable and traceableCompliance requirements map to enforceable controls Developer Experience, Adoption, and ROI Improve DevEx by reducing friction, driving platform adoption, and linking usage to measurable delivery outcomes and business impact. Developer experience is consistent across services and environments Platform abstracts common concerns (e.g., infra, security, observability) through standardized defaultsOnboarding to first deploy is fast and frictionlessDocumentation, examples, and enablement drive consistent adoptionPlatform and golden path adoption are measured through usage, onboarding, and coverageKey DevEx metrics are tracked (e.g., lead time, change failure rate, MTTR, time to first deploy)Workflow usability and reliability are continuously optimizedFeedback and usage data drive platform improvementsROI is measured through delivery outcomes (e.g., reduced toil, incidents, faster releases) Platform Engineering Maturity and Assessment Platform engineering maturity can be assessed across three practical stages that reflect the consistent application, adoption, and improvement of platform capabilities: Foundation focuses on baseline standardization, safety, and operability, with reusable capabilities in place but adoption still uneven.Scale enables reliable self‑service through guardrailed golden paths, improving delivery without increasing operational overhead.Optimize treats platform engineering as a strategic differentiator, using data‑driven decisions to continuously improve resilience, developer experience, cost efficiency, and measurable ROI. Use the Maturity Scoring Matrix to assess maturity across core platform engineering capabilities. Rate each category once, on a scale of 1–5, based on available evidence rather than aspiration. Overall maturity is determined by the dominant scoring pattern across the matrix, with higher maturity requiring consistent strength across Foundation, Scale, and Optimize. The progression bar maps scores from Ad Hoc to Strategic and groups them across the Foundation, Scale, and Optimize stages. Repeat the assessment periodically to identify gaps, track progress, and guide platform roadmap priorities. Conclusion Treat this checklist as a baseline gate and a recurring review mechanism, not a one-time exercise. High-performing platforms evolve through continuous refinement of architecture, automation, governance, and developer experience. Use it to identify gaps, strengthen golden paths, and align platform capabilities with measurable delivery outcomes. This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.Read the Free Report

By Josephine Eskaline Joyce

CORE

Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering introduced with Delta Lake 3.0 goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management. Recap: Partitioning and Z-Order Limitations Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables: Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult and usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning. Data engineers must monitor query patterns and manually decide how to partition or when to re-Zorder. This ongoing maintenance is time-consuming and error-prone. In summary, traditional partitioning/Z-ordering yields performance benefits but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution. What Is Liquid Clustering? Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name liquid signifies flexibility data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include: Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys or file size tuning the platform handles optimal file sizing and clustering internally. Even high-cardinality columns can be used effectively, which would be impractical as partition keys.Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys. There’s no massive upfront cost of re-partitioning the entire dataset past data doesn’t need an immediate rewrite.Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks. Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by set it and forget it clustering. In fact, thousands of customers have already adopted it as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables. Liquid Clustering vs Traditional Methods Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways: No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data. Maintenance operations like OPTIMIZE still play a role but they can operate more efficiently since the incoming data is already sorted/clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE+ZORDER it only rearranges data that isn’t well clustered yet rather than always rewriting everything.Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset clustered with Liquid Clustering ran 2.5× faster to optimize (cluster) than using Z-Ordering, and yielded significantly better query performance than both partitioning or Z-Order. In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly. How Liquid Clustering Works (Under the Hood) At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms: Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.Data Skipping with Statistics: Delta Lake maintains statistics that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters. Enabling Automatic Clustering To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations). Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward: SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering: SQL -- Creating a Unity Catalog managed table with Automatic Liquid Clustering CREATE TABLE main.analytics.user_events ( user_id STRING, event_type STRING, event_date DATE, details STRING ) CLUSTER BY AUTO; -- enables automatic liquid clustering on this table SQL ALTER TABLE main.analytics.user_events CLUSTER BY AUTO; This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.) PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark: Python # df is a DataFrame we want to save as a Delta table with auto clustering df.write.format("delta") \ .option("clusterByAuto", "true") \ .mode("overwrite") \ .saveAsTable("main.analytics.user_events_auto") The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required – the system will figure it out if you leave it open.) Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on a serverless compute in the background. The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times. Using Manual Liquid Clustering (Custom Clustering Keys) In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it: Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example: SQL -- Create a Delta table clustered by specific columns (manual clustering) CREATE OR REPLACE TABLE main.analytics.sales_data ( sale_id BIGINT, region STRING, product STRING, sale_date DATE, amount DECIMAL(10,2) ) CLUSTER BY (region, sale_date); In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date. Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance: SQL ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date); This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize. Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example: Python OPTIMIZE main.analytics.sales_data; The above will compact small files and cluster data incrementally. If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data **FULL**. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time. PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example: Python # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys df.write.format("delta") \ .mode("append") \ .clusterBy("region", "sale_date") \ .saveAsTable("main.analytics.sales_data"); Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys. Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version which older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries. Conclusion Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidness of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For Data Engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further clustering becomes a self-driving process, leveraging query insights to continuously improve performance. In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start it can simplify your architecture and ensure your tables automatically stay optimized as your data (and its use cases) grow.

By Seshendranath Balla Venkata

Architecting an Embedded Efficiency Layer: A Platform Deep Dive into Day-Two Operational Tuning

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery. I am developing a reference guide for platform teams that want continuous optimization embedded directly into their internal developer platforms. In this proposed model, “done” means automated, full-stack tuning recommendations that fit safely and seamlessly into existing engineering workflows. Building golden paths for pre-deployment tasks is relatively straightforward because engineering teams share the primary goal of shipping applications faster. However, after deployment, sustained efficiency frequently becomes a neglected task that is “someone else’s job.” Developers prioritize shipping, SREs protect safety buffers, and FinOps pushes for cost reduction. The reference model proposes a dedicated efficiency layer as a required platform capability designed to reconcile those priorities without requiring a replatform. In this one-layer deep dive, we focus only on the embedded efficiency layer: its interfaces, interaction model, and what it requires to be credible. Project Constraints I anchor my design on the assumption that engineering teams are already managing their production deployments through established IaC and GitOps practices. Unlike pre-deployment pipelines that often enforce strict corporate standards, a post-deployment efficiency optimizer cannot be rigidly opinionated. Every microservice possesses unique architectural characteristics and operational requirements that demand a highly configurable approach to system optimization. I recommend allowing teams to define explicit parameters based on the workload context, dictating whether a particular service requires a specific operational profile. ProfileIntentTradeoff Cost-first Aggressive cloud cost reduction Less headroom, higher reliability risk Performance-first Maximum throughput performance Higher cost (maybe), tighter buffers Reliability-first Expanded reliability buffer for unpredictable traffic spikes Higher baseline spend Architecting the Day-Two Golden Path Effective efficiency optimization requires an architectural deep dive beyond superficial cloud scaling metrics. The framework I recommend orchestrates continuous tuning across the entire technological stack, cascading from the underlying infrastructure nodes down through Kubernetes configurations and directly into the application runtime. Adjusting CPU requests and memory limits at the container level is mathematically insufficient if the underlying Java Virtual Machine or application runtime parameters remain poorly calibrated for those newly allocated resources. Consequently, the guide treats the underlying correlation engine as a mandatory architectural component for producing holistic configuration recommendations. FLOW: infrastructure metrics + Kubernetes signals + app monitoring → correlation engine → recommendations (infra/k8s/runtime) Figure 1: Full-Stack Optimization Layers The Interaction Model The foundational principle governing this architectural layer is an explicit human-in-the-loop (HITL) model. Fully autonomous, black-box changes erode trust when operators can’t see the reasoning behind configuration updates. Instead, the multi-dimensional tuning recommendations surface inside the developer’s GitOps workflow, presenting clear explainability about how a change affects latency, reliability, and cost. HITL ensures engineers retain final approval over critical production changes, but it introduces review latency and requires significantly more comprehensive explainability documentation for every recommendation. Scenario Walkthrough A critical microservice begins experiencing rising cloud costs alongside escalating p95 latency. The embedded optimization engine detects the drift, correlates the cross-stack metrics, and proposes two runtime adjustments via an automated GitOps pull request. The application owner reviews the generated explainability visuals, verifies that the tuning resolves the latency issue without violating any existing rule, and manually merges the request. The platform seamlessly applies the validated configuration and continuously tracks the resulting operational benefits. Figure 2: The Interaction Model That workflow only holds if the following choices are true: Capabilitytradeoffwhat makes it workable Tuning profiles Requires explicit rules definition Profile selection per service or category Full-stack tuning More complexity than infra-only Correlation across infra + app metrics GitOps surfacing Adds workflow touchpoints PR-based delivery in existing process Human in the loop Review PRs and recommendation docs Explainability visuals + approval step Takeaways Based on the framework in this reference guide, here is what I would tell someone building an embedded efficiency layer next, based on their involvement: Designing the interaction model: Prioritize operator trust and mathematical transparency over fully autonomous, unexplainable actions.Defining the technical scope: Ensure your engine tunes the entire stack, from the underlying infrastructure down to the application runtime, rather than settling for superficial cloud resource constraints.Navigating the sociotechnical divide: Treat the optimization layer as a collaborative platform capability that grounds the competing priorities of developers, reliability engineers, and FinOps, not a financial audit mechanism. This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.Read the Free Report

By Graziano Casto

CORE

DevOps and CI/CD

DZone's Featured DevOps and CI/CD Resources

Top DevOps and CI/CD Experts

The Latest DevOps and CI/CD Topics