DZone Spotlight

Wednesday, July 1

If You Can Facilitate a Retrospective, You Can Audit Your AI

By Stefan Wolpers


TL;DR: The AI Delegation Audit

Scrum teams inspect how the last Sprint went during the Retrospective. They are much less likely to inspect the work they have handed to AI, because no meeting on the calendar owns it. That gap is where a working AI automation quietly turns into risk: it keeps producing fluent, on-brand output long after the decision to trust it has expired. The AI Delegation Audit closes the gap by leveraging the facilitation skills teams already use in a Retrospective.

Thesis: The Delegation Audit is the missing inspection cadence for delegated AI work. It checks four things: whether the work still meets the standard, whether the model still fits the task, whether the team can still stop the automation, and whether reviewed assistance has quietly become unreviewed automation. You can try it on one workflow in fifteen minutes.

The Automation That Looked Healthy

A product team automates its Friday stakeholder update in March. The setup is careful: the model drafts from the Jira board, the workflow owner reviews the draft, and it ships. For three months it works. In June, the same automation tells an enterprise prospect that a security feature is in production. No application code changed, and nobody touched the prompt. But the system around the automation had shifted: a descoped feature, a stale ticket title that survived in the product backlog, and a change in model behavior combined into a false update.

The dangerous part was not a visible failure: the automation kept producing fluent, plausible, on-brand updates, which is exactly what made the degradation hard to notice. That points to the belief worth naming first: a workflow that still produces output is assumed to be still fully functioning. A working automation is not evidence that the delegation behind it is still valid, and validating it once, at setup, is not the same as keeping it valid.
What the Delegation Audit Is

The Delegation Audit of the A3 Framework borrows the facilitation pattern of a Retrospective, not the Scrum event itself. Instead of how the team worked, it examines how the team’s AI delegations are holding up: 45 to 60 minutes, monthly or every other Sprint, with a named owner and a slot on the calendar.

In the A3 Framework, this is what the Automate category has always required. The moment you trust work to run with little or no human review, you owe it explicit rules and a recurring audit. Most teams adopt the rules and skip the audit because no one owns it. The Delegation Audit is that meeting, and it is the Inspect step of the AI Delegation Lifecycle.

The name is deliberate: nobody in finance, security, or operations needs an agile glossary to understand what a delegation audit is or why a team runs one. The practice underneath is familiar: gather data, surface what changed, turn findings into decisions, and leave with owners.

The Four Checks

Each check inspects one way a delegation degrades after it goes live:

- Output and source drift: Does the work still meet its AI Definition of Done, and are the inputs still fit for use? Pull three recent outputs per workflow and trace each one back to its sources. Model updates change output quality in both directions without notice, and the inputs move along with them: stale records, changed permissions, and archived data that the model cannot tell from current facts. A polished summary built on stale data is still a failed delegation.
- Model fit: Is the assigned model still the right one? Look in both directions: a cheaper tier that no longer meets the standard, and a frontier model burning budget on work that a mid-tier now handles. The test is whether the model is sufficient for this task at this risk level, not whether it is the most capable one available. If your team runs a routing policy, this check feeds into it, and the cost side has its own treatment in token economics.
- Reversibility: Could you stop each automation today? Test the stop rules from your handoff: who pulls the plug, how fast, and whether that person still works here. An automation without a reachable owner is not delegated; it is abandoned, and it has become a risk.
- Category creep: Which Assist work has become unreviewed Automate? Watch for the tell: review time per output trending toward zero. When a human approves a draft in 4 seconds, that is not review, and the work changed its A3 category without anyone deciding. Name it, then choose: promote it to Automate properly, with rules and a stop rule, or restore genuine review.

Run It Like a Retrospective

The agenda fits 60 minutes and will feel familiar:

- Data walk (10 min): Put the delegation inventory on the wall: every automated and assisted workflow, its A3 category, its model tier, its last audit date. Add usage or spend data if you have it. Look first, discuss later.
- Run the four checks in pairs (20 min): Assign workflows to pairs. Each pair runs all four checks on its workflows and marks each finding pass, drift, or fail.
- Re-classify (15 min): Walk through the findings. Every drift or fail gets a decision: change the A3 category, change the tier, update the AI Definition of Done, fix the stop rule, or retire the delegation. Retiring an automation that no longer earns its audit cost is a successful outcome of the meeting.
- Decisions and owners (10 min): Each decision gets a name and a date. A finding without an owner is one you will rediscover next time; don’t create waste.
- Close the record (5 min): Update the log: what moved, why, and who decided.

Why Inspection Stopped Being Optional

Two forces make a standing audit necessary now.

The first is the models: they update on the vendor’s schedule, not yours. A change to how a model summarizes, refuses, or formats can move output quality with no signal on your side. An automation you validated once is running on assumptions that have quietly expired.

The second is accountability: NIST organizes AI risk management around four functions: govern, map, measure, and manage. Inspection is the measure-and-manage half, and a team that only governs and maps has stopped before the work becomes operational. Set-and-forget is the default, and it compounds unseen until a drifted output becomes an incident in front of the wrong audience.

The Record You Get for Free

Each audit updates a dated log: workflow, owner, model tier, last checked output, drift finding, decision, and follow-up date. Stack those logs, and you have an inspection trail: evidence that your team’s AI adoption is controlled rather than assumed. When a stakeholder, for example, a prospect’s procurement team, asks how you govern your internal AI use, that trail is half the answer, and you wrote none of it as a separate report. It came out of one recurring meeting.

What to Do in Your Next Retrospective

Do not schedule a new event yet. Take one delegated workflow, the one that would embarrass you most if it drifted, and spend fifteen minutes of your next Retrospective running the four checks on it out loud: output and source, model fit, reversibility, category creep. You will probably find at least one answer that amounts to “nobody has looked since we set this up.” That single finding is enough to put the audit on the calendar.

Conclusion

A Retrospective keeps a team honest about how it works together. The Delegation Audit extends that same facilitation habit to the work the team handed to a model, where an automation can look healthy long after the decision to trust it has expired. When did your team last inspect an automation it trusts, and what would the four checks find if you ran them this week?

Key Questions This Article Answers

What Is a Delegation Audit?
A Delegation Audit is a recurring 45- to 60-minute inspection of a team’s delegated AI work, run monthly or every other Sprint. It checks whether automated and AI-assisted workflows still meet the team’s standard, using the facilitation skills of a Retrospective. It is the Inspect step of the AI Delegation Lifecycle.

What Does a Delegation Audit Check?

Four things:

- Output and source drift: Does the work still meet its AI Definition of Done, and are the inputs still trustworthy?
- Model fit: Is the assigned model still the right one for the task and its risk level?
- Reversibility: Can you stop the automation today?
- Category creep: Has Assist work become unreviewed Automate?

How Is a Delegation Audit Different From a Retrospective?

Same skill, different subject. A Retrospective inspects how the team worked together. A Delegation Audit inspects how the team’s AI delegations are holding up, then turns each drift finding into a decision with an owner and a date.

Can Rust Have Zero-Cost Dependency Injection?

By Dmytro Brazhnyk

Overview

This article explores whether dependency injection (DI) can exist in Rust without sacrificing the language’s core philosophy of zero-cost abstractions. We will approach the question from three angles:

- Why dependency injection still matters in Rust, even for systems built with zero-sized types and compile-time guarantees.
- How DI evolved in other ecosystems, using Java as a reference point.
- A practical Rust-oriented approach to implementing DI with compile-time guarantees.

We’ll also show how Rust traits enable DI patterns that scale across crates, preserving zero-cost guarantees. All Rust source code used in this article is available in this repository.

Rust DI: The Problem Rust Hasn’t Solved Yet

Rust has solved problems most languages haven’t even dared to touch: memory safety without a garbage collector, fearless concurrency, and powerful zero-cost abstractions. But there is a class of problems Rust hasn’t fully confronted yet. Not because Rust is incapable, but because these problems exist above the machine level. They are not about memory safety or performance. They are about composition, modularity, and architectural correctness in large systems.

Managing dependencies between dozens or hundreds of components is fundamentally different from managing memory or threads. Rust gives us powerful primitives, but the question remains: How do we scale composition safely and maintainably?

What “Enterprise” Really Means in Rust Terms

When Rust developers hear enterprise, they often think slow, over-engineered, and bloated. But that perception is misleading. Enterprise systems are not bloated by accident. They are complex because composition eventually stops being trivial. The complexity comes from business requirements, not from the technology stack.
Enterprise: The Burden We Can’t Avoid

When a company reaches a certain scale, several things inevitably happen:

- Products serve thousands or millions of users
- Systems integrate with vendors, partners, and third-party services
- Teams work independently on modules and features
- Software must evolve continuously without stopping the business

These realities create architectural pressure. From a technical perspective, systems must support:

- Scalability: At multiple levels. In users and data, load can range from hundreds to millions or even billions of concurrent users; in structure, functional modules must interact across teams.
- Reliability: Systems run 24/7. Dependencies on vendors, partners, and third-party services make failures inevitable, so the system must continue operating despite them.
- Modularity: Independent teams need to work on isolated components without breaking other parts of the system.
- Flexibility: Infrastructure choices may change. Databases, messaging systems, or integrations might need to be swapped without rewriting the entire application.
- Observability: Teams must be able to detect and respond quickly to performance bottlenecks, integration failures, and unexpected behavior.
- Extensibility: New products, markets, and regulations require systems to evolve incrementally rather than being rebuilt from scratch.
- Maintainability: Every business decision introduces new dependencies, and every dependency increases the complexity of the system’s composition. The system must not become so convoluted that small changes introduce cascading errors.

Even with Rust’s ownership model and strong type system, manually managing this dependency graph eventually becomes impractical. These pressures are not theoretical: they define the daily reality of enterprise software engineering. Every design decision must balance immediate business needs with long-term sustainability, especially under high concurrent load.
Where Dependency Injection Becomes Relevant

This is exactly where dependency injection becomes useful. DI allows systems to manage complexity by separating what components need from how those dependencies are created and connected. In practice, this means:

- Components declare their dependencies without constructing them directly
- Dependencies are provided externally, keeping components isolated
- Systems evolve gradually without breaking existing modules
- Optional features and plugins can be integrated without tightly coupling the system

DI is not just a convenience. It is a structured approach to handling inevitable architectural complexity.

Enterprise Isn’t Just Complexity: It’s Heterogeneity

Large systems are rarely uniform. They typically contain:

- Independent components with their own dependency trees
- Stateful infrastructure such as databases, caches, and message brokers
- Optional features and plugin-style modules
- Multiple implementations of the same interface

This heterogeneity appears naturally over time. Systems accumulate tools built years apart, libraries maintained by different teams, and components that survive long after their original authors have moved on. Enterprise systems grow gradually, and they rarely get the chance to start over. Rust does not eliminate these pressures. Any real system eventually faces them.

Java’s Historical Perspective: DI Was Inevitable

Java did not adopt dependency injection because it was fashionable. It adopted DI because large systems were becoming impossible to manage without it. Without DI, developers quickly ran into familiar problems:

- Tight coupling between components
- Fragile initialization order
- Hard-coded dependencies scattered across the codebase
- Changes in one module unexpectedly breaking another

Dependency injection emerged as a discipline for managing complexity. Components declare what they depend on, and the system provides those dependencies when constructing the application. This separation allows systems to evolve without collapsing under their own architecture.

DI in a Nutshell

You can think of dependency injection as a kind of runtime composition system. If your application contains many services, modules, plugins, or optional components, something must assemble them and ensure they are wired correctly, and that role belongs to the DI system.

DI is conceptually similar to package managers such as Cargo or Maven, but it operates at a different level:

- Package managers resolve dependencies between libraries at build time.
- Dependency injection resolves dependencies between components at runtime.

Loading executable code into memory is easy: the operating system handles that. What is harder is creating objects, initializing them correctly, and ensuring that all components interact with the right dependencies. This becomes increasingly difficult as systems grow. Dependency injection addresses this problem directly.

How Dependency Injection Is Typically Solved in Java

Java provides one of the most mature ecosystems for dependency injection. Frameworks such as Spring or Guice automate object creation and dependency wiring almost entirely.

Let’s use a simple running example: a User Management API. We have two controllers:

- ReadController: retrieves users from a database
- WriteController: creates users and publishes events to a message broker

Both controllers depend on infrastructure services that must be created and wired correctly.

Without Dependency Injection

In a traditional manual setup, object creation and wiring might look like this:

    public class Application {
        public static void main(String[] args) {
            Database database = new PostgresDatabase();
            MessageBroker broker = new KafkaBroker();

            ReadController readController = new ReadController(database);
            WriteController writeController = new WriteController(database, broker);

            // start application
        }
    }

At first glance, this appears manageable. But as the application grows, the initialization code expands rapidly:

- Multiple infrastructure services
- Optional modules
- Configuration logic
- Conditional wiring depending on the environment

The main method eventually becomes responsible for constructing the entire dependency graph of the application. This approach becomes difficult to maintain and extremely fragile as the system evolves.

Dependency Injection With Spring

Dependency injection frameworks solve this by moving the responsibility of object creation and wiring to a container. Components simply declare what they need.

    @Service
    public class Database { }

    @Service
    public class KafkaBroker implements MessageBroker { }

    @RestController
    public class ReadController {
        private final Database database;

        @Autowired
        public ReadController(Database database) {
            this.database = database;
        }
    }

Dependencies are declared in constructors, and the DI container automatically provides the correct instances. The application no longer manually constructs the object graph. Instead, the framework scans components and resolves dependencies automatically.

Polymorphism in Java DI

Java DI frameworks also support multiple implementations of the same interface. For example, an application may support several message brokers simultaneously:

    @Service
    public class KafkaBroker implements MessageBroker { }

    @Service
    public class RabbitBroker implements MessageBroker { }

A controller can receive all implementations at once:

    @RestController
    public class WriteController {
        private final List<MessageBroker> brokers;

        @Autowired
        public WriteController(List<MessageBroker> brokers) {
            this.brokers = brokers;
        }
    }

The DI container automatically collects all implementations of MessageBroker and injects them into the controller.
This makes the system highly extensible:

- New brokers can be added
- Existing ones can be removed
- The controller remains unchanged

The Cost of Traditional DI

Java DI frameworks provide powerful capabilities, but they come with trade-offs:

- Dependency resolution happens at runtime
- Reflection is heavily used
- Errors may only appear during application startup
- Dependency graphs are not always fully visible to the compiler

This runtime flexibility works well for the Java ecosystem, but it introduces overhead and reduces compile-time guarantees. Rust, on the other hand, encourages a different philosophy: If something can be verified at compile time, it should be.

This raises an interesting question: Can Rust achieve the same flexibility of dependency injection while preserving compile-time guarantees and zero runtime cost?

Journey into Rust Coding

Let’s try to build a dependency injection approach in Rust gradually. We will follow the same conceptual example used in the Java section:

- A ReadController
- A WriteController
- Multiple implementations of a MessageBroker
- An abstraction for database connectivity

Rust Without Dependency Injection

In the first example, we will implement a small Rust application without dependency injection. However, we will introduce use-traits, which will later allow us to transition naturally to a dependency injection model.

1. Defining Database Interfaces

First, let’s define the interface used to access the database.

1.1 DatabaseConnection Trait

This trait represents an abstraction for database connectivity that can support multiple implementations (Postgres, MySQL, etc.).

    trait DatabaseConnection {
        fn read_query(&self, query: &str);
        fn write_query(&self, query: &str);
    }

1.2 UseDatabaseConnection Trait

Next, we define a trait that allows components to request a database connection from a context.

    trait UseDatabaseConnection {
        type T: DatabaseConnection;
        fn database_connection(&self) -> &Self::T;
    }

This trait will later be used as the foundation of dependency resolution. Instead of components knowing the entire application context, they simply declare that they require a DatabaseConnection. This keeps components decoupled from the full application structure.

2. Database Implementation

Now we provide a concrete implementation of DatabaseConnection.

    #[derive(Default)]
    struct PostgresDatabaseConnection {}

    impl DatabaseConnection for PostgresDatabaseConnection {
        fn read_query(&self, query: &str) {
            println!("Reading from Postgres DB: {}", query)
        }

        fn write_query(&self, query: &str) {
            println!("Writing into Postgres DB: {}", query)
        }
    }

For simplicity, this example only prints messages instead of connecting to a real database. In a real system, this could be implemented using any production database library.

3. Controllers

Now we define the controllers responsible for performing application logic.

3.1 Controller Structs

    #[derive(Default)]
    struct ReadController {}

    #[derive(Default)]
    struct WriteController {}

Rust allows structs with no fields. These zero-sized types have no runtime cost, but they still represent concrete types at compile time and can participate in abstractions.

3.2 Controller Use Traits

Next, we define traits that expose controllers to other components.

    trait UseReadController {
        fn read_controller(&self) -> &ReadController;
    }

    trait UseWriteController {
        fn write_controller(&self) -> &WriteController;
    }

These traits allow components to access controllers without knowing anything about the application context.

3.3 Controller Context

Now we combine the previously defined traits into a context trait.

    trait ControllerContext: UseDatabaseConnection + UseReadController + UseWriteController {}

This context describes the minimal environment required for controllers to function. Controllers will depend only on this trait instead of the full application context.

3.4 Controller Implementation

Now we implement the controller logic.

    impl ReadController {
        fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) {
            ctx.database_connection()
                .read_query(format!("SELECT * FROM table WHERE id = '{}'", argument).as_str());
        }
    }

    impl WriteController {
        fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) {
            ctx.database_connection().write_query(
                format!("UPDATE table SET value = 'new' WHERE id = '{}'", argument).as_str(),
            );
        }
    }

Notice something important here: The controllers do not know about the full application context. They only know about the traits they depend on. This means the controller and database code could already be extracted into separate crates, reusable by any application implementing the required use-traits.

4. Wiring the Application

Now we wire all components together.

4.1 Application Context

We define a struct that holds all application components.

    #[derive(Default)]
    struct ApplicationContext {
        read_controller: ReadController,
        write_controller: WriteController,
        postgres_database_connection: PostgresDatabaseConnection,
    }

This struct acts as the composition root of the application.

4.2 Implement Use Traits

Next we implement the previously defined traits.

    impl UseReadController for ApplicationContext {
        fn read_controller(&self) -> &ReadController {
            &self.read_controller
        }
    }

    impl UseWriteController for ApplicationContext {
        fn write_controller(&self) -> &WriteController {
            &self.write_controller
        }
    }

    impl UseDatabaseConnection for ApplicationContext {
        type T = PostgresDatabaseConnection;
        fn database_connection(&self) -> &Self::T {
            &self.postgres_database_connection
        }
    }

By implementing these traits, ApplicationContext becomes capable of providing dependencies to components.
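To illustrate what the use-trait buys in practice, here is a minimal sketch of a second composition root that swaps in a different database implementation. The names `RecordingDatabaseConnection`, `TestContext`, `load_user`, and `recorded_queries` are hypothetical and not part of the article's repository; the two traits are repeated so the block compiles on its own.

```rust
use std::cell::RefCell;

// Traits repeated from the article so this sketch is self-contained.
trait DatabaseConnection {
    fn read_query(&self, query: &str);
    fn write_query(&self, query: &str);
}

trait UseDatabaseConnection {
    type T: DatabaseConnection;
    fn database_connection(&self) -> &Self::T;
}

// Hypothetical test double: records queries instead of printing them.
// RefCell keeps the sketch short, but it is not Sync; a Mutex would be
// needed if this context had to be shared between threads.
#[derive(Default)]
struct RecordingDatabaseConnection {
    log: RefCell<Vec<String>>,
}

impl DatabaseConnection for RecordingDatabaseConnection {
    fn read_query(&self, query: &str) {
        self.log.borrow_mut().push(format!("read: {query}"));
    }

    fn write_query(&self, query: &str) {
        self.log.borrow_mut().push(format!("write: {query}"));
    }
}

// Hypothetical second composition root, intended for tests only.
#[derive(Default)]
struct TestContext {
    database: RecordingDatabaseConnection,
}

impl UseDatabaseConnection for TestContext {
    type T = RecordingDatabaseConnection;
    fn database_connection(&self) -> &Self::T {
        &self.database
    }
}

// A component written against the capability, not against a concrete context.
fn load_user<C: UseDatabaseConnection>(ctx: &C, id: &str) {
    ctx.database_connection()
        .read_query(&format!("SELECT * FROM table WHERE id = '{id}'"));
}

// Runs the component against the test context and returns what was recorded.
pub fn recorded_queries() -> Vec<String> {
    let ctx = TestContext::default();
    load_user(&ctx, "42");
    // Bind to a local so the RefCell borrow ends before `ctx` is dropped.
    let queries = ctx.database.log.borrow().clone();
    queries
}
```

The component code is unchanged between the two contexts; only the associated type `T` differs, and the compiler monomorphizes `load_user` separately for each context.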
4.3 Controller Context Implementation

    impl ControllerContext for ApplicationContext {}

Since ApplicationContext already implements the required supertraits, this empty impl is all that is needed to satisfy ControllerContext.

5. Running the Application

Finally, we run the application.

    pub fn run() {
        let ctx = ApplicationContext::default();
        ctx.read_controller().do_something(&ctx, "argument");
        ctx.write_controller().do_something(&ctx, "argument");
    }

Key characteristics of this approach:

- No dyn traits
- No Arc or Rc
- No runtime dependency container

All wiring is resolved at compile time through generics and monomorphization.

Multi-Threading

An attentive reader may ask: Will this approach work in multi-threaded environments?

In Rust, thread safety is typically ensured using the Send and Sync traits. These traits are automatically implemented by the compiler if all fields of a struct are also Send + Sync. We can verify thread safety with a compile-time assertion:

    const _: () = {
        const fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<ApplicationContext>();
    };

If this compiles, the entire application context can safely be shared between threads.

In real systems, some components (such as database connections) may not be inherently thread-safe. In such cases, a connection pool or synchronization mechanisms such as Mutex are required. This limitation is not related to the dependency injection approach itself, but rather to shared resource management in concurrent systems.

What the Compiler Actually Generates

If we inspect the compiled output with:

    cargo asm rust_di_example::main

    ...
    lea rbx, [rsp + 32]
    mov rdx, rbx
    call qword ptr [rip + _ZN5alloc3fmt6format12format_inner17he42ed4cf3cdc276bE@GOTPCREL]
    movups xmm0, xmmword ptr [rsp]
    mov rax, qword ptr [rsp + 16]
    movups xmmword ptr [rsp + 48], xmm0
    mov qword ptr [rsp + 64], rax
    mov qword ptr [rsp + 32], r14
    mov qword ptr [rsp + 40], 18
    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], r13
    lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.26]
    mov rsi, rsp
    call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
    lea rax, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.27]
    mov qword ptr [rsp + 32], rax
    mov qword ptr [rsp + 40], 19
    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], r13
    lea r14, [rsp + 48]
    mov qword ptr [rsp + 16], r14
    lea r15, [rip + _ZN60_$LT$alloc..string..String$u20$as$u20$core..fmt..Display$GT$3fmt17h9d11f1d81b352ac8E]
    mov qword ptr [rsp + 24], r15
    lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.7]
    mov rsi, rsp
    call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
    lea rax, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.28]
    mov qword ptr [rsp + 32], rax
    mov qword ptr [rsp + 40], 21
    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], r13
    mov qword ptr [rsp + 16], r14
    mov qword ptr [rsp + 24], r15
    lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.8]
    mov rsi, rsp
    call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
    ...

The assembly is extremely flat: a series of calls to _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL, which is simply the printing routine in the Rust standard library. There are no runtime dependency resolution mechanisms, no dynamic dispatch, and no container logic.
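One part of the zero-cost claim can also be checked without reading assembly. The sketch below assumes the stateless structs from the example (repeated here so the block compiles on its own) and asserts at compile time that the whole context occupies zero bytes; `context_size` is a hypothetical helper added only for illustration.

```rust
use std::mem::size_of;

// Stateless components, as in the article's example.
#[derive(Default)]
struct ReadController {}

#[derive(Default)]
struct WriteController {}

#[derive(Default)]
struct PostgresDatabaseConnection {}

#[derive(Default)]
struct ApplicationContext {
    read_controller: ReadController,
    write_controller: WriteController,
    postgres_database_connection: PostgresDatabaseConnection,
}

// Compile-time check: the build fails if the context stops being zero-sized.
const _: () = assert!(size_of::<ApplicationContext>() == 0);

// Hypothetical helper that exposes the same number at runtime.
pub fn context_size() -> usize {
    size_of::<ApplicationContext>()
}
```

This only holds while every component is stateless. A real connection pool or client would give the context a non-zero size, and the check says nothing about call overhead; it complements the assembly inspection rather than replacing it.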
The generated code mostly contains calls to standard library functions such as printing. This demonstrates that the abstractions introduced here do not introduce runtime overhead.

Why Use-Traits Matter

At first glance, the use-trait might look like unnecessary indirection. Why not simply pass ApplicationContext directly to every component?

The reason is crate-level decoupling. Enterprise applications often grow into multiple crates. Controllers, database access layers, messaging integrations, and domain logic are very often implemented as reusable libraries. For example, a Spring Boot Actuator-style module may contain every layer down to database access, provide REST API endpoints, and integrate with a monitoring aggregator service; it acts as a standalone sub-program.

However, if a component directly depends on ApplicationContext, it becomes tied to the executable crate that defines it. That creates an architectural problem:

- Libraries would depend on the application crate
- The application crate would depend on the libraries

This circular dependency makes reuse impossible, and Cargo rejects dependency cycles between crates outright.

Use-traits solve this by defining capability-based interfaces. Instead of depending on the application context, components depend only on the capabilities they require. Example:

    trait UseDatabaseConnection {
        type T: DatabaseConnection;
        fn database_connection(&self) -> &Self::T;
    }

A controller does not know anything about the application structure. It simply requires that the context provides access to a database connection.

    impl ReadController {
        fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) {
            ctx.database_connection()
                .read_query(format!("SELECT * FROM table WHERE id = '{}'", argument).as_str());
        }
    }

Because of this design:

- ReadController can live in its own crate
- The crate only exports traits describing the capabilities it needs
- Any application can use the controller by implementing those traits

The application context becomes an adapter, wiring together independent components.

    Application
    ├── implements UseDatabaseConnection
    ├── implements UseReadController
    └── implements UseWriteController

This pattern enables a powerful architectural property: Components become fully reusable libraries, while the application remains responsible only for wiring them together. In other words, use-traits allow dependency injection to cross crate boundaries while preserving Rust’s compile-time guarantees. Without this indirection, the system collapses into a monolithic application context that cannot be decomposed into reusable modules.

Limitations of This Approach

Although this example demonstrates many useful properties, it is not yet a complete dependency injection system. The main limitation is that ApplicationContext still has too much knowledge about component internals.

In real DI frameworks, modules often contain many components, initialization logic, and internal dependencies. For example, consider a Spring Boot module such as Spring Data. When you add the dependency to your project, it automatically provides:

- Database driver integration
- Connection pooling
- Repository interfaces
- Transaction management
- Entity scanning
- Metrics integration
- Health check integration

All of this functionality is assembled automatically by the DI framework. From the application developer’s perspective, only minimal configuration is required. Real dependency injection modules therefore consist of entire subgraphs of components, not just individual services. In our example we intentionally introduced two controllers to demonstrate that even a simple module may contain multiple cooperating components.

A complete dependency injection framework must also manage:

- Module composition
- Initialization lifecycle
- Dependency resolution
- Optional components
- Multiple implementations

This is where the real challenge begins.

Rust With Dependency Injection

To implement dependency injection in Rust, we will build iteratively. We start from the previous “no DI” approach and gradually close the gap toward a complete DI system.

The good news is that we already have use-traits, and our components are decoupled. We can extract certain code into reusable modules. What’s missing for a true dependency injection system:

- ApplicationContext still has too much knowledge about the components it uses.
- Some wiring and initialization steps are still manual.

Our goal is to move the wiring into DI modules, giving each component full control over how it is connected. Because we are still targeting compile-time injection, we cannot rely on runtime reflection (like Java DI frameworks do). Instead, we will push this logic into Rust macros, allowing compile-time wiring while preserving zero-cost abstractions.

1. Registering Components in ApplicationContext

In traditional DI, the application knows which modules it depends on (like Spring Data). But modules themselves should control which components they export.

In our previous example, ApplicationContext was a struct, and registering a component meant adding a field manually. This ties the application to module internals. We need a way to add fields to ApplicationContext automatically, without putting module-specific code into the executable.

We can achieve this using the combine-structs crate, which provides macros to embed multiple structs into one. Each module defines an embeddable struct as a context extension. When imported, ApplicationContext automatically merges all fields from these extensions.

1.1 Context Extension for PostgreSQL

    #[allow(dead_code)]
    #[derive(Fields)]
    struct PostgresDatabaseContextExtension {
        postgres_database_connection: PostgresDatabaseConnection,
    }

The Fields derive macro allows this struct to be merged into ApplicationContext.
1.2 Context Extension for Controllers

    #[allow(dead_code)]
    #[derive(Fields)]
    struct ControllerContextExtension {
        read_controller: ReadController,
        write_controller: WriteController,
    }

The controller module exports two controllers. More components can be added without touching the main executable.

1.3 Embedding Context Extensions

    #[combine_fields(PostgresDatabaseContextExtension, ControllerContextExtension)]
    #[derive(Default)]
    struct ApplicationContext {}

The combine_fields macro merges all fields from the context extensions. ApplicationContext now contains every exported component as a field, without the executable listing any of them by hand.

2. Providing Use-Traits

Previously, wiring was done via use-traits. Now that ApplicationContext doesn’t know which components exist, modules must export use-trait implementations via macros.

2.1 Macro for Database Connectivity

    macro_rules! inject_postgres_impl {
        () => {
            impl UseDatabaseConnection for ApplicationContext {
                type T = PostgresDatabaseConnection;
                fn database_connection(&self) -> &Self::T {
                    &self.postgres_database_connection
                }
            }
        };
    }

2.2 Macro for Controllers

    macro_rules! inject_controller_impl {
        () => {
            impl UseReadController for ApplicationContext {
                fn read_controller(&self) -> &ReadController {
                    &self.read_controller
                }
            }
            impl UseWriteController for ApplicationContext {
                fn write_controller(&self) -> &WriteController {
                    &self.write_controller
                }
            }
            impl ControllerContext for ApplicationContext {}
        };
    }

2.3 Injecting Components

    #[combine_fields(PostgresDatabaseContextExtension, ControllerContextExtension)]
    #[derive(Default)]
    struct ApplicationContext {}

    inject_postgres_impl!();
    inject_controller_impl!();

The executable only calls these macros. Components remain isolated from the main application, and the wiring happens automatically.

3. Intermediate Conclusion

At this stage:

- No component code has been changed.
- Modules can add or remove components freely.
- Components are decoupled from each other and from the container.
- Wiring happens automatically through macros and use-traits.
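The wiring pattern described so far can be approximated without any external crates. The following is only a minimal sketch under stated assumptions: the ApplicationContext struct is written by hand as a stand-in for what combine_fields would generate, read_query returns a String instead of printing (an assumption made purely so the result can be checked), and the controller is reduced to a single method.

```rust
// Abstraction that components program against.
// Assumption: returns a String instead of printing, so the result can be checked.
trait DatabaseConnection {
    fn read_query(&self, query: &str) -> String;
}

// Use-trait: any context implementing it can hand out a database connection.
trait UseDatabaseConnection {
    type T: DatabaseConnection;
    fn database_connection(&self) -> &Self::T;
}

#[derive(Default)]
struct PostgresDatabaseConnection;

impl DatabaseConnection for PostgresDatabaseConnection {
    fn read_query(&self, query: &str) -> String {
        format!("Reading from Postgres DB: {query}")
    }
}

// The controller depends only on the use-trait, never on the concrete context.
#[derive(Default)]
struct ReadController;

impl ReadController {
    fn do_something<C: UseDatabaseConnection>(&self, ctx: &C, id: &str) -> String {
        let query = format!("SELECT * FROM table WHERE id = '{id}'");
        ctx.database_connection().read_query(&query)
    }
}

// Hand-written stand-in for the struct that combine_fields would assemble.
#[derive(Default)]
struct ApplicationContext {
    postgres_database_connection: PostgresDatabaseConnection,
    read_controller: ReadController,
}

// The module exports its wiring as a macro; the executable only invokes it.
macro_rules! inject_postgres_impl {
    () => {
        impl UseDatabaseConnection for ApplicationContext {
            type T = PostgresDatabaseConnection;
            fn database_connection(&self) -> &Self::T {
                &self.postgres_database_connection
            }
        }
    };
}

inject_postgres_impl!();

pub fn run() -> String {
    let ctx = ApplicationContext::default();
    ctx.read_controller.do_something(&ctx, "argument")
}
```

Because do_something is generic over C, the call into PostgresDatabaseConnection is resolved statically; swapping the database only requires a different injection macro, not a change to the controller.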
This gives us a bare-minimum dependency injection system: application components are decoupled, wiring is automatic, and no single component needs full knowledge of the application.

4. Limitations

Even though we now have a working DI mechanism, it isn’t fully production-ready:

- Initialization: Components may require setup before wiring.
- Lifecycle Management: Controlling initialization order, cleanup, or optional components can be challenging.

Next, we will explore a Rust DI framework capable of automating component initialization and lifecycle management, moving closer to a complete solution.

Dependency Injection and Initialization Cycle in Rust

So far, we have built a dependency injection (DI) container where all components are stored as fields in ApplicationContext. The next challenge is initializing these components.

The goal is to:

- Enumerate the fields of ApplicationContext.
- Identify which fields require initialization.
- Call an initialization method for each such component.

Since we want everything to happen at compile time, we need a macro to generate a Rust method that calls init() on every tagged component without runtime loops or collections.

I could not find an existing macro for this, so I implemented one myself. If you want the details, check the implementation here: di_macro/src/lib.rs. We will focus on how to use this macro, not how it works internally.

Macro Example: Enumerating Tagged Fields

Full example code: struct_enumerator.rs

1. Define a Struct with Tagged Fields

    #[allow(dead_code)]
    #[derive(Debug, FieldEnumerator, Default)]
    pub struct MyStruct {
        #[tag(init_listener)]
        field_1: i32,
        #[tag(init_listener)]
        #[tag(start_listener)]
        field_1_2: i32,
        field_2: i32,
        #[tag(start_listener)]
        field_3: i32,
    }

- FieldEnumerator is our custom derive macro.
- Fields can have one or more tags (init_listener, start_listener).

2. Define a Callback Macro

macro_rules!
my_callback { ($struct_name:ident, $field_name:ident, $listener_type:ident) => { println!( "struct = {}, field = {}, type = {}", stringify!($struct_name), stringify!($field_name), stringify!($listener_type), ) }; } For every tagged field, the callback macro is called at compile time.Arguments passed: struct_name, field_name, and listener_type. 3. Invoke the Field Enumerator pub fn run() { let my_struct = MyStruct::default(); println!("my_struct = {:?}", my_struct); enumerate_tags_MyStruct_init_listener!(my_callback); enumerate_tags_MyStruct_start_listener!(my_callback); } enumerate_tags_MyStruct_init_listener! and enumerate_tags_MyStruct_start_listener! are generated automatically by the FieldEnumerator macro.The macro expands into a flat sequence of println!() calls. Macro Example Output: // enumerate_tags_MyStruct_init_listener!(my_callback); // my_callback!(MyStruct, field_1, init_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1", "init_listener") // my_callback!(MyStruct, field_1_2, init_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1_2", "init_listener") //enumerate_tags_MyStruct_start_listener!(my_callback) // my_callback!(MyStruct, field_1_2, start_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1_2", "start_listener") // my_callback!(MyStruct, field_3, start_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_3", "start_listener") Notice: No vectors, arrays, loops, or runtime collections — everything happens at compile time. Rust Dependency Injection with Initialization We can now use the same macro to enumerate all fields in ApplicationContext and initialize them. Code reference: di_init.rs We introduce a Configuration component to demonstrate how initialization can depend on runtime data. 1. 
Configuration Module

    #[derive(Default)]
    struct Configuration {
        run_arguments: &'static str,
    }

    #[allow(dead_code)]
    #[derive(Fields, Default)]
    struct ConfigurationContextExtension {
        configuration: Configuration,
    }

    trait UseConfiguration {
        fn configuration(&self) -> &Configuration;
        fn configuration_mut(&mut self) -> &mut Configuration;
    }

    macro_rules! inject_configuration_impl {
        () => {
            impl UseConfiguration for ApplicationContext {
                fn configuration(&self) -> &Configuration {
                    &self.configuration
                }
                fn configuration_mut(&mut self) -> &mut Configuration {
                    &mut self.configuration
                }
            }
        };
    }

Steps:

- Define the component struct (Configuration).
- Define a context extension for ApplicationContext.
- Define a use-trait (UseConfiguration) for wiring.
- Provide a macro to implement the trait on ApplicationContext.

Note: Configuration is no longer zero-sized; it contains runtime data (run_arguments).

2. Database Connection Initialization

2.1 Update PostgresDatabaseConnection

    #[derive(Default)]
    struct PostgresDatabaseConnection {
        connection_string: String,
    }

- The connection now contains runtime data.
- Its initialization depends on the configuration.

2.2 Tag Component for Initialization

    #[allow(dead_code)]
    #[derive(Fields, ContextExtension)]
    struct PostgresDatabaseContextExtension {
        #[tag(init_listener)]
        postgres_database_connection: PostgresDatabaseConnection,
    }

The init_listener tag signals that the component requires initialization.

2.3 Define Initializable Trait

    trait Initializable<C> {
        fn init(ctx: &mut C);
    }

Components implementing this trait can be initialized automatically.

2.4 Implement Initialization

    impl<C: UseConfiguration + UsePostgresDatabaseConnection> Initializable<C>
        for PostgresDatabaseConnection
    {
        fn init(ctx: &mut C) {
            println!("Init sequence = {}", ctx.configuration().run_arguments);
            ctx.postgres_database_connection_mut().connection_string =
                format!("Postgres DB on {}", ctx.configuration().run_arguments);
        }
    }

The init function receives the whole context mutably, so a component can read other components (here, Configuration) through their use-traits while it initializes its own state.
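Because the Initializable implementation above is bounded only by use-traits, it should work with any context that satisfies those bounds, not just ApplicationContext. The sketch below illustrates this with a hand-written MinimalContext. Several names are assumptions for illustration: MinimalContext itself, the definition of UsePostgresDatabaseConnection (the trait is referenced in the text but its definition is not shown), and the reduced UseConfiguration with only a read accessor.

```rust
#[derive(Default)]
struct Configuration {
    run_arguments: &'static str,
}

#[derive(Default)]
struct PostgresDatabaseConnection {
    connection_string: String,
}

// Reduced to the read accessor; the mutable accessor is not needed here.
trait UseConfiguration {
    fn configuration(&self) -> &Configuration;
}

// Assumed shape of the use-trait; its definition is not shown in the text.
trait UsePostgresDatabaseConnection {
    fn postgres_database_connection(&self) -> &PostgresDatabaseConnection;
    fn postgres_database_connection_mut(&mut self) -> &mut PostgresDatabaseConnection;
}

trait Initializable<C> {
    fn init(ctx: &mut C);
}

impl<C: UseConfiguration + UsePostgresDatabaseConnection> Initializable<C>
    for PostgresDatabaseConnection
{
    fn init(ctx: &mut C) {
        // Build the value from the shared borrow first, then write it
        // through the mutable accessor.
        let connection_string =
            format!("Postgres DB on {}", ctx.configuration().run_arguments);
        ctx.postgres_database_connection_mut().connection_string = connection_string;
    }
}

// Hypothetical minimal context, e.g. for a unit test of the component.
#[derive(Default)]
struct MinimalContext {
    configuration: Configuration,
    connection: PostgresDatabaseConnection,
}

impl UseConfiguration for MinimalContext {
    fn configuration(&self) -> &Configuration {
        &self.configuration
    }
}

impl UsePostgresDatabaseConnection for MinimalContext {
    fn postgres_database_connection(&self) -> &PostgresDatabaseConnection {
        &self.connection
    }
    fn postgres_database_connection_mut(&mut self) -> &mut PostgresDatabaseConnection {
        &mut self.connection
    }
}

pub fn run() -> String {
    let mut ctx = MinimalContext::default();
    ctx.configuration.run_arguments = "DB_URL=127.0.0.1:5555";
    <PostgresDatabaseConnection as Initializable<MinimalContext>>::init(&mut ctx);
    ctx.postgres_database_connection().connection_string.clone()
}
```

The design choice worth noting is that init is an associated function taking the context, not a method on the component: the component reaches its own state through the same use-trait that every other consumer uses.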
2.5 Prepare ApplicationContext

    #[combine_fields(
        ConfigurationContextExtension,
        PostgresDatabaseContextExtension,
        ControllerContextExtension
    )]
    #[derive(Default, FieldEnumerator)]
    struct ApplicationContext {}

    inject_postgres_impl!();
    inject_controller_impl!();
    inject_configuration_impl!();

- Added FieldEnumerator for tag enumeration.
- Configuration module bindings included.

2.6 Initialization Sequence

    impl ApplicationContext {
        fn init(&mut self) {
            fn call_init<T: Initializable<ApplicationContext>, F: Fn(ApplicationContext) -> T>(
                ctx: &mut ApplicationContext,
                _closure: F,
            ) {
                T::init(ctx);
            }

            macro_rules! init_callback {
                ($struct_name:ident, $field_name:ident, $listener_type:ident) => {
                    call_init(self, |x| x.$field_name);
                };
            }

            enumerate_tags_ApplicationContext_init_listener!(init_callback);
        }
    }

How it works:

call_init function

- This helper function takes a generic type T that implements Initializable<ApplicationContext>.
- It also takes a closure _closure of type Fn(ApplicationContext) -> T. The closure is never called; it exists only so that the compiler can see the type of the field.
- The trick here: the compiler infers T from the closure’s return type, so T::init(ctx) resolves to the Initializable implementation of the field’s concrete type.

init_callback! macro

- The macro expands for each field tagged with init_listener.
- It calls call_init with a closure that selects the field, ensuring the proper Initializable implementation is invoked.

enumerate_tags_ApplicationContext_init_listener! macro

- This macro expands once for every field in ApplicationContext that is marked with #[tag(init_listener)].
- For each field, it invokes init_callback!, which triggers Initializable::init for that specific component.

Key Rust trick: By using the Fn bound and generics in call_init, the compiler resolves the actual type of the field at compile time. This avoids any runtime type checks and ensures zero-cost initialization while keeping strong type safety.
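The type-inference trick can be tried in isolation with the standard library only. This is a sketch under stated assumptions, not the article’s implementation: enumerate_init_listeners! is written by hand to stand in for the macro that FieldEnumerator would generate, the callback takes only the field name, and the Database, Cache, and init_order names are hypothetical and exist only to make the call order observable.

```rust
trait Initializable<C> {
    fn init(ctx: &mut C);
}

#[derive(Default)]
struct Database {
    connection_string: String,
}

#[derive(Default)]
struct Cache;

#[derive(Default)]
struct ApplicationContext {
    run_arguments: &'static str,
    database: Database,
    cache: Cache,
    // Hypothetical field: records the order of init calls for the example.
    init_order: Vec<&'static str>,
}

impl Initializable<ApplicationContext> for Database {
    fn init(ctx: &mut ApplicationContext) {
        ctx.database.connection_string = format!("Postgres DB on {}", ctx.run_arguments);
        ctx.init_order.push("database");
    }
}

impl Initializable<ApplicationContext> for Cache {
    fn init(ctx: &mut ApplicationContext) {
        ctx.init_order.push("cache");
    }
}

// Hand-written stand-in for the enumerator a derive macro would generate:
// one callback invocation per tagged field, in declaration order.
macro_rules! enumerate_init_listeners {
    ($callback:ident) => {
        $callback!(database);
        $callback!(cache);
    };
}

// The closure is never called. Its return type lets the compiler infer T.
fn call_init<T, F>(ctx: &mut ApplicationContext, _field: F)
where
    T: Initializable<ApplicationContext>,
    F: Fn(ApplicationContext) -> T,
{
    T::init(ctx);
}

impl ApplicationContext {
    fn init(&mut self) {
        let ctx = self;
        macro_rules! init_callback {
            ($field_name:ident) => {
                call_init(&mut *ctx, |x| x.$field_name)
            };
        }
        // Expands to two plain function calls; no loop, no collection of components.
        enumerate_init_listeners!(init_callback);
    }
}

pub fn run() -> (String, Vec<&'static str>) {
    let mut ctx = ApplicationContext {
        run_arguments: "DB_URL=127.0.0.1:5555",
        ..Default::default()
    };
    ctx.init();
    (ctx.database.connection_string, ctx.init_order)
}
```

The closure `|x| x.database` has return type Database, so the first call is checked as call_init::<Database, _> and the second as call_init::<Cache, _>. If a tagged field’s type did not implement Initializable, the program would fail to compile rather than fail at startup.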
2.7 Running the Application

    pub fn run() {
        let mut ctx = ApplicationContext::default();
        ctx.configuration_mut().run_arguments = "DB_URL=127.0.0.1:5555";
        ctx.init();
        ctx.read_controller().do_something(&ctx, "argument");
        ctx.write_controller().do_something(&ctx, "argument");
    }

Sample Output:

    Init sequence = DB_URL=127.0.0.1:5555
    Reading from Postgres DB on DB_URL=127.0.0.1:5555: SELECT * FROM table WHERE id = 'argument'
    Writing into Postgres DB on DB_URL=127.0.0.1:5555: UPDATE table SET value = 'new' WHERE id = 'argument'

The run_arguments value was successfully propagated into the component’s runtime data.

Performance Considerations

In this demo, some structs now hold runtime data, but this is intentional. It’s added to demonstrate initialization, just like in real applications where components manage runtime state.

The wiring mechanism itself remains zero-cost: all bindings are resolved at compile time through monomorphization. Even with the initialization sequence broadcasting multiple init calls, the compiler generates a flat sequence of calls: no loops, no runtime collections, and no dynamic dispatch. The set and order of calls is fixed at compile time; only the init bodies themselves run at startup.

Limitations

At this point the approach covers wiring, decoupling, and initialization. What it does not yet cover are more advanced topics, such as polymorphism and more complex runtime behaviors, which we explore next.

Dependency Injection and Polymorphism

This is the final example of the article and introduces what I would consider an advanced topic for the core engine of any dependency injection framework: polymorphism.

Many DI frameworks handle basic dependency wiring well. For example, Java Spring Boot provides a very mature implementation. However, in many other DI implementations, one important capability is often missing: the ability to handle multiple implementations of the same abstraction in a flexible and compile-time-safe way.

Let’s extend our example with a new requirement.
New Requirement Our application should support multiple message brokers, for example: KafkaRabbitMQ After writing data to the database, the controller should publish a message to one or more brokers. However: The component does not know which brokers existThe container may contain multiple brokersThe DI framework must maintain this one-to-many relationship One component should be able to call many broker implementations without knowing which ones exist. To make things even more interesting, we introduce the concept of profiles. Each profile represents a different configuration of the application context. Example: Profile1 PostgreSQL databaseKafka brokerRabbitMQ broker Profile2 Oracle databaseRabbitMQ broker only See the complete example. Injection Macros and Profiles First, we slightly modify our injection macros so they accept the application context type as an argument. macro_rules! inject_configuration_impl { ($ctx:ident) => { impl UseConfiguration for $ctx { fn configuration(&self) -> &Configuration { &self.configuration } fn configuration_mut(&mut self) -> &mut Configuration { &mut self.configuration } } }; } This change is necessary because the DI module does not know which profile will be used. Each executable can choose a different application context profile, and the macros must work with whichever profile is selected. Oracle Database Component Now we introduce a new database implementation. #[allow(dead_code)] #[derive(Fields, ContextExtension)] struct OracleDatabaseContextExtension {} And the injection macro: macro_rules! inject_oracle_impl { ($ctx: ident) => { impl DatabaseConnection for $ctx { fn read_query(&self, query: &str) { println!("Reading from Oracle DB: {}", query) } fn write_query(&self, query: &str) { println!("Writing into Oracle DB {}", query) } } impl UseDatabaseConnection for $ctx { type T = $ctx; fn database_connection(&self) -> &Self::T { self } } }; } Here we apply a small trick. 
Instead of defining a separate struct for the database connection, we implement the trait directly on the application context. This approach avoids additional boilerplate and works well when we know there will only be one database implementation per profile. Defining Message Brokers Now we define the abstraction for message brokers. Broker Interface trait BrokerSender { fn send_to_broker(&self, value: &str); } RabbitMQ Broker #[allow(dead_code)] #[derive(Default, Fields, ContextExtension)] struct RabbitMqContextExtension { #[tag(broker)] rabbit_mq: RabbitMq, } #[derive(Default)] struct RabbitMq; impl BrokerSender for RabbitMq { fn send_to_broker(&self, value: &str) { println!("{} sent to RabbitMq", value); } } Notice the important detail: #[tag(broker)] This tag allows the DI framework to enumerate all brokers automatically using the same mechanism we previously used for initialization. Kafka Broker Kafka is implemented in exactly the same way. #[allow(dead_code)] #[derive(Default, Fields, ContextExtension)] struct KafkaContextExtension { #[tag(broker)] kafka: Kafka, } #[derive(Default)] struct Kafka; impl BrokerSender for Kafka { fn send_to_broker(&self, value: &str) { println!("{} sent to Kafka", value); } } Publisher — Compile-Time Polymorphism Now comes the most interesting part. We define a Publisher component that sends messages to all available brokers. trait Publisher { fn publish(&self, value: &str); } Injection macro: macro_rules! inject_publisher_impl { ($ctx:ident) => { impl Publisher for $ctx { fn publish(&self, value: &str) { macro_rules! broker_callback { ($struct_name:ident, $field_name:ident, $listener_type:ident) => { self.$field_name.send_to_broker(value); }; } enumerate_tags!($ctx, broker, broker_callback); } } impl UsePublisher for $ctx { type T = $ctx; fn publisher(&self) -> &Self::T { self } } }; } The key idea: the publisher does not know which brokers exist. 
Instead, the FieldEnumerator macro generates code that calls send_to_broker for each tagged broker. This gives us: One-to-many relationshipCompile-time wiringNo dynamic dispatchNo runtime overhead Helper Macro for Tag Enumeration macro_rules! enumerate_tags { ($ctx:ident, $tag:ident, $callback:ident) => { paste! { [<enumerate_tags_ $ctx _ $tag >]!($callback) } }; } This macro simply dispatches to the procedural macro generated earlier. Application Profiles Now we define two different application contexts. Profile 1 #[combine_fields( ConfigurationContextExtension, PostgresDatabaseContextExtension, ControllerContextExtension, PublisherExtension, RabbitMqContextExtension, KafkaContextExtension )] #[derive(Default, FieldEnumerator)] struct ApplicationProfile1 {} Profile1 includes: PostgreSQLRabbitMQKafka Profile 2 #[combine_fields( ConfigurationContextExtension, OracleDatabaseContextExtension, ControllerContextExtension, PublisherExtension, RabbitMqContextExtension )] #[derive(Default, FieldEnumerator)] struct ApplicationProfile2 {} Profile2 includes: Oracle databaseRabbitMQ brokerno Kafka Initialization Macro for Context We move the previously used initialization logic into a reusable macro: macro_rules! application_context { ($ctx: ident) => { const _: () = { const fn assert_send_sync<T: Send + Sync>() {} assert_send_sync::<$ctx>(); }; impl Initializable<$ctx> for $ctx { fn init(ctx: &mut $ctx) { fn call_init<T: Initializable<$ctx>, F: Fn($ctx) -> T>( ctx: &mut $ctx, _closure: F, ) { T::init(ctx); } macro_rules! 
init_callback { ($struct_name:ident, $field_name:ident, $listener_type:ident) => { call_init(ctx, |x| x.$field_name); }; } enumerate_tags!($ctx, init_listener, init_callback); } } }; } Wiring Profiles Profile1 application_context!(ApplicationProfile1); inject_postgres_impl!(ApplicationProfile1); inject_controller_impl!(ApplicationProfile1); inject_configuration_impl!(ApplicationProfile1); inject_publisher_impl!(ApplicationProfile1); inject_rabbit_mq_impl!(ApplicationProfile1); inject_kafka_impl!(ApplicationProfile1); Profile2 application_context!(ApplicationProfile2); inject_oracle_impl!(ApplicationProfile2); inject_controller_impl!(ApplicationProfile2); inject_configuration_impl!(ApplicationProfile2); inject_publisher_impl!(ApplicationProfile2); inject_rabbit_mq_impl!(ApplicationProfile2); Running the Example fn do_run<T: Initializable<T> + Default + UseConfiguration + ControllerContext>() { let mut ctx = T::default(); ctx.configuration_mut().run_arguments = "DB_URL=127.0.0.1:5555"; T::init(&mut ctx); ctx.read_controller().do_something(&ctx, "argument"); ctx.write_controller().do_something(&ctx, "argument"); } pub fn run() { println!("Running Profile1"); do_run::<ApplicationProfile1>(); println!(); println!("Running Profile2"); do_run::<ApplicationProfile2>(); } Example Output Running Profile1 Configuration = DB_URL=127.0.0.1:5555 PostgresDB connection init sequence = DB_URL=127.0.0.1:5555 Reading from Postgres DB... Writing into Postgres DB... WriteController 'argument' sent to RabbitMq WriteController 'argument' sent to Kafka Running Profile2 Configuration = DB_URL=127.0.0.1:5555 Reading from Oracle DB... Writing into Oracle DB... 
WriteController 'argument' sent to RabbitMq

Final Result

With this approach we achieved:

- Compile-time polymorphism
- One-to-many dependency injection
- Profile-based application configuration
- No dynamic dispatch
- No runtime container
- Fully monomorphized wiring

Everything is resolved at compile time while still supporting flexible application configurations.

Conclusion: Can Rust Have Zero-Cost Dependency Injection?

Throughout this article we explored whether Dependency Injection can exist in Rust without introducing runtime overhead. Traditional DI frameworks in languages such as Java rely heavily on reflection, runtime containers, dynamic dispatch, and runtime graph construction. These features make frameworks like Spring Boot extremely flexible, but they also introduce runtime complexity and performance costs.

Rust approaches the problem differently. Instead of relying on runtime containers, the examples in this article demonstrate how compile-time composition can be used to build a dependency injection system. Using traits, generics, procedural macros, and compile-time code generation, we can construct an application context where:

- Component wiring happens at compile time
- Dependencies are resolved through traits and generics
- Initialization logic can be generated statically
- Polymorphism can be implemented without dynamic dispatch

Because Rust performs monomorphization during compilation, every dependency binding is resolved into concrete function calls. This means the final binary contains no reflection, no dynamic lookup tables, and no runtime dependency container.

In other words, dependency injection becomes a compile-time architectural pattern rather than a runtime framework.
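As a self-contained illustration of one-to-many polymorphism without dynamic dispatch, the broker fan-out can be sketched with the standard library alone. The assumptions are labeled in the code: the two enumerate_brokers_* macros are written by hand to stand in for what FieldEnumerator would generate per profile, send_to_broker returns a String instead of printing, and publish collects those strings into a Vec purely so the behavior can be checked.

```rust
// Assumption: returns the message instead of printing, so it can be checked.
trait BrokerSender {
    fn send_to_broker(&self, value: &str) -> String;
}

#[derive(Default)]
struct RabbitMq;

#[derive(Default)]
struct Kafka;

impl BrokerSender for RabbitMq {
    fn send_to_broker(&self, value: &str) -> String {
        format!("{value} sent to RabbitMq")
    }
}

impl BrokerSender for Kafka {
    fn send_to_broker(&self, value: &str) -> String {
        format!("{value} sent to Kafka")
    }
}

#[derive(Default)]
struct ApplicationProfile1 {
    rabbit_mq: RabbitMq,
    kafka: Kafka,
}

#[derive(Default)]
struct ApplicationProfile2 {
    rabbit_mq: RabbitMq,
}

// Hand-written stand-ins for the per-profile enumerators a derive macro
// would generate from the #[tag(broker)] fields.
macro_rules! enumerate_brokers_profile1 {
    ($callback:ident) => {
        $callback!(rabbit_mq);
        $callback!(kafka);
    };
}

macro_rules! enumerate_brokers_profile2 {
    ($callback:ident) => {
        $callback!(rabbit_mq);
    };
}

// Assumption: the Vec return value exists only to make the result observable.
trait Publisher {
    fn publish(&self, value: &str) -> Vec<String>;
}

impl Publisher for ApplicationProfile1 {
    fn publish(&self, value: &str) -> Vec<String> {
        let this = self;
        let mut sent = Vec::new();
        macro_rules! broker_callback {
            ($field_name:ident) => {
                sent.push(this.$field_name.send_to_broker(value))
            };
        }
        // Expands to one statically dispatched call per broker field.
        enumerate_brokers_profile1!(broker_callback);
        sent
    }
}

impl Publisher for ApplicationProfile2 {
    fn publish(&self, value: &str) -> Vec<String> {
        let this = self;
        let mut sent = Vec::new();
        macro_rules! broker_callback {
            ($field_name:ident) => {
                sent.push(this.$field_name.send_to_broker(value))
            };
        }
        enumerate_brokers_profile2!(broker_callback);
        sent
    }
}
```

Neither publish implementation names a broker type or uses a trait object: each expanded call is an ordinary method call on a concrete field type, and the two profiles differ only in which enumerator they invoke.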
We also demonstrated several important features typically expected from mature DI systems:

- Modular component composition through context extensions
- Controlled initialization sequences
- One-to-many polymorphism for components such as brokers
- Configurable application profiles

And all of this without introducing runtime cost or dynamic dispatch.

The result is a system where flexibility and performance are not in conflict. Rust’s type system and macro system allow us to design architectures that remain fully decoupled, while still producing simple, predictable, zero-cost binaries.

This raises an interesting conclusion. Rust may never have a DI framework that looks like Spring Boot, and it probably shouldn’t. But Rust does allow dependency injection to exist in a different form, one that embraces the language’s philosophy: compile-time guarantees, explicit composition, and zero-cost abstractions.

Future Directions

The examples in this article intentionally keep the framework small in order to focus on the core ideas. However, a production-ready system would likely evolve further.

For example, initialization often requires explicit ordering between components, where some services must be initialized before others. The current example also contains a fair amount of boilerplate, which could be significantly reduced with a more advanced procedural macro design. Heavier use of derive and attribute macros could also improve IDE code completion and developer ergonomics while keeping the system fully type-safe.

Beyond the core container mechanics, several practical features naturally follow from this model: improved testing support, built-in mechanisms for mocking and stubbing components, and the ability to override components in derived profiles, a common requirement when building test environments or specialized deployments.

Finally, dependency injection frameworks rarely exist in isolation.
Systems such as Spring Boot succeeded not only because of their DI container, but because they provided a standard foundation for an ecosystem of reusable modules. A similar approach in Rust could allow libraries to integrate around a shared compile-time DI model, enabling a broader ecosystem of interoperable components while preserving Rust’s philosophy of explicit composition and zero-cost abstractions.

AI-Augmented React Development: How I Rebuilt My Workflow Without Losing Control of the Code

Every React developer reaches a point where the sheer volume of boilerplate starts to slow them down. Prop drilling, repetitive hook patterns, component scaffolding, unit test setup — the cognitive overhead adds up fast, especially at enterprise scale. When GitHub Copilot entered my workflow, I expected a productivity boost. What I didn't expect was how much I'd have to think about using it correctly. After integrating AI-assisted development into a React 18 codebase — spanning custom hooks, context-based state management, and accessibility-driven UI — I came away with a clear picture of where AI genuinely accelerates the work, where it quietly introduces risk, and what guardrails every team needs before they ship AI-assisted code to production. This isn't a tutorial on setting up Copilot. It's an honest account of what changed in my day-to-day React workflow, and how I rebuilt my development process around the strengths of AI without surrendering architectural judgment. Where AI Actually Accelerates React Development 1. Component Scaffolding The most immediate win was generating boilerplate-heavy component shells. React functional components follow a predictable structure: imports, props interface, state declarations, effect hooks, render return. Copilot autocompletes this structure accurately and fast, especially when your file already has consistent patterns. For example, starting a new form component with a comment like: Plain Text // Controlled form component with validation and submit handler … triggers a usable scaffold within seconds. In a codebase with 50+ form components, this adds up to meaningful time savings. 2. TypeScript Prop Typing One of the most tedious parts of React 18 development is defining interface types for component props — especially for components consuming API response shapes. Copilot handles this well when the API shape is already defined elsewhere in the file or project. 
It infers prop types from usage context and generates clean interfaces without much guidance.

3. Unit Test Generation

Copilot shines at generating @testing-library/react test cases for presentational components. Given a component file, it can suggest:

- Render tests
- User interaction tests (click, input change)
- Accessibility checks using getByRole

This reduced the time I spent on repetitive test scaffolding by roughly 40% for simple components.

4. Repetitive Hook Patterns

Standard hooks like useEffect with cleanup, useCallback with dependency arrays, and useMemo for expensive computations follow well-known patterns. Copilot autocompletes these reliably, and the suggestions are often correct on the first try when the surrounding context is clear.

Where AI Fails React Developers (and Why It Matters)

This is the part most AI-workflow articles skip. In my experience, Copilot introduced subtle issues in three specific areas:

1. State Management Architecture

Copilot is pattern-matching, not reasoning. When I was designing a context-based global state solution for a multi-step form flow, Copilot consistently suggested patterns that worked for isolated examples but didn't scale: it created redundant useContext calls across components that should have been wrapped in a provider, and it failed to account for re-render performance implications.

The lesson: Never accept AI suggestions for state architecture without reviewing the component tree. AI optimizes locally; architecture requires global thinking.

2. Custom Hook Dependency Arrays

Incorrect dependency arrays in useEffect and useCallback are a well-known React footgun. Copilot's suggestions here were hit-or-miss. It occasionally omitted dependencies that needed to be included, which leaves the hook working with stale values, and at other times it included unnecessary ones that triggered extra re-renders.

I started treating all AI-generated dependency arrays as drafts that required manual review against the ESLint react-hooks/exhaustive-deps rule. This step is non-negotiable.

3.
Accessibility in JSX

This one is subtle. Copilot generates functional JSX, but accessible JSX requires deliberate attention to ARIA roles, focus management, and semantic HTML. AI-generated components often defaulted to div-heavy markup without the aria-* attributes or keyboard event handlers that production apps require.

For any component touching user interaction (modals, dropdowns, form controls), I reviewed AI-generated output against WCAG 2.1 AA standards before committing.

My Rebuilt Workflow: A Practical Stack

After months of iteration, here's the workflow that works:

Phase 1: Design First, Prompt Second

Before I open a new file, I sketch the component's responsibilities on paper or in a comment block:

JavaScript

    /**
     * UserProfileCard
     * - Displays user avatar, name, role
     * - Supports edit mode toggle
     * - Emits onSave callback with updated values
     * - Must be keyboard accessible
     */

This comment becomes the Copilot context. The more specific the intent, the better the scaffold.

Phase 2: Accept Scaffolding, Write Logic

I accept Copilot suggestions for:

- Component shell
- Prop interface
- State variable declarations
- JSX structure for simple layouts

I write manually:

- useEffect logic and cleanup
- Event handler implementations
- Context provider design
- Error boundaries
- Any business logic touching API data

Phase 3: Review AI-Generated Tests

Copilot generates test scaffolding well.
I review every generated test for:

- Correct use of userEvent vs fireEvent
- Accurate assertions (not just "it rendered")
- Missing edge cases (empty state, error state, loading state)

Phase 4: Accessibility Audit Pass

Every component gets a final pass against:

- Semantic HTML element usage
- aria-label / aria-describedby for interactive elements
- Keyboard navigation (tab order, focus trap for modals)
- Color contrast (handled at design system level, not component level)

A Real Before-and-After Example

Before (pre-AI workflow): A controlled input component with validation took roughly 25–30 minutes to scaffold, type, test, and review.

After (AI-augmented workflow): The same component takes 10–12 minutes, with Copilot handling the initial scaffold and test shell, and me handling the validation logic, hook dependencies, and accessibility pass.

Here's a simplified example of the kind of component where AI delivers the most value:

TypeScript

    interface SearchInputProps {
      value: string;
      onChange: (value: string) => void;
      onSubmit: () => void;
      placeholder?: string;
      isLoading?: boolean;
    }

    const SearchInput: React.FC<SearchInputProps> = ({
      value,
      onChange,
      onSubmit,
      placeholder = "Search...",
      isLoading = false,
    }) => {
      const handleKeyDown = (e: React.KeyboardEvent<HTMLInputElement>) => {
        if (e.key === "Enter") onSubmit();
      };

      return (
        <div role="search">
          <input
            type="search"
            value={value}
            onChange={(e) => onChange(e.target.value)}
            onKeyDown={handleKeyDown}
            placeholder={placeholder}
            aria-label="Search"
            disabled={isLoading}
          />
          <button onClick={onSubmit} disabled={isLoading} aria-label="Submit search">
            {isLoading ? "Searching..." : "Search"}
          </button>
        </div>
      );
    };

The scaffold, prop interface, and JSX structure above were AI-generated in under 30 seconds. The aria-label attributes, role="search", and handleKeyDown implementation were my additions: things Copilot consistently missed in initial suggestions.
Where AI Hits a Wall: Large-Scale Enterprise React Projects Small, isolated components are where AI shines. But real enterprise codebases are rarely small or isolated. Once you're working inside a large monorepo with hundreds of components, shared design systems, domain-specific business logic, and cross-team API contracts, AI-assisted development runs into a fundamental limitation: it only sees what's in its context window. Here's where that breaks down in practice: 1. Cross-File Dependency Awareness In a large React application, a single component may depend on a shared context provider defined four directories away, a utility hook maintained by a different team, and a TypeScript type exported from a core domain package. Copilot's autocomplete works within the file you're editing — it doesn't have a deep understanding of the full dependency graph. The result: AI-generated code that compiles locally but breaks at integration because it assumes a prop shape, import path, or context value that doesn't match what actually exists in the broader system. I've seen this surface most often with shared form validation schemas and API response types that live outside the component's immediate file tree. 2. Institutional Knowledge and Business Logic Enterprise React codebases carry years of intentional decisions that aren't documented anywhere in the code — they live in the heads of the team. Why is this particular component wrapped in a custom error boundary? Why does this dropdown use a local state copy instead of reading directly from context? Why is this API called twice? Copilot has no way of knowing. When it generates code in these areas, it produces something that looks reasonable but violates the implicit contract the team has built over time. Catching these violations requires a senior developer who understands the why behind the existing patterns — AI cannot substitute for that. 3. 
Design System Consistency at Scale Large teams typically maintain a shared component library — think an internal fork of Material UI or a custom design system. AI tools don't know which internal components to reach for. Copilot frequently suggests raw HTML elements or third-party components when the project has established internal equivalents: <Button> from your design system instead of <button>, <TextInput> from your library instead of a raw <input>. At scale, this creates design debt fast. Every AI-generated component that uses a raw HTML element instead of the design system equivalent is a component that diverges from your visual and behavioral standards — and accumulates technical debt that's expensive to audit later. 4. Performance Optimization in Complex Component Trees React 18 introduced useDeferredValue, useTransition, and concurrent rendering features specifically to handle performance in large, deeply nested component trees. These are nuanced APIs — their correct usage depends on understanding the rendering priority of specific subtrees, which operations are expensive, and what the user experience should be during transitions. Copilot-generated code in this area is almost always naive. It doesn't know that a particular list component renders 500+ items and needs virtualization. It doesn't know that a specific state update should be wrapped in startTransition to keep the UI responsive. Optimizing a large React application for performance remains deeply human work. 5. Multi-Team Merge Conflicts and Shared State In enterprise projects with multiple teams contributing to the same React codebase, shared state management becomes politically and technically complex. Redux slices, Zustand stores, or React Query caches span team boundaries. AI tools can suggest changes to these shared structures without awareness of how other teams depend on them — leading to breakages that only surface in integration environments. 
The practical takeaway: the larger and more interconnected the codebase, the more you need to treat AI as a localized assistant, not a system-aware collaborator. Use it to accelerate work on leaf-node components and isolated utilities. Treat any AI suggestion that touches shared state, cross-team APIs, or core infrastructure with the same scrutiny you'd give an external contributor who just joined the project. If you're introducing AI-assisted development into a React team, here are the non-negotiables: 1. Never merge AI-generated code without lint and type checks passing. Run eslint, tsc --noEmit, and your test suite before treating any AI-generated file as complete. 2. Establish a "no AI for architecture" rule. Component tree design, context structure, routing decisions, and data fetching strategy should be human-driven. AI is a code accelerator, not an architect. 3. Code review AI-generated PRs with extra scrutiny. Reviewers should specifically look for: missing hook dependencies, over-broad useEffect triggers, missing accessibility attributes, and logic that "looks right" but doesn't account for edge cases. 4. Document what AI touched. Some teams are beginning to tag AI-assisted code in commit messages or comments. This creates accountability and helps reviewers calibrate their scrutiny. 5. Keep your feedback loop active. When Copilot generates something wrong, reject it explicitly rather than accepting and editing. This helps calibrate your own pattern recognition for what AI does and doesn't handle well. What's Coming Next: Agentic React Workflows The current state of AI in React development is assistive — it completes what you start. The next wave is agentic: AI agents that can take a design spec or Figma export, scaffold an entire component hierarchy, wire up state, and generate test coverage — with a human reviewing the output rather than writing it line by line. 
Early tools like Cursor's Composer mode and experimental GitHub Copilot Workspace are beginning to move in this direction. For React developers, the implication is a shift in the skill that matters most: from writing components quickly to reviewing and evaluating AI-generated component systems critically. The developers who will thrive in this environment are those who deeply understand React's rendering model, state management tradeoffs, and accessibility requirements — not because they're writing every line, but because they're the final judgment layer on what ships. Conclusion AI-augmented development isn't about replacing React expertise — it's about redirecting it. The hours saved on scaffolding and boilerplate are hours you can reinvest in architecture, performance, accessibility, and code quality. The key insight from rebuilding my workflow around GitHub Copilot is this: AI is a force multiplier for what you already know well. If you understand React deeply, it makes you faster. If you're still learning React's mental model, it can quietly introduce patterns that seem right but aren't. Used with clear guardrails and deliberate review habits, AI turns a good React developer into a significantly more productive one — without sacrificing the code quality that enterprise applications demand.

By Sathwik Nagulapati

Text Summarization With OpenAI and Ruby on Rails

Modern applications deal with massive amounts of text: support tickets, CRM notes, blog posts, meeting transcripts, and internal documentation. The problem is no longer access to information; it is how quickly users can understand it.

In our CRM system, we allow publishing long-form articles to a blog. However, users rarely want to read everything up front. To solve this, we introduced AI-powered summarization to generate short, readable previews. This improves:

- Content scannability
- User engagement
- Time-to-information

This article covers:

- What text summarization is
- Why AI summarization is powerful
- Setting up OpenAI in Rails
- Implementing a summarization service
- Building a controller endpoint
- Handling long documents
- Background processing with Sidekiq
- Real-world use cases

What Is Text Summarization?

Text summarization is the process of condensing a large body of text into a shorter version while preserving its key information. There are two main approaches:

1. Extractive Summarization

This selects the most important sentences directly from the original text. Example:

Original: Ruby on Rails is a powerful web framework designed to make programming easier by favoring convention over configuration.

Summary: Ruby on Rails is a web framework that simplifies development.

2. Abstractive Summarization

This generates new sentences that capture the meaning of the text. This is where large language models, such as OpenAI's GPT models, shine.

Why Use LLMs for Summarization?

Traditional NLP methods struggle with context and nuance. OpenAI models can provide:

- Contextual understanding
- Multi-paragraph reasoning
- Domain adaptability
- Natural-sounding summaries

This makes them well suited to projects that handle many kinds of text:

- Blog posts
- Documents
- Meeting transcripts
- Customer feedback
- Knowledge bases

Setting Up OpenAI in a Rails Project

1. Install the OpenAI Ruby gem. Add it to your Gemfile:

Ruby

    gem "ruby-openai"

2. Configure the API key.
Add your API key to environment variables:

Shell

    export OPENAI_API_KEY="your_api_key"

3. Example initializer:

Ruby

    OpenAI.configure do |config|
      config.access_token = ENV["OPENAI_API_KEY"]
    end

Creating a Summarization Service

In Rails, the best practice is to encapsulate OpenAI logic in a separate service object. Simple example:

Ruby

    module Openai
      class Summarizer
        def initialize(text)
          @text = text
          @client = OpenAI::Client.new
        end

        def call
          response = @client.chat(
            parameters: {
              model: "gpt-4.1-mini", # or another chat model
              messages: [
                {
                  role: "system",
                  # A more detailed system prompt can go here.
                  content: "You are a helpful assistant that summarizes text concisely."
                },
                {
                  role: "user",
                  # Describe the expected response format here if needed.
                  content: "Summarize the following text:\n\n#{@text}"
                }
              ],
              temperature: 0.3
            }
          )

          # Pull the generated text out of the first choice.
          response.dig("choices", 0, "message", "content")
        end
      end
    end

Possible roles:

Plain Text

    Role        Purpose
    system      instructions for the model
    user        input from the user
    assistant   previous AI responses

Usage:

Ruby

    summary = Openai::Summarizer.new(article.content).call

Example: Before and After

Input (CRM article excerpt): Our platform allows teams to manage projects, track time, and generate reports across multiple departments...

Output (AI summary):

- Centralized platform for project and time tracking
- Supports multi-department workflows
- Provides reporting and analytics tools

This summary can be shown in:

- Blog preview cards
- Tooltips
- Search results

Example Controller Endpoint

Now we expose this functionality via an API endpoint. Controller example:

Ruby

    class Api::SummariesController < ApplicationController
      def create
        # params.require raises (and Rails responds with 400) when text is missing.
        text = params.require(:text)
        summary = Openai::Summarizer.new(text).call
        render json: { summary: summary }
      end
    end

Temperature, set to 0.3 in the service above, controls randomness in the output.

Plain Text

    0.0 → deterministic
    1.0 → very creative

Additional Useful Parameters

A few additional parameters can improve summarization.

1.
max_tokens limits the size of the generated response. Example:

Ruby

    max_tokens: 200 # prevents extremely long outputs

2. top_p is an alternative randomness control. Example:

Ruby

    top_p: 0.9 # usually you adjust temperature or top_p, not both

3. frequency_penalty discourages repeated phrases. Example:

Ruby

    frequency_penalty: 0.2 # useful when summaries become repetitive

4. presence_penalty encourages introducing new ideas. Example:

Ruby

    presence_penalty: 0.1 # rarely needed for summarization, but useful in specific tasks

Prompt Engineering for Better Summaries and Why It Is Important

Prompt design significantly affects output quality. Instead of a generic prompt:

Plain Text

    "Summarize this text"

Use structured instructions:

Plain Text

    "Summarize the following text in 3 bullet points. Focus on the key ideas and avoid unnecessary details."

This simple change improves the clarity, consistency, and usefulness of the generated summary. In practice, prompt design becomes even more important when working with different types of content, such as technical documentation or CRM notes. I cover prompt engineering with practical examples in more detail in another article.

Example of a more structured prompt:

Ruby

    {
      role: "user",
      content: <<~PROMPT
        Summarize the following article in 5 bullet points.

        #{@text}
      PROMPT
    }

Handling Very Long Documents

LLMs have token limits, so large texts must be processed in chunks. A typical approach looks like this:

1. Split text into chunks
2. Summarize each chunk
3. Combine summaries
4. Generate a final summary

Avoid Naive Chunking

This is not ideal:

Ruby

    text.scan(/.{1,3000}/m)

It splits at a fixed character count, so it may cut sentences in half.
Prefer:

- Splitting by paragraphs
- Splitting by sentence boundaries

Text Chunking

Ruby

    class TextChunker
      # Splits on blank lines so every chunk is a whole paragraph.
      def self.chunk(text)
        text.split("\n\n")
      end
    end

Chunk Summarization

Ruby

    def summarize_long_text(text)
      chunks = TextChunker.chunk(text)

      # Summarize each chunk on its own.
      partial_summaries = chunks.map do |chunk|
        Openai::Summarizer.new(chunk).call
      end

      # Summarize the combined partial summaries into the final result.
      Openai::Summarizer.new(partial_summaries.join("\n")).call
    end

Using Background Jobs for Summarization

Summarizing large text can take some time, so it’s better to process it asynchronously. Example of our service usage with Sidekiq:

Generate Summary Job

Ruby

    class GenerateSummaryJob
      include Sidekiq::Job

      def perform(article_id)
        article = Article.find(article_id)
        summary = Openai::Summarizer.new(article.content).call
        article.update!(summary: summary)
      end
    end

Error Handling

Always assume external APIs can fail.

Ruby

    def call
      # ... the chat request shown earlier ...
    rescue StandardError => e
      Rails.logger.error(e.message)
      fallback_summary
    end

Also consider:

- Retries
- Timeouts
- Monitoring

Cost Optimization

When you use AI features in production, cost management becomes critical. Cost depends primarily on token usage: the more text you send and receive, the more you pay. Some tips for cost optimization:

1. Limit Input Size

The most effective optimization is reducing the amount of text sent to the AI model. Instead of summarizing an entire document, you can:

- Extract relevant sections
- Summarize those sections only

Example of filtering before sending to OpenAI:

Ruby

    class TextPreprocessor
      MAX_LENGTH = 5000

      # Trims whitespace and truncates to MAX_LENGTH characters.
      def self.clean(text)
        text.strip[0...MAX_LENGTH]
      end
    end

Usage:

Ruby

    clean_text = TextPreprocessor.clean(article.content)
    summary = Openai::Summarizer.new(clean_text).call

This ensures you never send extremely large inputs.

2. Choose the Right Model

Not every task requires the most powerful model. For summarization, smaller models often perform well.
Example:

Ruby

    model: "gpt-4.1-mini"

Advantages:

- Much cheaper
- Faster responses
- Good summarization quality

So, use larger models only for complex reasoning tasks.

3. Token Counting Before Requests

Sometimes the text is larger than expected. A size check before the request helps prevent sending oversized prompts. Example:

Ruby

    # Character count is only a rough proxy for token count.
    def too_large?(text)
      text.length > 12000
    end

If the text is too large:

- Split it into smaller chunks
- Summarize it chunk by chunk

Conclusion

Throughout this article, we built a summarization pipeline in Rails using a clean service-oriented approach:

- Simple summarization service
- Prompt optimization
- Chunking for large documents
- Background processing with Sidekiq
- Cost and reliability improvements
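As a follow-up to the chunking and size-check steps above, here is a minimal sketch of one way to combine them: pack whole paragraphs into chunks that stay under a character budget, so no sentence is cut in half and no chunk is needlessly small. BudgetedChunker, its max_chars parameter, and the 3000-character default are hypothetical names and values chosen for illustration; they are not part of the article's code or of any library.

```ruby
# Hypothetical helper (not from the article): groups whole paragraphs into
# chunks whose length stays under a character budget.
class BudgetedChunker
  DEFAULT_MAX_CHARS = 3000 # assumed budget; tune it for your model's limits

  def self.chunk(text, max_chars: DEFAULT_MAX_CHARS)
    # Split on blank lines and drop empty paragraphs.
    paragraphs = text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
    chunks = []
    current = +""

    paragraphs.each do |paragraph|
      # Close the current chunk if this paragraph (plus separator) would not fit.
      if !current.empty? && current.length + paragraph.length + 2 > max_chars
        chunks << current
        current = +""
      end
      current << "\n\n" unless current.empty?
      current << paragraph
    end

    # A single paragraph longer than max_chars still becomes its own
    # (oversized) chunk; splitting it by sentence is left out of this sketch.
    chunks << current unless current.empty?
    chunks
  end
end
```

The resulting chunks could be passed to summarize_long_text in place of the output of TextChunker.chunk. Packing paragraphs together also reduces the number of API calls compared with summarizing every paragraph separately.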

By Denys Kozlovskyi

The 20 Software Engineering Laws

Most engineers learn these laws the hard way. When you try to rewrite something and it doesn’t deliver, or when a project is already late, adding engineers to the team will just make it fail faster. Sometimes, when you start using a metric to measure progress, the whole team will start trying to manipulate it. Then, six months later, someone mentions a 1975 law that addresses exactly what happened. I paid a price to learn this, too: I spent half my career learning these lessons the hard way, as many others probably did. The twenty laws listed below are the ones I refer to most often, although there are more (more on this later). Software development laws explain what is happening, what is about to happen, and what will not work no matter how hard you try. Some of these laws are sixty years old. They still apply to software development in 2026, and they will still apply in 2036 because they are not really about software. They are about people working together to build things under time pressure (basically, a lot of them are just laws of human nature). These laws are not rules that tell you what to do. They tell you what is already happening, but you still have to make the decisions. These laws just help you understand what is going on. Each of these laws made the list because I have experienced them myself. My book covers all fifty-six laws. If you only have time to remember twenty software development laws, these are the ones that I think are important. In particular, we will talk about the following laws: Gall’s Law: A complex system that works is always built from a simple system that worked first.KISS: Keep it simple. 
Anything beyond that is overhead.Conway’s Law: Organizations design systems that mirror their communication structure.Hyrum’s Law: With enough users, every observable behavior of your API becomes someone’s dependency, no matter what the contract says.CAP Theorem: A distributed system can guarantee only two of: consistency, availability, and partition tolerance.Zawinski’s Law: Every program expands until it can read mail. The ones that cannot are replaced by ones that can.Brooks’s Law: Adding people to a late software project makes it later.Ringelmann Effect: Individual output drops as team size goes up.Price’s Law: Half the work is done by the square root of the people.Dunning-Kruger Effect: The less you know about something, the more confident you tend to be.Hofstadter’s Law: It always takes longer than you expect, even when you account for Hofstadter’s Law.Parkinson’s Law: Work expands to fill the time available.Goodhart’s Law: When a measure becomes a target, it stops being a good measure.Gilb’s Law: Anything you need to quantify can be measured in some way that beats not measuring it.Knuth’s Optimization Principle: Premature optimization is the root of all evil.Amdahl’s Law: The speedup from parallelism is limited by the sequential part.Murphy’s Law: Anything that can go wrong will go wrong.Postel’s Law: Be conservative in what you send, liberal in what you accept.Sturgeon’s Law: 90% of everything is crap.Cunningham’s Law: The fastest way to get the right answer online is to post the wrong one. So, let’s dive in. How Systems Get Built 1. Gall’s Law A complex system that works is always built from a simple system that worked first. Systems do not work as well in real life as they do on paper because many problems do not surface until they hit the real world. These problems only appear when real users interact with systems, and by then, they either work or they do not. Every complex system that works got that way one step at a time. 
The systems that try to be perfect from the start usually fail. This is why most new versions of systems rewritten from scratch do not work out: teams keep all the features they had before, but lose the simple things that made the old systems good. Examples. Let’s take an example of Instagram. At the start, it was something else, but not a picture-sharing platform. The app was called Burbn, and it had: check-ins, gaming, photo sharing, all stuck together. Then, the founders cut everything except photo sharing, and the stripped-down core became the product. Google Wave went the other way. It launched with chat, email, a forum, and a document editor, all at once. Nobody could tell you what it was for, and it was dead in 15 months. 2. KISS (Keep It Simple, Stupid) Keep it simple. Anything beyond that is overhead. The KISS principle is a reminder that simplicity should be our key goal. If you can solve a problem with a 50-line script vs a complex 500-line solution, KISS favors the simpler solution because each line of code has the potential to cause an error. Why is simplicity so important? Software, in general, is complex to build and must be understood by humans. A simple design is much easier to maintain: new team members can get up to speed faster, bugs are easier to localize, and modifications cause fewer ripple effects. The KISS principle encourages developers to resist “clever” code that does too much at once, and to avoid architecting solutions that address future problems at the cost of current complexity. Example. Let’s say that we have a startup that needs a feature-flag system and decide to build a custom solution. They built it as a separate microservice with its own database, cache, admin UI, WebSocket notifications, and A/B testing support. It introduces a lot of complexity and takes a lot of time to build, which, if something goes wrong, can cause a lot of trouble. What they needed was a JSON config file. This would have taken an afternoon. 3. 
Conway’s Law

Organizations design systems that mirror their communication structure.

Your app architecture is already defined, and it is essentially the same as your organization chart. For example, if you have four teams working on a project, you will probably end up with an app that has four parts. If the teams that work on the frontend, the backend, and the data do not communicate, your application will have three parts that do not work well together. If you rewrite your system without changing how your company is organized, you will end up with the same system, just written in a different language.

The other way around works too. You can pick the architecture you want and then create teams that would naturally produce that kind of system. Amazon did this back in the 2000s. They broke their system down into smaller services managed by small teams, which changed how the system and the company worked together. This is called the Inverse Conway Maneuver.

Examples. Many modern AI organizations split research from application engineering. Research then optimizes for benchmarks, while product ships apps to real users. The output is a model that scores well and a product that doesn’t work, because each side is optimizing within its own communication boundary.

The pattern shows up at a small scale, too. A three-person team almost always ships a monolith because the cost of breaking it up is higher than the cost of keeping it together.

4. Hyrum’s Law

With enough users, every observable behavior of your API becomes someone’s dependency, no matter what the contract says.

The interface contract you wrote is not the real contract. The real one is what your system actually does, including the parts you never expected to be important: timing, error message text, key order in JSON responses, the exact bytes of a hash. Someone, somewhere, is depending on all of it. This is why backward compatibility costs so much in mature systems.
This means that you actually don’t maintain the API you designed, but the accidental one. Examples. A good example is the SimCity game. I remember well that it had a use-after-free bug that worked fine on Windows 3.x because memory was never actually reclaimed. Then, Windows 95 reclaimed it, and SimCity crashed. Microsoft shipped Windows 95 with a special memory-allocator mode that was activated only when SimCity was running, so the bug would continue to work. Browsers do this at internet scale. Every quirk that web developers built into the platform effectively becomes part of it. The browser can’t change the quirk without breaking half the web. 5. CAP Theorem A distributed system can guarantee only two of the following: Consistency, Availability, and Partition tolerance. Networks fail. In a distributed system, that's not something you design around. It's something you accept. Once a partition happens, you have to pick: block writes to keep data consistent, or keep serving traffic and let replicas drift. Every distributed database makes this call. Most just don't tell you which one. They hide behind labels like "eventually consistent" or "highly available" and leave you to find out during an incident. Examples. MongoDB favors consistency, meaning that when a partition problem occurs, some MongoDB replicas will not accept any data until the entire system is working properly again. On the other hand, Cassandra will keep answering queries even when the replicas do not agree, and it will later fix the inconsistencies. Neither MongoDB nor Cassandra is wrong. They are just making choices about what your system can afford to lose. 6. Zawinski’s Law Every program expands until it can read mail. The ones that cannot are replaced by ones that can. Feature creep is not something that happens during the process. It is actually the process itself. When a tool is good at what it does, and people like it, they start using it all the time. 
The people in charge of the product want to keep users engaged and on the platform. So the tool begins to take on tasks that are only loosely related to its original job. Over time, the tool becomes slow and carries a lot of unnecessary features. Then a new competitor comes along with a simpler version that does the original job. As that competitor's popularity grows, it accumulates unnecessary features in turn.

Examples. A famous example is Netscape, which started as a browser and ended as a suite with email, news, and a web editor. Firefox came as a fix and stripped it down, got popular, and then added plugins and a developer toolchain. Slack launched to kill email and now has voice, video, bots, and an app directory. This happens when a product lacks the right north star metrics.

How Teams Lose Speed

7. Brooks’s Law

Adding people to a late software project makes it later.

Software work is not easy to split among team members. When you bring someone new onto the project, it takes them a while to get up to speed, which means your experienced people have to stop what they are doing to help the new person learn. If your project is already behind schedule, adding more people won't make it go faster. It will just make things worse.

Frederick P. Brooks said it well: you cannot have a baby in one month just because you have nine pregnant women. Software work is like that too. It does not get done faster just because you have more people working on it.

Example. Once, I was the team lead of eight people, and we were always behind schedule. My first thought was to hire two engineers to help us catch up. But while we were searching for new people, two people left us. Suddenly everything worked better: communication was easier, and we got more done than before. So the solution was to make the team smaller, not bigger.

8. Ringelmann Effect

As teams grow, output per person falls.
When many people pull on the same rope, each person does not pull as hard. Some of this is because it is hard to coordinate smoothly, and some of it is because people assume someone else will do their part. Either way, this pattern is real, and it is more extreme than most people think.

Examples. A large GitHub study measured this directly. Developers on teams of 2-5 people averaged around 1,850 lines of code a month, while on teams of 10 the average dropped to 1,200. At 50 or more, it was 450. Output per person fell by about 75%.

This is why small teams ship faster than big ones, and why Amazon’s two-pizza rule holds true. It’s a defense against Ringelmann. This is especially true in today's AI-driven world, where productive teams have fewer members than before, as AI is driving up personal and team productivity.

9. Price’s Law

Half the work is done by the square root of the people.

In a group of 100 people, about 10 people do half of the work that matters. In a group of 16 people, it is likely that 4 people do half of the work. This holds across creative fields.

The people who do most of the work are really important, but the others are important too, because they do what needs to be done to support everyone else. They make sure everything runs properly (sometimes called glue work). So we need both groups, but the problem is that if the top people in your group leave, the group will lose a lot of its ability to get things done.

Example. When Musk took over Twitter, the company cut its staff by roughly 50%, and the site kept running. Price’s Law predicted that. What the law did not predict was what the layoffs removed: depth in trust and safety, SRE coverage, and incident response. The top performers kept the lights on. The organization lost the ability to handle the next hard problem, and Twitter quietly asked some laid-off people to come back.

Why Plans Drift

10.
Hofstadter’s Law

It always takes longer than you expect, even when you account for Hofstadter’s Law.

Let’s say you need to estimate how long something will take. You think four weeks is a reasonable estimate, but then you remember that your guesses are usually too optimistic, so you double it to eight weeks, just to be sure. In the end, it takes sixteen weeks. Now you think you will do better next time, won't you? You estimate sixteen weeks because that's what happened last time. No: it now takes thirty-two weeks, because things you don’t know about surprise you, such as unplanned integration issues or requirement changes.

In practice, Hofstadter’s Law explains why techniques like padding estimates, awareness of Parkinson’s Law, and the use of historical data are essential, and why surprises still occur even with them.

Example. A good example of Hofstadter’s Law is the Berlin Brandenburg Airport project. The software integration process took much longer than expected, as it involved 75,000 sensors and 50,000 light fittings. The plan was to finish in 18 months, but the team later realized this was not possible and extended the timeline to 30 months. In the end, it took 7 years to complete, with a final cost of €7 billion. This was 2.5x higher than planned, and the airport opened 9 years late.

11. Dunning-Kruger Effect

The less you know about something, the more confident you tend to be.

Here is the uncomfortable part. The skill you need to do something is the same skill you need to judge how well you did it, and this is the problem. People who are not very good at something cannot see what they are doing wrong, so they think they are better at it than they really are. People who are good at it, on the other hand, see all the things they are still getting wrong, so they think they are not as good at it as they really are.

Examples.
When asked when something will be done, new developers often give confident, precise estimates, while experienced developers give ranges (the famous “it depends” answer). The juniors aren’t wrong to be convinced. They simply don’t yet know what they don’t know (unknown-unknowns). People usually get really excited about new technology at first. This is because they have not used it a lot yet. We are seeing this happen with artificial intelligence now. The people who say AI can do anything are usually the ones who do not use it every day, like managers. 12. Parkinson’s Law Work expands to fill the time available. If you give a developer two weeks to do a task that can be done in two days, it will take two weeks to finish. This does not mean the developer is lazy or puts things off. People tend to fill up the time they have. Over the two weeks, the developer will likely spend time making plans, trying things, and adding extra tasks that do not need to be done (gold-plating). But if there was a deadline to have this done in a day, it would probably be done on that day. The thing about Parkinson’s Law is that it says if you give people a certain amount of time to do something, they will probably take all the time to do it. So, teams should set clear and realistic time limits (aka deadline-driven development). However, managers must use it judiciously, combining Parkinson’s insight with realistic scheduling. If you compress timelines too much, you risk running into Hofstadter’s Law, which reminds us that work often still takes longer than expected, even with buffers. Examples. A developer given two months for a one-week task will spend a month prototyping alternatives, another week on architecture debates, and the last three weeks polishing details nobody asked for. If we give the same task, but this time with a clear one-week deadline, it will be shipped in one week. How Metrics Distort Work 13. 
Goodhart’s Law

When a measure becomes a target, it stops being a good measure.

We can measure our work in many different ways, e.g., number of bugs closed, number of incidents, test coverage, or team velocity. When we start judging people's performance by these numbers, they will focus on making the numbers look good instead of actually doing good work. The numbers will go up, but the work will not get any better. Incentives make people do what earns the reward, not what we really want. When we measure the wrong thing, people will do the wrong thing to get ahead.

Examples. I watched a team get rewarded for lines of code written in the early 2000s, and for the number of PRs created some years later. Developers started copy-pasting instead of extracting shared logic. Some created PRs for almost every commit they made.

The modern version is AI tokens consumed per engineer (called tokenmaxxing). More tokens are being treated as a sign of productivity.

14. Gilb’s Law

Anything you need to quantify can be measured in some way that beats not measuring it at all.

Gilb's Law is the other side of the coin to Goodhart’s Law. Looking at Goodhart’s Law, you might conclude that having metrics is bad, but that is not true. Not having any metrics is even worse. If something is important to you, you should try to find a way to measure it, because we cannot improve what we don’t measure (a maxim often attributed to Peter Drucker).

Example. Developer productivity has always been hard to measure. We have had many bad metrics, from lines of code to token consumption. But deployment frequency and change lead time (two of the DORA metrics for DevOps) give you a usable signal as a proxy.

What Breaks Under Load

15. Knuth’s Optimization Principle

Premature optimization is the root of all evil.

Most performance work happens too early and in the wrong place.
Teams optimize code paths that never become hot, introduce complexity they never need, and burn time solving a scale problem they may never earn. So the best approach is to write code that works, then measure its performance. If there is a problem, a profiler will show you where it is. If not, just move on.

Examples. I once worked at a startup where we spent a lot of time setting up Kubernetes. We did it to handle millions of users, and we didn’t even have 10 users yet. We were preparing our infrastructure for a load that didn’t exist while our product features were not even finished. One of my colleagues said that we should make sure 100 people even want our product before we worry about handling millions of users. He was right. We still launched late.

16. Amdahl’s Law

The speedup from parallelism is limited by the sequential part.

If 10% of your work has to be done sequentially, the work will go at most 10x faster, no matter how many computers you use. If 50% of the work has to be done sequentially, it will go at most twice as fast.

The same thing happens with people. If one group has to approve every decision about how something is built, that limits how fast your team can work, no matter how many engineers you have. If you add engineers, but they all have to wait for the same approvers, the queue just gets longer. Your team will only go as fast as the group making the decisions.

Examples. Scaling web traffic by adding more app servers helps until every request hits one shared database or authentication service. After that, adding more servers doesn’t help. The conversation about AI productivity is hitting the same ceiling now.
AI makes coding faster, but you still have to think, review, fix errors, and coordinate with other people, and those steps can’t be done in parallel. This sets the limit on how much you can gain in the end. That’s why some engineers see their work speed up by 10 times, and others see a 1.2 times increase.

17. Murphy’s Law

Anything that can go wrong will go wrong.

In software, Murphy’s Law is often mentioned to explain bugs and production incidents: whatever can go wrong in code (a null pointer, a race condition, a network outage) will eventually manifest, especially in large user bases or at the worst possible time (Friday evening).

In practice, this law encourages developers to write more defensive code. This means checking for nulls, handling exceptions, validating inputs, and failing gracefully when errors occur. It also reminds DevOps teams to anticipate failures by implementing monitoring, enabling rollbacks, and maintaining contingency plans.

Example. On July 19, 2024, CrowdStrike pushed a faulty configuration update to its Falcon sensor. The update triggered a memory error on Windows machines, and about 8.5 million of them crashed to a blue screen. Because the affected machines could not start up, someone had to boot each one into recovery mode and apply the fix by hand; in most cases this could not be done remotely. It happened in the early hours of a Friday, when many IT teams were not at work, and it disrupted airlines, hospitals, and banks. Everything that could go wrong did go wrong that day, just like Murphy’s Law says.

18. Postel’s Law

Be conservative in what you send, liberal in what you accept.

Applied to HTTP, this means that when your server sends responses, it should format headers exactly per spec. But if it receives a request with an uncommon header order or an unusual format, it should still process it rather than drop the connection, as long as it can interpret it safely. Browsers do this at scale.
Most of the HTML on the web is not valid, but modern browsers still render it. If they were strict, half the internet would not display. But there is a caveat. Being too liberal has a cost: if everyone accepts anything, problems never get corrected and the mess just grows. In security-sensitive code, tolerating malformed input can give attackers an opening. So the basic idea still holds, but you need to use judgment: being tolerant is not the same as accepting everything.

Example. In APIs, say your service expects a timestamp. If it receives a timestamp without a time zone, instead of rejecting it, you might assume UTC or try to parse it anyway, being liberal in acceptance. But when your service returns data, you always include the time zone, so the output is conservative and precise.

How to Judge Better

19. Sturgeon’s Law

90% of everything is crap.

Most things we make will go unused, and most of the code we write is not good. Most projects we start do not deliver the value that we thought they would. This is not a bad thing per se; it is how things are when we are trying to create something new. If we pretend everything is great, we will treat every project the same, which makes things needlessly complicated. The projects that really matter are roughly 10% of the total. Finding them and getting rid of the rest is what really takes skill.

Example. WordPress has roughly 57,000 plugins in its directory. Over 34,000 haven’t been updated in the past 2 years, and nearly 19% have zero active installs. A small number of well-maintained plugins powers 40%+ of the public web. That distribution is Sturgeon’s Law in one screenshot.

20. Cunningham’s Law

The fastest way to get the right answer online is to post the wrong one.

When you ask a question on an online forum, you often get no response. If you post something that is clearly incorrect, people will jump in to correct you.
They might walk past a question, but they cannot help themselves when they see something that is wrong. You can use this to your advantage. If you are stuck on something, do not ask how you should do it. Instead, propose a solution you know is not very good, or share a draft, and then see what happens. The right answer might come to you without you even asking for it.

Note that this trick only works when the people around you know what they are talking about. If everyone in the group is just as confused as you are, a wrong answer can cause more harm than good: it can become something that people start to believe.

Example. The whole bet of wikis, and later Wikipedia, runs on this insight. People correct errors faster than they write articles from scratch. The bet paid off at civilization scale.

Conclusion

In this article, I shared some of the most impactful laws I have seen in my career. You do not have to memorize all of them. The top five or six laws will help you solve most of your issues. The rest are there for when a new problem arises.

What is more important is knowing when a law applies and when it does not. These twenty laws often conflict with each other. Knuth says do not optimize early. Amdahl says find and fix the part of your project that is slowing everything down. Both are correct at times. The key is to know which one to use now.

Also, this list is my list. Your list will be different. The laws that have caused you problems will matter more to you than the ones that have not. Over time, you will add your own laws. Write them down when you notice them. One line per project, incident, or rewrite. Which law helped you? Which law gave you advice? What changed? Your personal list will be more helpful to you than any list I can give you.

Frameworks, platforms, and deployment models have changed since Brooks wrote his book in 1975. These laws have not changed.
They describe the one thing that has not changed: humans building things together under constraints they do not yet fully understand. That is why they are worth learning before the project, not after it causes problems.
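As a worked check of the Amdahl's Law figures quoted under law 16, here is a minimal Python sketch of the usual formula, speedup = 1 / (s + (1 - s) / N), where s is the sequential fraction and N the number of workers. The function names are illustrative and not from the article.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Speedup with `workers` parallel workers when `serial_fraction` of the work is sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    for s in (0.1, 0.5):
        for n in (10, 100, 1000):
            print(f"serial={s:.0%} workers={n:>4} speedup={amdahl_speedup(s, n):.2f}x")
        print(f"serial={s:.0%} limit={amdahl_limit(s):.0f}x")
```

With a 10% sequential part the speedup approaches 10x but never reaches it, which is why the bound reads "at most 10x" and not "10x".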

By Milan Milanovic


Fine-Tuning LLMs at Scale With Databricks MLflow and Spark

Why Fine-Tune on Databricks?

General-purpose LLMs like Llama 3, Mistral, or Falcon are impressive out of the box, but they underperform on domain-specific tasks: medical coding, legal clause extraction, internal support ticket classification, and financial report summarization. Fine-tuning adapts a pre-trained model's weights to your domain using your proprietary labeled data.

Doing this at scale introduces real engineering challenges:

- Training data lives in Delta Lake across dozens of tables
- GPU clusters need to be orchestrated, not hand-managed
- Experiment tracking must be reproducible and auditable
- Models need a promotion workflow before they touch production traffic

Databricks addresses these in one platform:

- Apache Spark for large-scale data preparation
- MLflow (built-in) for experiment tracking, model registry, and lineage
- Databricks Model Serving for one-click deployment with auto-scaling
- Unity Catalog for governed model and data access

The ML Lifecycle Architecture

Training Pipeline: End-to-End Flow

The flow below shows how a single training run moves through the system, from a triggered job to a promoted model alias.
Environment Setup

```python
# Databricks Runtime ML 14.x+ recommended (ships CUDA, PyTorch, Transformers)
# Install additional packages in your cluster init script or notebook
%pip install \
    transformers==4.40.0 \
    peft==0.10.0 \
    trl==0.8.6 \
    accelerate==0.29.3 \
    horovod[spark]==0.28.1 \
    datasets==2.19.0 \
    evaluate==0.4.1 \
    --quiet
dbutils.library.restartPython()

# Run the rest in a new cell: restartPython() resets the interpreter state
import os
import mlflow
import mlflow.transformers
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from pyspark.sql import functions as F
from datasets import Dataset

# ── MLflow setup ──────────────────────────────────────────────────────────────
# On Databricks, the MLflow tracking URI is pre-configured to the workspace
# mlflow.set_tracking_uri("databricks")  # uncomment for external clusters
# Register models in Unity Catalog (harmless where this is already the default)
mlflow.set_registry_uri("databricks-uc")

EXPERIMENT_NAME = "/Users/[email protected]/llm-finetuning/support-classifier"
mlflow.set_experiment(EXPERIMENT_NAME)

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
CATALOG = "prod"
GOLD_DB = f"{CATALOG}.gold"
MODEL_NAME = f"{CATALOG}.ml.support_intent_classifier"  # Unity Catalog model path

print(f"GPU available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
```

Preparing Training Data With Spark

Spark handles the heavy lifting before training: filtering noisy records, formatting prompt-response pairs, and splitting the dataset. This stage runs on the CPU cluster; GPU nodes only spin up for the actual training job.

```python
# ── Spark Data Preparation ────────────────────────────────────────────────────

def build_prompt(message, intent_label):
    """
    Format a support conversation into an instruction-following prompt.
    Uses the Mistral instruct template: [INST] ... [/INST]
    """
    return f"[INST] Classify the intent of this support message:\n\n{message} [/INST] {intent_label}"

# Load from Delta Gold table
raw_df = (
    spark.table(f"{GOLD_DB}.support_conversations")
    .filter(F.col("quality_score") >= 0.85)   # keep high-quality labels only
    .filter(F.col("intent_label").isNotNull())
    .filter(F.length("message") > 20)          # filter empty/stub messages
    .filter(F.length("message") < 2048)        # filter messages too long to tokenize
    .dropDuplicates(["message_hash"])          # remove exact duplicates
    .select("message", "intent_label", "created_date")
    .limit(500_000)                            # cap for this training run
)

print(f"Training candidates: {raw_df.count():,}")

# Build prompt strings with a Spark UDF, parallelized across all workers
prompt_udf = F.udf(build_prompt, returnType="string")

prepared_df = (
    raw_df
    .withColumn("prompt", prompt_udf(F.col("message"), F.col("intent_label")))
    # Whitespace word count is only a rough proxy; real token counts run higher
    .withColumn("token_count", F.size(F.split(F.col("prompt"), r"\s+")))
    .filter(F.col("token_count") < 512)        # stay within model context
    .select("prompt", "token_count", "created_date")
)

# Random 80/10/10 split (seeded, but not stratified by intent_label).
# Note: limit() without an ordering is not deterministic, so persist the
# splits once and read them back rather than recomputing them.
train_df, val_df, test_df = prepared_df.randomSplit([0.80, 0.10, 0.10], seed=42)

# Persist splits to Delta for lineage + reproducibility
train_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_train_split")
val_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_val_split")
test_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_test_split")

print(f"Train: {train_df.count():,} | Val: {val_df.count():,} | Test: {test_df.count():,}")
```

Fine-Tuning With Hugging Face + MLflow Tracking

We use LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that freezes the base model and only trains a small set of adapter matrices.
This cuts GPU memory requirements by ~70% compared to full fine-tuning, making 7B parameter models trainable on a single A100. The code below additionally loads the base model in 4-bit precision (QLoRA).

```python
# ── QLoRA Fine-Tuning with MLflow Tracking ───────────────────────────────────

# Convert Spark DataFrame to Hugging Face Dataset
train_pd = spark.table(f"{GOLD_DB}.llm_train_split").select("prompt").toPandas()
val_pd = spark.table(f"{GOLD_DB}.llm_val_split").select("prompt").toPandas()

hf_train = Dataset.from_pandas(train_pd)
hf_val = Dataset.from_pandas(val_pd)

# Load tokenizer; Mistral has no pad token, so reuse EOS
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(
        batch["prompt"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

hf_train_tok = hf_train.map(tokenize, batched=True, remove_columns=["prompt"])
hf_val_tok = hf_val.map(tokenize, batched=True, remove_columns=["prompt"])

# Load base model in 4-bit quantization (QLoRA).
# Requires the bitsandbytes package on the cluster.
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare the quantized model for training (casts norms, enables input grads)
base_model = prepare_model_for_kbit_training(base_model)

# Apply LoRA adapter config
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank: higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention layers to adapt
    bias="none",
)

model = get_peft_model(base_model, lora_config)
# Prints trainable vs. total parameter counts; with LoRA the trainable share
# is typically well under 1% (exact numbers depend on rank and quantization)
model.print_trainable_parameters()

# Training arguments
training_args = TrainingArguments(
    output_dir="/dbfs/tmp/llm-finetune/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,         # effective batch size = 32 on one GPU
    warmup_ratio=0.03,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                             # use bfloat16 on A100/H100
    logging_steps=50,
    evaluation_strategy="steps",           # named eval_strategy from transformers 4.41
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="mlflow",                    # pipe all metrics to MLflow automatically
)

# mlm=False: causal LM objective, labels are copied from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# ── MLflow Run ────────────────────────────────────────────────────────────────
with mlflow.start_run(run_name="mistral-7b-lora-v1") as run:

    # Log hyperparameters manually for full auditability
    mlflow.log_params({
        "base_model": BASE_MODEL,
        "lora_rank": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "lora_dropout": lora_config.lora_dropout,
        "target_modules": str(lora_config.target_modules),
        "quantization": "4-bit QLoRA (nf4)",
        "train_samples": len(hf_train_tok),
        "val_samples": len(hf_val_tok),
        "epochs": training_args.num_train_epochs,
        "effective_batch": training_args.per_device_train_batch_size
                           * training_args.gradient_accumulation_steps,
        "learning_rate": training_args.learning_rate,
    })

    # Train; metrics are auto-logged to MLflow via report_to="mlflow"
    trainer.train()

    # Log final eval metrics explicitly
    eval_results = trainer.evaluate()
    mlflow.log_metrics({
        "final_eval_loss": eval_results["eval_loss"],
        "final_eval_perplexity": torch.exp(torch.tensor(eval_results["eval_loss"])).item(),
    })

    # Log the model + tokenizer as a single MLflow artifact
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        artifact_path="model",
        task="text-generation",
        registered_model_name=MODEL_NAME,   # auto-registers to Unity Catalog
        metadata={"base_model": BASE_MODEL, "finetuning": "QLoRA"},
    )

    run_id = run.info.run_id
    print(f"Run ID: {run_id}")
    print(f"Eval Loss: {eval_results['eval_loss']:.4f}")
```
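To make the "small set of adapter matrices" concrete, the arithmetic behind the LoRA settings above can be sketched in plain Python. For a weight matrix of shape out × in, LoRA trains two factors of shape rank × in and out × rank. The layer dimensions below (hidden size 4096, 32 layers, 8 key/value heads of size 128) are assumptions based on the published Mistral-7B configuration, not values taken from this article; the helper names are illustrative.

```python
import math

# Assumed Mistral-7B dimensions (not from the article)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x head_dim 128


def lora_params_per_matrix(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (rank x in) + B (out x rank)."""
    return rank * (in_features + out_features)


def lora_trainable_params(rank: int) -> int:
    """Total trainable parameters when adapting q_proj and v_proj in every layer."""
    q_proj = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    v_proj = lora_params_per_matrix(HIDDEN_SIZE, KV_DIM, rank)
    return NUM_LAYERS * (q_proj + v_proj)


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(eval_loss)


if __name__ == "__main__":
    print(f"rank 16 trainable params: {lora_trainable_params(16):,}")
    print(f"effective batch size:     {effective_batch_size(4, 8)}")
    print(f"perplexity at loss 1.0:   {perplexity(1.0):.3f}")
```

Under these assumptions the adapter size scales linearly with the rank, so doubling `r` doubles the trainable parameter count while the frozen base model stays the same.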
Distributed Training With Horovod on Spark

For datasets beyond a few million tokens, single-node training can take longer than your schedule allows. Horovod distributes training across multiple GPU workers using ring-allreduce: each worker holds a full model replica, and gradients are averaged across workers after every backward pass. Because every worker keeps a full replica, this shortens wall-clock time but does not reduce per-GPU memory; a model that does not fit on one GPU needs a sharding approach (for example DeepSpeed ZeRO or PyTorch FSDP) instead. Check your runtime first: Databricks has deprecated HorovodRunner in newer ML runtimes in favor of TorchDistributor.

```python
# ── Distributed Fine-Tuning with Horovod on Spark ────────────────────────────
# Best for: large datasets, or when you need to reduce wall-clock
# training time below a business SLA. The model must fit on a single GPU.

from sparkdl import HorovodRunner

def train_fn(hparams):
    """
    Training function executed on each Horovod worker.
    Each worker trains on a data shard; gradients are averaged across workers.
    Treat this as a sketch: Trainer has no built-in Horovod integration.
    """
    import mlflow
    import torch
    import horovod.torch as hvd
    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
    from datasets import load_from_disk

    hvd.init()

    # local_rank picks the GPU on this node; rank is the global worker index
    local_rank = hvd.local_rank()
    rank = hvd.rank()
    torch.cuda.set_device(local_rank)

    # Load the dataset shard for this worker (assumes one shard per global rank;
    # indexing by local_rank would reuse shards 0..3 on every node)
    dataset = load_from_disk(f"/dbfs/tmp/llm-finetune/train_shards/shard_{rank}")

    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.bfloat16,
    ).to(f"cuda:{local_rank}")

    # Wrap optimizer with Horovod DistributedOptimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=hparams["lr"])
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        compression=hvd.Compression.fp16,   # compress gradient communication
    )

    # Broadcast initial model weights from rank 0 to all workers
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    training_args = TrainingArguments(
        output_dir="/dbfs/tmp/llm-finetune/hvd_output",
        num_train_epochs=hparams["epochs"],
        per_device_train_batch_size=hparams["batch_size"],
        bf16=True,
        dataloader_num_workers=2,
        # Only rank 0 logs and saves, which avoids duplicated artifacts
        report_to="mlflow" if rank == 0 else "none",
        save_strategy="epoch" if rank == 0 else "no",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        optimizers=(optimizer, None),
    )
    trainer.train()

    # Only rank 0 registers the model
    if rank == 0:
        tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
        mlflow.transformers.log_model(
            transformers_model={"model": model, "tokenizer": tokenizer},
            artifact_path="model",
            registered_model_name=MODEL_NAME,
        )

# Launch distributed training across N GPU workers
# np = number of processes = number of GPUs across all nodes
hr = HorovodRunner(np=8, driver_log_verbosity="all")   # 8 GPUs (e.g., 2 × 4-GPU nodes)
hr.run(train_fn, hparams={
    "lr": 2e-5,
    "epochs": 3,
    "batch_size": 2,   # per GPU; effective = 2 × 8 = 16
})
```

MLflow Model Registry and Promotion

Once a run completes, models land in the MLflow Model Registry. With Unity Catalog, promotion uses model aliases (here candidate, staging, and champion) instead of the legacy stage model.
```python
# ── Model Registry Promotion Workflow ─────────────────────────────────────────
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get the newest registered version. Unity Catalog models have no stages,
# so search the versions instead of relying on latest_versions.
versions = client.search_model_versions(f"name='{MODEL_NAME}'")
latest_version = str(max(int(mv.version) for mv in versions))

# Tag the new version as a candidate for review
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="candidate",
    version=latest_version,
)
client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="fine_tuned_on",
    value="gold.support_conversations",
)
client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="eval_loss",
    value=str(round(eval_results["eval_loss"], 4)),
)

# After human review / automated eval gates pass: promote to staging
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="staging",
    version=latest_version,
)

# After integration tests pass: promote to champion (production)
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="champion",
    version=latest_version,
)

# Load model by alias, which decouples code from version numbers
champion_model = mlflow.transformers.load_model(f"models:/{MODEL_NAME}@champion")
```

Serving With Databricks Model Serving

```python
# ── Deploy to Databricks Model Serving ────────────────────────────────────────
# Can also be done via the UI: Models > Serving > Create Endpoint
import json
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = dbutils.secrets.get("prod-scope", "databricks-token")

endpoint_config = {
    "name": "support-intent-classifier",
    "config": {
        "served_models": [
            {
                "name": "mistral-7b-lora-champion",
                "model_name": MODEL_NAME,
                "model_version": latest_version,
                "workload_size": "Small",          # concurrency tier
                "scale_to_zero_enabled": True,
                "workload_type": "GPU_LARGE",      # GPU class; hardware varies by cloud
            }
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "mistral-7b-lora-champion", "traffic_percentage": 100}
            ]
        },
        "auto_capture_config": {
            "catalog_name": CATALOG,
            "schema_name": "ml",
            "table_name": "support_classifier_inference_log",
            "enabled": True,   # log all requests/responses to Delta
        },
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    data=json.dumps(endpoint_config),
)
print(response.json())

# ── Query the endpoint ────────────────────────────────────────────────────────
def classify_intent(message: str) -> str:
    payload = {
        "inputs": {"prompt": f"[INST] Classify the intent of this support message:\n\n{message} [/INST]"},
        "params": {"max_new_tokens": 50, "temperature": 0.1},
    }
    resp = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/support-intent-classifier/invocations",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=60,
    )
    resp.raise_for_status()   # surface HTTP errors instead of a confusing KeyError
    return resp.json()["predictions"][0]

print(classify_intent("My order hasn't arrived and it's been 10 days"))
# e.g. "shipping_delay" (the actual label depends on the trained model)
```

Comparing Fine-Tuning Strategies

| Strategy | GPU Memory | Training Time | Quality vs Full FT | When to Use |
|---|---|---|---|---|
| Full Fine-Tuning | Very High (80GB+) | Slowest | Baseline (100%) | Max quality, large budget |
| LoRA | Medium (24–40GB) | Fast | ~95% | Best general-purpose choice |
| QLoRA (4-bit + LoRA) | Low (10–16GB) | Medium | ~90–93% | Single GPU, cost-sensitive |
| Prefix Tuning | Low | Very Fast | ~80–85% | Minimal compute, quick iteration |
| Prompt Tuning | Very Low | Fastest | ~70–80% | Inference-only, no weight change |
| RLHF / DPO | High | Slowest | Best alignment | Instruction-following quality |
| Distillation | Medium (teacher) | Medium | Varies | Smaller, faster inference model |

Rule of thumb: Start with QLoRA on a single GPU. If eval loss stagnates or quality gates fail, move to LoRA on multi-GPU. Full fine-tuning is only warranted when you have >1M high-quality labeled examples and a measurable business case for the incremental quality gain.
Key Takeaways

- Spark handles data at scale before training even begins: filtering, prompt formatting, and splitting across millions of records in minutes.
- LoRA and QLoRA make fine-tuning 7B–13B models accessible on a single A100, reducing memory footprint by ~70% with minimal quality loss.
- Setting report_to="mlflow" in TrainingArguments gives you automatic experiment tracking with a single argument: every loss curve, gradient norm, and learning rate schedule is captured.
- Unity Catalog model aliases (candidate → staging → champion) replace brittle version-number references in deployment code, making promotions and rollbacks a one-liner.
- Auto Capture on Databricks Model Serving logs every inference request and response to a Delta table, giving you a feedback loop to build your next training dataset.
- Horovod on Spark is the right tool when single-node training time exceeds your SLA: it leverages your existing Spark cluster without a separate orchestration layer.

References

- Databricks: LLM Fine-Tuning on Databricks
- MLflow: Transformers Flavor Documentation
- Hugging Face PEFT: LoRA & QLoRA
- QLoRA Paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- LoRA Paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- Databricks: Model Serving (Foundation Model APIs)
- Horovod on Spark: Official Documentation
- Databricks: HorovodRunner API
- Databricks: Inference Tables (Auto Capture)
- "Training language models to follow instructions with human feedback": InstructGPT / RLHF (OpenAI, 2022)

By Jubin Abhishek Soni


The New Senior Developer Job Description: Half Engineer, Half AI Systems Architect

She had everything on the list. Eight years of experience. Strong systems design. Distributed architecture under her belt. The panel interview went well — one of the hiring managers later described it as the best technical conversation they'd had with a candidate all quarter. The team passed on her. Two weeks later, during a casual conversation with that hiring manager, the reason came out. It wasn't her architectural skills or her communication. It was a question someone had slipped in near the end: "Walk us through how you'd set up an AI-assisted code review pipeline for a team that ships twelve microservices." She described doing it manually. The other finalist described standing up an orchestration layer with context-aware models, configuring fallback thresholds, and building observable feedback loops that trained the team's prompt library over time. Same job title. Completely different mental model of what the job now involves. That story isn't unique. It captures something that's been happening gradually over the past eighteen months and then very suddenly in the last six: the senior developer role has quietly split into two jobs. One of them is the job we all trained for. The other is the job that a meaningful portion of your working week now actually requires. And the gap between developers who've accepted that and developers who haven't is becoming very hard to explain away in performance conversations. The Split That Happened Without a Memo Let's be specific about what the "AI Systems Architect" half of the role actually means, because people either over-mystify it or undersell it. It doesn't mean you become a data scientist. It doesn't mean you're fine-tuning models or writing PyTorch. Those are real jobs — they're just different jobs. 
What it means is something more operational and less glamorous: you are now responsible for designing, maintaining, and improving the systems of AI assistance that your team works inside of, not just the code that the team produces. That sounds abstract until you break it into daily decisions. Which tasks should be fully AI-generated versus AI-assisted versus AI-reviewed only? Where are your model's blind spots for your specific codebase, and how do you account for them in code review? When a junior developer on your team gets a plausible-but-wrong architectural suggestion from an AI assistant, what's the escalation path? How do you measure the quality of your team's prompting over time? These aren't rhetorical questions — they're operational ones that live teams are answering right now, often badly, because no one assigned anyone to own them. Senior developers are getting assigned to own them. Not officially. Not with updated job descriptions. Just through the ordinary mechanism of "this problem needs solving, and you're the most experienced technical person in the room." What "AI Systems Architect" Actually Means Day to Day The phrase sounds bigger than the practice. What it actually breaks down to is four interconnected responsibilities that are now landing on senior developers, whether they want them or not. First: workflow design. Someone has to decide which parts of the development cycle use AI assistance, at what level of autonomy, and with what human checkpoints. At most companies, this currently happens by accident — everyone develops their own habits, and nobody compares notes. The developers who are stepping into the architect half of the role are the ones making that deliberate, rather than emergent. Second: model selection and configuration. Not fine-tuning, but product-level decisions: which models for which tasks, what context window strategy, how to handle codebases that exceed context limits, what fallback behavior looks like. 
These are practical engineering decisions that live in the space between "developer tool choice" and "infrastructure decision." They belong to senior engineers. Third: quality governance. AI-generated code introduces a new failure mode: plausible-looking outputs that are subtly wrong. The patterns of wrongness are specific and learnable. Senior developers who have mapped the failure modes of their AI tooling — the kinds of edge cases it consistently misses, the naming convention assumptions it gets backward, the security patterns it handles confidently and incorrectly — are providing a form of institutional knowledge that is genuinely hard to replace. Fourth: team prompting culture. This is the one nobody talks about at conferences yet, but engineering managers across the industry have been mentioning it consistently over the past six months: the quality variance in how different team members prompt their AI tools is enormous, and it compounds. Senior developers who build and maintain shared prompt libraries, who do prompt review the way they do code review, who can diagnose why a colleague got a bad output — those developers are operating as a force multiplier for the entire team, not just themselves. The Job Description Before and After: A Concrete Comparison This is worth making explicit. Analysis of actual senior engineer job postings — anonymized, from companies between 80 and 1,200 employees — shows a clear shift when comparing what the role requirements looked like in early 2023 versus what's being written now. The change is real and measurable. The pattern across all of it: the what of the role hasn't changed so much as the how and the governance around it. Senior developers are still responsible for the same categories of work. They're now also responsible for the design of the AI-assisted systems that help a team do that work, and for the failure modes those systems introduce. 
The New Core Competency Stack Here's what the competency model looks like in practice when you lay it out. The traditional side should feel familiar. The AI architecture side probably contains a few items you haven't formally owned yet — but if you've been doing this job for more than two years and paying attention, you've been building these skills without realizing it. The Salary Premium Is Already Real Compensation data lags reality by about eighteen months, so take specific numbers here with appropriate skepticism. What industry reporting suggests is that a clear pattern is emerging: developers who can demonstrably operate in both halves of the new role — not just use AI tools personally, but architect AI-assisted workflows for a team — are commanding a premium that's running somewhere between 18% and 31% above their single-track counterparts at the same years-of-experience mark. That range is wide. The premium is highest in companies that have recently invested in AI transformation initiatives and learned, the hard way, that "everyone uses Copilot" is not the same as "we have a coherent AI engineering strategy." Those companies are specifically recruiting for systems architect skills because they've already paid for the gap. How to Build the Second Half of the Job Nobody teaches this in a course yet. There are some good books and a growing number of blog posts, but the skills are mostly developed through deliberate practice and iteration. Based on teams that have successfully made this transition, here's what works. The starting point is mapping your team's current AI-assisted work honestly. Not aspirationally — honestly. Which tasks are you and your team currently doing with AI assistance? Where does the output go without sufficient review? What are the categories of error you've caught, and what categories might you be missing? This audit, done once and updated quarterly, is the foundation of a governance practice. 
From there, the most leveraged thing most senior developers can do is build a shared prompt library for their most common task types. Not a personal one — a shared one, with a versioning and review practice attached. The discipline of reviewing a colleague's prompt and explaining why it produced a wrong output is one of the fastest ways to build the mental model you need for the governance half of the role.

By Dinesh Elumalai

CORE

An Ingredient List Doesn't Stop the Worm: What SBOMs Can and Can't Do

On March 28, 2024, a Microsoft engineer named Andres Freund noticed something almost nobody would have bothered chasing: SSH logins on a system he was benchmarking were taking 500 milliseconds instead of the usual 100. He ran a memory profiler out of irritation more than suspicion, traced the slowdown to liblzma, the compression library bundled with xz-utils, and within a day had uncovered a backdoor planted by a maintainer who'd spent roughly two years earning the trust required to slip it in. The resulting CVE, 2024-3094, drew a perfect CVSS score of 10.0. It also handed the software security world an uncomfortable case study, one I still bring up whenever someone tells me their SBOM program has supply-chain risk handled. Here's why it's uncomfortable: an SBOM generated against the compromised xz-utils 5.6.1 release would have listed exactly that — xz-utils, version 5.6.1 — and it would have been completely accurate. The component was real, the version was real, and the entry would have sailed through every automated check looking for known-bad packages, because nobody knew it was bad yet. The malicious code wasn't an undisclosed dependency. It was hidden inside the build instructions of a package everyone already trusted, smuggled in through doctored upstream release tarballs rather than the public git history reviewers were actually watching. The ingredient list was correct. The ingredient was poisoned. Those are different problems, and conflating them is how organizations end up with a false sense of coverage. What the List Actually Buys You I don't want to undersell SBOMs here, because the underlying idea is sound and the win is real when an incident actually hits. When Log4Shell detonated in December 2021, the organizations that recovered fastest weren't necessarily the most sophisticated — they were the ones who could answer "where does Log4j live in our environment" in minutes instead of weeks, because someone had already built the inventory. 
That's the entire value proposition in one sentence: an SBOM turns "do we use this component, and where" from an open-ended archaeology project into a query. That value is now backed by regulatory teeth on both sides of the Atlantic. U.S. Executive Order 14028 pushed federal software vendors toward SBOM delivery starting in 2021, and the EU's Cyber Resilience Act has since raised the stakes for anyone selling software with digital elements into the European market: vulnerability and incident reporting obligations begin September 11, 2026, and the full SBOM and secure-by-design requirements land on December 11, 2027, backed by fines that can reach €15 million or 2.5 percent of global turnover. Compliance teams I talk to are treating this less as a paperwork exercise and more as a forcing function, which is the right instinct. But forcing functions only produce good outcomes if people understand what the artifact actually does — and what it was never built to do. When the Ingredient List Becomes the Worm If xz-utils illustrates a poisoned ingredient sitting still inside a static list, the npm ecosystem spent the back half of 2025 demonstrating what happens when the poison starts moving on its own. On September 15, security researchers identified a self-replicating piece of malware that came to be called Shai-Hulud, which spread by stealing developer credentials and npm publishing tokens, then using those tokens to inject itself into every other package the compromised maintainer had access to — silently republishing trojanized versions across the registry. It traced back to an account-takeover incident from late August known as the s1ngularity/Nx compromise, and by the time researchers had mapped it, more than 500 packages had been touched, including infrastructure used by CrowdStrike. 
Unit 42 later assessed, with moderate confidence, that the malicious shell script itself had been drafted with the help of an LLM — based on the comments and emoji left in the code, which is the kind of detail that makes this beat simultaneously fascinating and exhausting to cover. The worm didn't stay down. A second wave — Shai-Hulud 2.0 — surfaced in late November 2025, this time executing during the pre-install phase rather than post-install, which widened its reach into CI/CD pipelines well before any human reviewed the package contents. By the time defenders had a handle on it, the campaign had touched more than 25,000 GitHub repositories across roughly 350 accounts. Sonatype's 2026 State of the Software Supply Chain report puts the broader trend in context: more than 454,000 newly identified malicious packages in 2025 alone, pushing the cumulative known total past 1.2 million across npm, PyPI, and similar registries — a haul that reportedly even included output from North Korea's Lazarus Group, which alone published several hundred trojanized npm packages over the year. This is where the metaphor in this piece's title stops being a metaphor. An SBOM is a snapshot taken at build time. A self-propagating worm doesn't wait for your next build. By the time your inventory catches up to what's actually running in production, the compromised version may already have spread three hops further than the document describing it. Why Signing and Provenance Close Part of the Gap The honest fix isn't a better SBOM. It's pairing the SBOM with proof of where the artifact actually came from, which is what the Sigstore project and the SLSA framework exist to provide. 
Sigstore's components do three specific jobs: Fulcio issues short-lived signing certificates tied to a developer or CI identity via OIDC, instead of the long-lived private keys that inevitably end up mismanaged; Cosign signs and verifies the resulting artifacts; and Rekor records every signing event in a public, append-only transparency log, so a substituted artifact leaves a visible gap rather than a silent one. SLSA layers maturity levels on top of that: Level 2 is now realistic to reach in an afternoon on GitHub Actions, largely because GitHub's native attestation support has matured since 2024, and the Linux Foundation pushed out SLSA 1.2 in late 2025 with more granular tracking for both build and source provenance. Run the GhostAction incident from earlier in 2025 through that lens, and the gap becomes obvious. Attackers compromised a widely used third-party GitHub Action and modified its workflow code to exfiltrate secrets, and because downstream repositories had pinned that action by a mutable version tag rather than an immutable commit SHA, every project referencing @v1 automatically pulled the poisoned update with zero additional effort from the attacker. Signed provenance tied to a specific, verified commit wouldn't have stopped someone from compromising the upstream repository — but it would have made the substitution detectable the moment a consuming pipeline tried to verify what it was actually pulling, instead of trusting a tag that anyone with write access could quietly repoint. What a Mature Pipeline Actually Refuses to Run The pattern I'd point any engineering leader toward right now isn't exotic, it's just rarely implemented end to end: nothing gets promoted unless it clears a gate that checks signature, provenance, and SBOM together, not any one of the three in isolation. 
Plain Text

Source Commit
      |
      v
Build System
      |----generate----> SBOM (CycloneDX/SPDX)
      |
      |--sign via Cosign---> Signature + SLSA Provenance (Rekor log)
      |
      v
Deploy Gate <----checks all three----> [Signature valid? Provenance matches? SBOM clean of known CVEs?]
      |
      PASS --------> Production
      |
      FAIL --------> Blocked, alert raised, artifact quarantined

Notice what that gate is actually doing: it isn't asking "do we have an SBOM," which is a yes/no compliance question. It's asking whether the artifact about to run matches the provenance it claims, whether that provenance traces to an approved build system, and whether the components it declares are still considered safe as of right now rather than as of whenever the document was generated. Kubernetes admission controllers and policy-as-code tools can enforce exactly this today, refusing to schedule any image lacking a valid signature, with human review reserved for the exceptions the policy can't resolve automatically.

The Part Nobody Wants to Hear

SolarWinds remains the cautionary tale everyone reaches for, and fairly: the absence of meaningful supply-chain visibility let that compromise propagate to roughly 18,000 customers before anyone outside the attackers understood the scope. But I'd argue the more instructive lesson of the past two years is the opposite kind of failure: organizations that have an SBOM, dutifully generated at every release, sitting in a compliance folder nobody has reopened since. Cloudsmith's research into current practice keeps surfacing the same pattern: SBOMs produced once at build time and then never looked at again, which makes them a point-in-time artifact masquerading as an ongoing control.
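To make the deploy gate described above concrete, here is a minimal sketch of a promotion check that only passes when signature, provenance, and SBOM contents all clear together. Everything in it is hypothetical: the `Artifact` fields, the `APPROVED_BUILDERS` set, and the boolean `signature_valid` (which stands in for the result of a real signature verification) are illustrative assumptions, not the API of Cosign, SLSA tooling, or any admission controller.

```python
from dataclasses import dataclass

# Hypothetical allow-list of build systems whose provenance is trusted.
APPROVED_BUILDERS = {"https://ci.example.com/trusted-builder"}


@dataclass(frozen=True)
class Artifact:
    digest: str
    signature_valid: bool      # assumed: outcome of a prior signature check
    provenance_builder: str    # assumed: builder identity read from provenance
    components: tuple          # (name, version) pairs declared in the SBOM


def gate(artifact, known_bad):
    """Return (passed, reasons); the artifact passes only if all three checks do."""
    reasons = []
    if not artifact.signature_valid:
        reasons.append("signature invalid")
    if artifact.provenance_builder not in APPROVED_BUILDERS:
        reasons.append("provenance builder not approved")
    # Evaluate the SBOM against what is known to be bad right now,
    # not what was known when the SBOM was generated.
    flagged = [c for c in artifact.components if c in known_bad]
    if flagged:
        names = ", ".join(f"{name} {version}" for name, version in flagged)
        reasons.append("components flagged: " + names)
    return (not reasons, reasons)


artifact = Artifact(
    digest="sha256:abc",
    signature_valid=True,
    provenance_builder="https://ci.example.com/trusted-builder",
    components=(("xz-utils", "5.6.1"),),
)
print(gate(artifact, known_bad=set()))
print(gate(artifact, known_bad={("xz-utils", "5.6.1")}))
```

The same artifact passes before the xz-utils backdoor is known and fails afterward, which is why the check has to be re-run against current intelligence instead of being settled once at build time.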
My honest prediction for the next eighteen months: the EU's reporting deadline this September is going to force more genuine automation into supply-chain pipelines than three years of SBOM evangelism managed on its own, simply because a 24-hour reporting clock doesn't tolerate a quarterly spreadsheet review. Regulation rarely produces elegant security architecture. It does, reliably, produce urgency — and on this particular problem, urgency has been in short supply for exactly the wrong reason: the list looked complete, so everyone assumed the kitchen was safe.

By Igboanugo David Ugochukwu

CORE

A Low-Latency Routing Pattern for Multiple Small Language Models

A multi-SLM platform creates value only when specialization does not introduce a new latency tier. Small language models are inexpensive enough to dedicate to focused work such as extraction, code handling, safety filtering, or short-form reasoning, but that advantage disappears if model selection itself becomes expensive. Research on LLM routing shows that query difficulty varies enough for model choice to materially affect efficiency and quality, and modern serving stacks expose enough control over routing, batching, and cache locality to turn that insight into an operational design rather than an academic one. In practice, the routing layer has to behave like a tiny data-plane decision engine, not like another inference hop. Why Multiple SLMs Need Routing A single small model rarely gives the best latency-quality trade-off for every prompt type. Short structured requests, such as JSON extraction and classification, differ sharply from code repair, and both differ again from prompts that need broader reasoning. RouteLLM describes routing as assigning simpler queries to weaker models and reserving stronger models for harder cases, while FrugalGPT reports that a learned cascade can preserve strong-model quality with very large cost reductions. Although those papers evaluate broader LLM portfolios, the underlying lesson transfers cleanly to a fleet of small specialized models: heterogeneity in request shape makes heterogeneity in model choice economically and operationally rational. That conclusion rules out a router that behaves like another generative model call. RouteLLM explicitly treats effective routing as a pre-decision that minimizes cost and latency relative to broader multi-model execution, which means the dominant path should remain inside in-memory feature extraction and lookup. 
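As a rough illustration of what "in-memory feature extraction" can mean, the sketch below derives a few routing signals from a prompt using only string operations and precompiled regular expressions. The feature names, the regex patterns, and the four-characters-per-token estimate are assumptions chosen for the example, not values taken from RouteLLM or any serving stack.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable

# Hypothetical markers suggesting the prompt is about code.
CODE_MARKERS = re.compile(
    re.escape(FENCE)
    + r"|Traceback \(most recent call last\)|\berror:"
    + r"|\b[\w./-]+\.(?:py|java|ts|go|rs)\b"
)
# Hypothetical hints that the caller wants structured output.
JSON_HINTS = re.compile(r"\bjson\b|\bschema\b", re.IGNORECASE)


@dataclass(frozen=True)
class RoutingFeatures:
    token_estimate: int
    code_request: bool
    structured_output: bool


def extract_features(prompt: str) -> RoutingFeatures:
    """Compute cheap routing signals without invoking any model."""
    return RoutingFeatures(
        token_estimate=max(1, len(prompt) // 4),  # crude length-based estimate
        code_request=bool(CODE_MARKERS.search(prompt)),
        structured_output=bool(JSON_HINTS.search(prompt)),
    )


print(extract_features("Return JSON with name and age"))
print(extract_features("Fix " + FENCE + "python\nprint(1)\n" + FENCE))
```

Each check is a single pass over the prompt, so the cost scales with prompt length and stays far below a model invocation.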
Prompt length, requested output shape, language, code markers, safety category, session identity, and prior cache affinity are all signals that can be computed before any model is invoked. A practical design target is to keep that first decision under a millisecond, so its cost remains far below prefill and decode work. The moment the main path depends on an additional model inference, the latency budget starts competing with the very SLM call it is supposed to optimize. Keeping the Decision Path Short The cleanest design is a two-stage router. The first stage is deterministic and resolves obvious cases immediately. A short request demanding strict JSON can go to an extraction model. A prompt containing fenced code, compiler errors, or repository paths can go to a code model. A safety-sensitive request can be pinned to a policy model. Only when simple predicates fail to produce a confident mapping should the second stage run, and that second stage should be a lightweight complexity scorer rather than another generator. Ray Serve’s request-routing API is built around this kind of custom replica selection, and its FIFO mixin is specifically intended for algorithms that can route requests as soon as they arrive without waiting for content-heavy processing. That is the right shape for an ultra-low-latency router: deterministic fast path first, optional scorer second. A routing metadata object makes that design practical because it compresses request interpretation into cheap primitives: Java record RoutingContext( int tokenCount, boolean codeRequest, boolean structuredOutput, String language, boolean repeatedPrefix, double complexityScore ) {} This record is deliberately plain. Primitive fields are cheap to serialize, cheap to log, and easy to replay during debugging. 
That choice aligns with PyTorch and vLLM production notes on disaggregated serving, where complex metadata objects in scheduler paths increased serialization cost and hurt inter-token behavior, and it fits the general shape of request routers that repeatedly rank candidate replicas under load. The complexityScore field should therefore come from a compact classifier or calibrated heuristic trained offline on task outcomes, escalation rates, or preference labels, not from a runtime SLM call. The router’s intelligence belongs in the thresholds and features, not in an extra generation step. The routing function should then read like admission control rather than orchestration: Java ModelTarget route(RoutingContext ctx) { if (ctx.structuredOutput() && ctx.tokenCount() < 800) return ModelTarget.EXTRACTION_SLM; if (ctx.codeRequest()) return ModelTarget.CODE_SLM; if (ctx.complexityScore() > 0.72) return ModelTarget.REASONING_SLM; if (ctx.repeatedPrefix()) return ModelTarget.GENERAL_SLM_CACHE_HOT; return ModelTarget.GENERAL_SLM; } The important detail is ordering. The cheapest predicates run first, the optional scorer appears only after clear task signals have been checked, and cache affinity refines the generic path instead of overriding obvious specialization. That mirrors how high-performance request routers rank candidates and then filter out replicas that are already saturated. Thresholds should be calibrated from observed latency and task-success data, but the architectural rule is stable: most traffic should leave the router with a decision produced entirely from fields already in memory. Making Selection Cache-Aware Cache-aware selection is where routing often starts to produce visible latency gains. 
vLLM’s automatic prefix caching reuses KV cache from earlier queries when a new request shares the same prefix, allowing shared prompt computation to be skipped, and its design notes describe prefix caching as close to a free lunch because it avoids redundant work without changing outputs. SGLang reaches a similar result with RadixAttention, which keeps reusable KV state in a radix tree, adds LRU eviction, and applies cache-aware scheduling to improve hit rate while introducing only negligible overhead when no cache hit occurs. That combination matters because a fast model on a warm prefix can easily outperform a nominally better model on a cold path. Routing without cache awareness, therefore, leaves substantial latency savings on the table. That is why a field such as repeatedPrefix, promptFamilyId, or session hash belongs in the routing context. Ray Serve exposes locality-aware and multiplex-aware helpers so that requests can prefer nearby replicas or replicas that already hold the relevant model, and Meta’s PyTorch and vLLM production write-up reports that sticky routing of the same session to the same prefill host significantly boosts prefix-cache hit rate, reaching 40% to 50% hit rate in the described deployment. The practical lesson is broader than that specific architecture. Similar prompt families should be steered toward the same warm replicas whenever possible, even if a purely load-balanced policy would have spread them evenly. Equal distribution is not the same thing as minimal latency once KV reuse becomes available. Keeping the System Fast in Production Once the routing logic is correct, the queueing policy and replica shape become the next sources of latency. Triton documents that dynamic batching combines requests to maximize throughput and allows bounded queue delay, while concurrent model execution and instance groups allow multiple copies of the same model to run in parallel on selected devices. 
That argues for selective rather than universal batching. Short extraction or moderation SLMs often benefit from aggressive batching because their service time is small and predictable, while interactive reasoning models need tighter queue-delay bounds to prevent batching from inflating p95 latency. Replica placement matters as well. Heavy or frequently chosen models deserve more parallel instances, and cold-start penalties should be reduced through explicit warmup, since Triton notes that model warmup can prevent the slow initial inferences seen before a model is fully initialized. Backpressure and observability complete the design. Ray Serve supports bounded queues and load shedding through max_queued_requests, and its autoscaling guidance ties lower ongoing-request targets to tighter latency objectives. Ray Serve LLM also exposes request latency, throughput, TTFT, and TPOT, while Triton exposes Prometheus metrics for GPU and request behavior. Those signals should be segmented by routed model, decision path, cache-hit class, and warm versus cold replica so that routing regressions become visible before they surface as user-facing tail latency. Without route-level telemetry, an apparently accurate router can quietly push traffic onto cold replicas, oversized queues, or cache-miss-heavy paths. In a low-latency SLM system, observability is not just for debugging. It is the only reliable way to keep routing policy aligned with actual serving behavior. Conclusion An ultra-low-latency routing layer for multiple SLMs is best treated as a serving primitive rather than as a separate intelligence feature. The strongest design keeps most requests on a deterministic first stage, invokes a lightweight complexity scorer only for ambiguous prompts, represents route state with compact metadata, and treats prefix locality as a first-class selection signal. 
Around that core, warm replicas, selective batching, bounded queues, and route-level observability determine whether specialization actually improves latency or merely rearranges it. When routing is cheaper than a single token step and cache locality is preserved instead of ignored, a multi-SLM system stops looking like a collection of models and starts behaving like a disciplined low-latency inference fabric.
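The cache-aware selection discussed above, steering similar prompt families toward the same warm replica, can be sketched with rendezvous (highest-random-weight) hashing. This is one possible technique, not the mechanism Ray Serve, vLLM, or SGLang actually use; the function name, the replica identifiers, and the `saturated` parameter are hypothetical.

```python
import hashlib


def pick_replica(prefix_key, replicas, saturated=frozenset()):
    """Choose a stable replica for a prompt family, skipping saturated ones."""
    # Fall back to the full list if every replica is saturated.
    candidates = [r for r in replicas if r not in saturated] or list(replicas)

    def score(replica):
        digest = hashlib.sha256(f"{prefix_key}|{replica}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest score wins; the same key always ranks replicas the same way.
    return max(candidates, key=score)


replicas = ["replica-a", "replica-b", "replica-c"]
key = "tenant-42:system-prompt-v3"
home = pick_replica(key, replicas)
print(home, pick_replica(key, replicas, saturated=frozenset({home})))
```

Because each key ranks replicas independently, removing a replica that was not the winner leaves the assignment unchanged, so warm KV caches are not shuffled when the pool changes elsewhere.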

By Akhil Madineni

How Agent Frameworks Solve Human-in-the-Loop

When we demo an agentic product, it always looks clean and clear: the agent pauses, the human approves or rejects, and execution continues. But what happens when the human actually says no?

Human-in-the-loop (HITL) sounds like a single feature. In practice, it covers a wide design space:

- Do you pause mid-execution or notify asynchronously?
- Is the human a peer agent or an external approver?
- Can the human edit the action, or only approve or reject it?
- Does the framework resume execution exactly where it paused, or somewhere else?

These questions yield different answers across the major agent frameworks, and those answers have very real production consequences. I assumed that all frameworks would converge on a single pattern for HITL design, but I found them to be very different. This article compares six frameworks and their implementations of HITL.

What You Will Learn

By the end of this article, you will be able to:

- Distinguish the three fundamental HITL patterns (durable graph interrupt, message-loop injection, and blocking gate) and know which framework implements each.
- Read working code for all six frameworks and understand exactly where execution pauses and how it resumes in each.
- Pick the right framework for your use case.

The Fundamental Divide

The three distinct HITL patterns can be described as follows:

- Durable graph interrupt: The framework serializes the entire graph state and suspends at the exact node where approval is needed. Nothing happens until a decision is made. If the process exits, the state is preserved in an external checkpointer, and the run resumes from the point of suspension.
- Message-loop injection: There is no suspension as such. The human acts as a first-class participant in a multi-agent conversation, contributing a reply like any other agent. The loop runs continuously, and the human response is just another round.
- Blocking gate/run-termination: The framework pauses or ends the run cleanly at a designated point, either blocking in process until the caller responds or terminating and returning an approval-pending object that the human needs to resolve before resuming. Resuming the run is the human's responsibility.

| framework | pattern | true suspension | human can edit action | resumable after process restart |
| --- | --- | --- | --- | --- |
| deepagents | Graph interrupt (LangGraph) | ✓ | ✓ approve / edit / reject | ✓ |
| Agno | HumanReview on Step/Loop | ✓ | Partial | ✗ |
| AutoGen | UserProxy agent (message loop) | ✗ | ✓ via messages | ✗ |
| OpenAI Agents SDK | needs_approval interrupt | Partial | ✗ | Partial |
| CrewAI | step_callback + human_input on Task | ✗ | ✗ | ✗ |
| Pydantic AI | Deferred tools (requires_approval) | Partial | ✗ | ✗ |

deepagents + LangGraph: graph-level interrupt

Installation:

Shell

pip install deepagents langgraph
# Python >=3.10 required
# Docs: https://docs.langchain.com/oss/python/deepagents/human-in-the-loop

Figure: deepagents resume/interrupt sequence

deepagents uses LangGraph's interrupt() primitive. When the model produces a tool call that requires approval, execution suspends at that exact graph node. The serialized state is stored via a LangGraph checkpointer; the process can exit entirely and resume hours later.
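To show the idea behind a durable interrupt without depending on any framework, here is a small, self-contained sketch: a run executes steps until one needs approval, writes its state to disk, and stops; a later call reloads that state and continues from the paused step. The step format, the `run` and `resume` names, and the JSON file used as a checkpoint are all assumptions made for illustration; this is not how LangGraph's checkpointer is implemented.

```python
import json
import tempfile
from pathlib import Path


def run(plan, store, start=0, log=None):
    """Execute steps from `start`; persist state and stop at the first gated step."""
    log = list(log or [])
    for index in range(start, len(plan)):
        step = plan[index]
        if step.get("needs_approval"):
            # Serialize everything needed to continue, then suspend.
            store.write_text(json.dumps({"plan": plan, "paused_at": index, "log": log}))
            return {"status": "interrupted", "pending": step, "log": log}
        log.append(f"ran {step['tool']}")
    return {"status": "done", "log": log}


def resume(store, decision):
    """Reload the checkpoint, apply the human decision, and continue the run."""
    state = json.loads(store.read_text())
    plan, index, log = state["plan"], state["paused_at"], state["log"]
    step = dict(plan[index])
    if decision["type"] == "reject":
        log.append(f"skipped {step['tool']}: {decision.get('message', '')}")
    else:
        if decision["type"] == "edit":
            step["args"] = decision["args"]  # the human replaced the arguments
        log.append(f"ran {step['tool']} with {step['args']}")
    return run(plan, store, start=index + 1, log=log)


store = Path(tempfile.mkdtemp()) / "checkpoint.json"
plan = [
    {"tool": "list_files", "args": {}},
    {"tool": "delete_file", "args": {"path": "/tmp/a.log"}, "needs_approval": True},
    {"tool": "report", "args": {}},
]
first = run(plan, store)
final = resume(store, {"type": "edit", "args": {"path": "/tmp/b.log"}})
print(first["status"], final["log"])
```

Because the checkpoint lives outside the process, `resume` could be called from a different process hours later, which is the property the "resumable after process restart" column in the table refers to.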
Wiring Up the Middleware

Python

from deepagents import create_deep_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, InterruptOnConfig

hitl = HumanInTheLoopMiddleware(
    interrupt_on={
        # True = approve / edit / reject all allowed
        "delete_file": True,
        # Restrict to approve/reject only, with static description
        "run_bash": InterruptOnConfig(
            allowed_decisions=["approve", "reject"],
            description="Review this shell command before execution",
        ),
        # Dynamic description generated from the tool call at runtime
        "send_email": InterruptOnConfig(
            allowed_decisions=["approve", "edit", "reject"],
            description=lambda tool_call, state, runtime: (
                f"Approve sending email to: {tool_call['args'].get('to')}"
            ),
        ),
    }
)

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[hitl],
)

What the Reviewer Sees (HITLRequest Structure)

Python

# Surfaced to the reviewer when delete_file is triggered
{
    "action_requests": [
        {
            "name": "delete_file",
            "args": {"path": "/workspace/output.log"},
            "description": "Tool execution requires approval\n\nTool: delete_file\nArgs: ..."
        }
    ],
    "review_configs": [
        {
            "action_name": "delete_file",
            "allowed_decisions": ["approve", "edit", "reject"]
        }
    ]
}

The Three Decision Types

Python

from langgraph.types import Command

# Approve: run as-is
graph.invoke(
    Command(resume={"decisions": [{"type": "approve"}]}),
    config={"configurable": {"thread_id": "session-123"}},
)

# Edit: change args before running
graph.invoke(
    Command(resume={
        "decisions": [{
            "type": "edit",
            "edited_action": {
                "name": "delete_file",
                "args": {"path": "/workspace/old-backup.log"}
            }
        }]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

# Reject: agent receives explanation and stops retrying
graph.invoke(
    Command(resume={
        "decisions": [{
            "type": "reject",
            "message": "Do not delete production logs. Archive instead."
        }]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

Multi-Tool Batching

If the model calls two tools in the same response, deepagents batches them into a single HITLRequest. One round-trip handles both:

Python

# Single interrupt: two pending actions resolved simultaneously
graph.invoke(
    Command(resume={
        "decisions": [
            {"type": "approve"},
            {"type": "reject", "message": "rm -rf is too broad; use a specific path"}
        ]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

AutoGen v0.4: the UserProxy pattern

Installation:

Shell

pip install autogen-agentchat "autogen-ext[openai]"
# Docs: https://microsoft.github.io/autogen/stable/

AutoGen models the human as a UserProxyAgent, a peer participant in a multi-agent conversation. There is no suspension. The loop runs continuously, and the human turn is when the proxy injects a message.

Figure: AutoGen message-loop HITL

Python

import asyncio

from autogen_agentchat.agents import AssistantAgent, UserProxyAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

assistant = AssistantAgent(
    "assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o"),
    system_message=(
        "You are a helpful agent. Always describe what you are about to do "
        "and ask for confirmation before executing file operations."
    ),
)

# input_func is called when the proxy needs human input
# Replace `input` with an async queue for web applications
user_proxy = UserProxyAgent("human", input_func=input)

team = RoundRobinGroupChat(
    participants=[assistant, user_proxy],
    termination_condition=TextMentionTermination("DONE"),
)

async def main() -> None:
    # team.run is a coroutine, so it must be awaited inside an event loop
    await team.run(task="Clean up old log files in /tmp")

asyncio.run(main())

Limitation: The conversation loop never truly suspends. If the process exits mid-conversation, the state is lost. For async web UIs, you'd need a background thread and an asyncio queue to bridge human input. There's no built-in checkpointing.
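The queue-based bridge mentioned in the limitation above could look roughly like the following. It is a sketch under assumptions: the `HumanInputBridge` class and its method names are hypothetical, it uses two asyncio queues and no background thread, and whether a given AutoGen version accepts an async `input_func` with this `(prompt, cancellation_token)` shape should be checked against its documentation before wiring it into `UserProxyAgent`.

```python
import asyncio


class HumanInputBridge:
    """Hands agent-side prompts to a UI and waits for the human's reply."""

    def __init__(self):
        self.prompts = asyncio.Queue()  # prompts waiting to be shown to a human
        self.replies = asyncio.Queue()  # replies submitted by the human

    async def input_func(self, prompt, cancellation_token=None):
        # Called on the agent side: publish the prompt, then wait without blocking the loop.
        await self.prompts.put(prompt)
        return await self.replies.get()

    async def submit(self, reply):
        # Called by a web handler when the human answers.
        await self.replies.put(reply)


async def demo():
    bridge = HumanInputBridge()
    waiting = asyncio.create_task(bridge.input_func("Delete /tmp/old.log?"))
    prompt = await bridge.prompts.get()  # what a web UI would render
    await bridge.submit("yes")           # what the human typed
    return prompt, await waiting


print(asyncio.run(demo()))
```

This keeps the event loop free while the human is thinking, but it does not add durability: the pending prompt still lives only in process memory, so the state-loss limitation described above remains.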
Agno: HumanReview on Steps

Installation:

Shell

pip install agno
# Docs: https://docs.agno.com/reference/workflows/step

Agno's HITL uses a HumanReview config object attached to workflow steps. It supports confirmation gates before execution, user input collection, and post-execution output review:

Python

from agno.workflow import Workflow, Step
from agno.workflow.types import HumanReview
from agno.agent import Agent
from agno.models.anthropic import Claude

# Remaining agent settings are elided with ...
extract_agent = Agent(name="Extractor", model=Claude(id="claude-haiku-4-5"), ...)
transform_agent = Agent(name="Transformer", model=Claude(id="claude-haiku-4-5"), ...)
load_agent = Agent(name="Loader", model=Claude(id="claude-sonnet-4-6"), ...)

workflow = Workflow(
    name="DataPipeline",
    steps=[
        Step(name="extract", agent=extract_agent),
        Step(name="transform", agent=transform_agent),
        Step(
            name="load",
            agent=load_agent,
            # Pause and require human confirmation before this step runs
            human_review=HumanReview(requires_confirmation=True),
        ),
        Step(
            name="verify",
            agent=load_agent,
            # Pause after execution for a human to review the output
            human_review=HumanReview(requires_output_review=True),
        ),
    ],
)

HumanReview fields:

| field | scope | what it does |
| --- | --- | --- |
| requires_confirmation | Step, Loop, Router, Condition | Pause before the step executes |
| confirmation_message | Step, Loop, Router, Condition | Custom prompt shown to the reviewer |
| requires_user_input | Step, Router | Collect freeform user input before continuing |
| requires_output_review | Step, Router | Pause after execution; accepts bool or Callable[[StepOutput], bool] for conditional review |
| requires_iteration_review | Loop only | Review after each loop iteration |
| on_reject | All | OnReject.skip (default), cancel, or retry (re-run the step with human feedback) |
| on_error | All | OnError.pause triggers HITL on step failure; human decides retry or skip |
| timeout / on_timeout | All | Timeout in seconds; on_timeout is cancel (default), skip, or approve |

Resumability: Agno has no workflow-level checkpoint equivalent to LangGraph's checkpointer.
If the process exits while a step is awaiting human input, the workflow state is lost. Resumability requires external session storage wired by the caller.

OpenAI Agents SDK: needs_approval interrupt

Installation:

Shell

pip install openai-agents
# Docs: https://openai.github.io/openai-agents-python/

The OpenAI Agents SDK uses a needs_approval parameter on function_tool. When set, the run loop pauses and surfaces a ToolApprovalItem that the caller approves or rejects via RunState:

Python

import os

from agents import Agent, function_tool, Runner

@function_tool(needs_approval=True)
def delete_file(path: str) -> str:
    """Delete a file at the given path."""
    os.remove(path)
    return f"Deleted {path}"

# needs_approval can also be a callable for conditional approval
@function_tool(
    needs_approval=lambda ctx, args, call_id: args.get("path", "").startswith("/prod")
)
def write_file(path: str, content: str) -> str:
    """Write content to a file."""
    with open(path, "w") as f:
        f.write(content)
    return f"Wrote {path}"

agent = Agent(
    name="FileAgent",
    instructions="Help the user manage files.",
    tools=[delete_file, write_file],
)

async def run_with_approval():
    result = await Runner.run(agent, "Delete the old backup file")

    if result.interruptions:
        # Convert result to a resumable state, then resolve each pending approval
        state = result.to_state()
        for item in result.interruptions:
            print(f"Approve {item.raw_item.name}({item.raw_item.arguments})? [y/N]: ", end="")
            if input().strip().lower() == "y":
                state.approve(item)
            else:
                state.reject(item, rejection_message="User rejected this action")

        # Resume: pass the mutated state back to Runner as the run input
        result = await Runner.run(agent, state)

    print(result.final_output)

Limitation: The approval flow is approve-or-reject only. There's no structured "edit" decision type. Humans cannot modify tool arguments through the SDK's approval mechanism.
Partial cross-restart resumability is available via state.to_string() / RunState.from_string(), but the caller is responsible for persisting and restoring the serialized state externally.

CrewAI: step_callback + human_input

Installation:

Shell

pip install crewai
# Docs: https://docs.crewai.com/en/concepts/crews

CrewAI has two distinct mechanisms with very different semantics.

step_callback: Observational Only

step_callback fires after each agent step and receives an AgentAction | AgentFinish object. It cannot block or modify the next step:

Python

from crewai import Agent, Crew, Task
from crewai.agents.crew_agent_executor import AgentAction, AgentFinish

def review_step(step: AgentAction | AgentFinish) -> None:
    # Observes each step after it has happened; cannot veto or alter it
    if isinstance(step, AgentAction):
        print(f"Tool used: {step.tool}, input: {step.tool_input}")
    elif isinstance(step, AgentFinish):
        print(f"Agent finished: {step.return_values}")

researcher = Agent(
    role="Researcher",
    goal="Research the topic",
    backstory="An expert researcher.",
    verbose=True,
)

task = Task(description="Research quantum computing trends", agent=researcher)

crew = Crew(
    agents=[researcher],
    tasks=[task],
    step_callback=review_step,
)
crew.kickoff()

human_input: Blocking Task-Output Review

Setting human_input=True on a Task does produce a real synchronous pause.
After the agent finishes its work for that task, execution blocks on input(), and the human can provide free-form feedback before the output is finalized:

Python

from crewai import Agent, Crew, Task

researcher = Agent(
    role="Researcher",
    goal="Research the topic",
    backstory="An expert researcher.",
    verbose=True,
)

task = Task(
    description="Research quantum computing trends and summarize findings.",
    expected_output="A summary of the latest quantum computing developments.",
    agent=researcher,
    human_input=True,  # blocks after agent finishes, before output is accepted
)

crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()
# Agent completes its work, then execution pauses:
# > Please provide feedback on the agent's output (or press Enter to accept):

Key distinction from tool-call-level gates: human_input fires after the agent has already finished the task and all tool calls have already executed. You are reviewing the output, not approving individual actions before they run. The human provides free-form text feedback. There is no structured approve/edit/reject schema, no async queue support, and no state serialization. Because it calls input() directly, it blocks the calling thread, and it is incompatible with async web servers (FastAPI, Starlette) without bridging to a separate thread and queue.

Pydantic AI: Deferred Tools

Installation:

Shell

pip install pydantic-ai
# Docs: https://pydantic.dev/docs/ai/tools-toolsets/deferred-tools/

Pydantic AI has a first-class HITL primitive called Deferred Tools. Mark a tool with requires_approval=True (or raise ApprovalRequired conditionally) and the agent run terminates with a DeferredToolRequests object instead of a final answer. The caller resolves approvals and resumes with the original message history.
Figure: Pydantic AI deferred-tool approval sequence

Declaring Tools That Require Approval

Python
import os
import subprocess

from pydantic_ai import Agent, RunContext
from pydantic_ai.exceptions import ApprovalRequired
from pydantic_ai.tools import DeferredToolRequests

# DeferredToolRequests must be an allowed output type so that a run can end
# with pending approvals instead of a final answer.
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=[str, DeferredToolRequests],
)

# Always requires approval
@agent.tool(requires_approval=True)
async def delete_file(ctx: RunContext, path: str) -> str:
    os.remove(path)
    return f"Deleted {path}"

# Conditional: only requires approval for destructive commands
@agent.tool
async def run_bash(ctx: RunContext, command: str) -> str:
    risky = ["rm", "drop", "truncate"]
    # Skip the check once the call has been approved; otherwise the resumed
    # run would raise ApprovalRequired again.
    if not ctx.tool_call_approved and any(r in command for r in risky):
        raise ApprovalRequired(metadata={"reason": "destructive command detected"})
    return subprocess.check_output(command, shell=True).decode()

Handling DeferredToolRequests and Resuming

Python
from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolDenied

async def run_with_approval():
    result = await agent.run("Delete the old backup file")
    if isinstance(result.output, DeferredToolRequests):
        approvals = {}
        for tool_call in result.output.approvals:
            print(f"Approve {tool_call.tool_name}({tool_call.args})? [y/N]: ", end="")
            if input().strip().lower() == "y":
                approvals[tool_call.tool_call_id] = True
            else:
                # ToolDenied lets you pass a custom message back to the model
                approvals[tool_call.tool_call_id] = ToolDenied(
                    message="User rejected this action - do not retry."
                )
        # Resume: pass original message history + approval decisions
        result = await agent.run(
            message_history=result.all_messages(),
            deferred_tool_results=DeferredToolResults(approvals=approvals),
        )
    print(result.output)

Inline resolution with HandleDeferredToolCalls

For cases where you want to resolve approvals within the same run (e.g., a CLI prompt that doesn't need to persist state), use the HandleDeferredToolCalls capability:

Python
from pydantic_ai.capabilities import HandleDeferredToolCalls
from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolDenied

async def interactive_approver(ctx, requests: DeferredToolRequests) -> DeferredToolResults:
    approvals = {}
    for tool_call in requests.approvals:
        print(f"Approve {tool_call.tool_name}({tool_call.args})? [y/N]: ", end="")
        if input().strip().lower() == "y":
            approvals[tool_call.tool_call_id] = True
        else:
            approvals[tool_call.tool_call_id] = ToolDenied(message="Rejected by user.")
    return DeferredToolResults(approvals=approvals)

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    tools=[delete_file, run_bash],
    capabilities=[HandleDeferredToolCalls(interactive_approver)],
)

Limitation: There is no durable state serialization. If the process exits between the first run (which returns DeferredToolRequests) and the resume, the run cannot be recovered unless the caller has persisted result.all_messages() and the pending tool call IDs externally. There is also no structured "edit" decision type; the human can approve or deny, but cannot modify tool arguments through the SDK.
Choosing the Right Pattern

| Use case | Best fit |
| --- | --- |
| Long-running agent, async human reviewer with durable resume | deepagents only |
| Human needs to edit tool args before execution | deepagents only |
| Step-level gates with on_reject retry loops, no durable resume | Agno HumanReview |
| Conversational co-pilot, real-time back-and-forth | AutoGen |
| Approve/reject specific tools, run stays in-process | OpenAI Agents SDK |
| Approve/reject specific tools, run terminates for async handling | Pydantic AI Deferred Tools |
| Audit logging, no blocking needed | CrewAI step_callback |
| Review task output after agent finishes (not tool-call level) | CrewAI human_input=True |

Conclusion

There is no universal answer to HITL in agent frameworks. The right choice depends on three questions to answer before choosing a framework: at what granularity a human needs to intervene (tool call, step, or task output), whether the reviewer responds in real time or hours later, and whether the process must survive a restart between the interrupt and the resume. If the reviewer responds hours later or the process must survive a restart, deepagents with a LangGraph checkpointer is the only framework compared here that handles both today. For everything else, the landscape is richer than it first appears: Pydantic AI's Deferred Tools give you structured tool-call-level approval without a graph runtime; Agno gives you powerful step-level gates with retry semantics; and OpenAI Agents SDK gives you the simplest possible approve/reject path when you control the process lifecycle.

The mistake most teams make is treating HITL as an afterthought. The primitives each framework exposes are not interchangeable, and switching from an observational callback to a durable interrupt requires rearchitecting the execution model, not just swapping a parameter. The decision tree above is meant to surface that choice before it becomes expensive to undo.

By Ninaad Rao

A Fully Self‑Contained Text Embedding Service in C#

Modern semantic search, retrieval-augmented generation (RAG) pipelines, and large-scale recommendation models rely heavily on embeddings: transformations of natural language text into dense numeric representations called vectors. Embeddings place semantically related text in nearby regions of vector space, which enables similarity computation through distance metrics such as cosine similarity or Euclidean distance.

Cloud-hosted models such as OpenAI's text-embedding-ada-002 provide high-quality vector encodings, but they come with API keys, network latency, and per-token usage costs. In contrast, LocalEmbeddingService does all of its computation inside the host process: no GPUs, no outbound requests, no model files to manage. The method it uses is called the hashing trick (or feature hashing). The same technique is implemented in scikit-learn’s HashingVectorizer.

1. Contract: IEmbeddingService

C#
public class LocalEmbeddingService : IEmbeddingService
{
    public int Dimensions => 512;

The service creates 512-dimensional float vectors. This is intentional: the size is large enough to represent a document's vocabulary with few bucket collisions, yet small enough for in-memory dot-product similarity searches across millions of vectors. The dimension count can be raised to 1024 or 2048 at the cost of additional CPU and memory.

2. Stop Words

C#
private static readonly HashSet<string> StopWords =
    new(StopAnalyzer.ENGLISH_STOP_WORDS_SET, StringComparer.OrdinalIgnoreCase);

Stop words are common high-frequency words like “and”, “the”, “is”, and “while”. They carry little or no semantic information, but can heavily influence the vectorized output if they are not filtered. Instead of hardcoding a list, the code above uses the Lucene.NET NuGet package, which ships a predefined set, StopAnalyzer.ENGLISH_STOP_WORDS_SET, that is well curated and validated. The set is wrapped in a HashSet<string> with OrdinalIgnoreCase, which provides fast case-insensitive lookups without any extra allocation at query time.

3. Text Cleaning: Tokenization

C#
private static Dictionary<string, int> Tokenize(string text)
{
    var freq = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    var tokens = text
        .ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\n', '\r', ',', '.', '!', '?', ';', ':', '"', '\'', '(', ')', '[', ']', '{', '}', '-', '_', '/', '\\' },
               StringSplitOptions.RemoveEmptyEntries)
        .Where(t => t.Length > 2 && !StopWords.Contains(t));

    foreach (var token in tokens)
        freq[token] = freq.GetValueOrDefault(token) + 1;

    return freq;
}

Tokenization is the first step of text cleaning, and every word goes through it. It does three things:

- Lowercasing: all words are converted to lower case, so “System” and “system” are treated as the same term.
- Splitting on delimiters and punctuation: each delimiter or punctuation character is treated as a word boundary. “top-of-the-line” becomes [“top”, “line”] after splitting and removing short tokens and stop words.
- Filtering: tokens shorter than 3 characters are dropped, along with stop words.

Tokenization produces a term-frequency map like { "compute": 2, "learn": 3, "embedding": 1, … }.

4. Hashing Trick/Feature Hashing

The core challenge here is the size of real-world vocabularies. With millions of distinct terms, allocating a separate vector dimension per term is impractical. The hashing trick solves this problem by mapping tokens directly into a bounded index range via a hash function, which eliminates the need to store a vocabulary.

C#
private static int StableBucket(string token, int size)
{
    unchecked
    {
        uint hash = 2166136261u;      // FNV offset basis
        foreach (char c in token)
        {
            hash ^= (byte)c;          // low 8 bits of the UTF-16 code unit
            hash *= 16777619u;        // FNV prime
        }
        return (int)(hash % (uint)size);
    }
}

Here the FNV-1a (Fowler–Noll–Vo) hash function is used. It is a lightweight, non-cryptographic hash that works well for short strings and distributes bits well. It uses two canonical constants.
- FNV offset basis: decimal 2166136261, hex 0x811C9DC5
- FNV prime: decimal 16777619, hex 0x01000193

Each character is processed by XOR-ing the current hash with the character’s byte value, and the result is then multiplied by the FNV prime. This XOR-then-multiply order lets each input byte affect the whole 32-bit state, which improves avalanche behavior for short tokens like English words.

.NET’s string.GetHashCode() is not suitable here because on modern .NET it is randomized per process as a defense against hash-flooding attacks. StableBucket must return the same bucket index for a given token on every run, so the hash has to be deterministic. The unchecked block disables overflow checking, so the 32-bit multiplication wraps around modulo 2^32, which is the behavior FNV requires.

5. Log-Based TF Normalization

C#
float weight = MathF.Log(1f + count);

Term frequency does not scale linearly with semantic importance. For example, a term that appears 10 times in a document is not 10 times more important than a term that appears once. Applying log(1 + count) compresses the raw frequency: a count of 1 maps to about 0.69 and a count of 10 to about 2.40. The table below shows how this log-based frequency works. This ensures that no single repeated term disproportionately shapes the embedding, the same reasoning behind sublinear term-frequency scaling in traditional TF-IDF retrieval systems.

6. Trigram Features for Morphology Capture

C#
if (token.Length >= 4)
{
    for (int i = 0; i <= token.Length - 3; i++)
    {
        string trigram = token[i..(i + 3)];
        int trigramBucket = StableBucket(trigram, Dimensions);
        vector[trigramBucket] += weight * 0.5f;
    }
}

Whole-word hashing treats related terms like “play”, “player”, and “playing” as separate features that land in different buckets. Trigrams reconnect them and smooth out these gaps. Here are the trigrams for “playing” and “player”.

C#
playing - pla, lay, ayi, yin, ing
player - pla, lay, aye, yer

Here, common trigrams like pla and lay cause both terms to accumulate weight in some of the same hashed buckets, which pulls their vectors closer in embedding space. The half weight (0.5f) ensures that trigram features do not dominate the whole-word signal.

7. L2 Vector Normalization: Cosine Similarity via Direct Dot Products

C#
private static void NormalizeL2(float[] vector)
{
    float magnitude = 0f;
    foreach (float v in vector)
        magnitude += v * v;
    magnitude = MathF.Sqrt(magnitude);

    if (magnitude > 0f)
        for (int i = 0; i < vector.Length; i++)
            vector[i] /= magnitude;
}

Once all token and trigram weights have been applied, the resulting vector is normalized so that its Euclidean length equals 1. This normalization enables a key mathematical identity:

C#
cosine_similarity(a, b) = a · b when ‖a‖ = ‖b‖ = 1

When vectors are already L2-normalized, cosine similarity reduces to a raw dot product, eliminating the need for any division at query time.

8. Utility: GetTopTokenWeights

C#
public Dictionary<string, float> GetTopTokenWeights(string text, int topN = 10)
{
    var tokenFreq = Tokenize(text);
    return tokenFreq
        .Select(kv => new { Token = kv.Key, Weight = MathF.Log(1f + kv.Value) })
        .OrderByDescending(x => x.Weight)
        .Take(topN)
        .ToDictionary(x => x.Token, x => x.Weight);
}

This diagnostic method highlights the tokens that contributed most to the final embedding. It provides insight into why two documents achieve a high similarity score and confirms that stop word removal and tokenization are working as expected.

Limitations and Production Enhancements

This service is fully deterministic, fast, and requires zero supporting infrastructure. It performs well for vocabulary-driven similarity, that is, cases where documents share the same vocabulary. It does not encode semantic relationships. For example, “car” and “sedan” hash to unrelated buckets, so the service will not score them as similar. For production-grade semantic search, LocalEmbeddingService can be replaced with either OpenAI or a local ONNX sentence transformer.
Because both implementations share the IEmbeddingService interface, no code change is required in other components such as API controllers, the vector index, or retrieval logic.

Project repository: TextEmbeddingService
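The steps in sections 3 through 7 can be seen working together in a compact sketch. The following is a Python port for illustration only, not the article's C# service: the tiny stop-word list stands in for Lucene.NET's set, and adding the full log weight to the whole-word bucket is an assumption, since the article does not show that line.

```python
import math
from collections import Counter

DIMENSIONS = 512
# Illustrative stand-in for Lucene.NET's English stop-word set (assumption).
STOP_WORDS = {"and", "the", "is", "for", "with", "that", "this", "are"}
DELIMITERS = " \t\n\r,.!?;:\"'()[]{}-_/\\"


def stable_bucket(token: str, size: int) -> int:
    """32-bit FNV-1a over the low byte of each character, reduced modulo size."""
    h = 2166136261  # FNV offset basis
    for ch in token:
        h ^= ord(ch) & 0xFF
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return h % size


def tokenize(text: str) -> Counter:
    """Lowercase, split on delimiters, drop short tokens and stop words."""
    table = str.maketrans({d: " " for d in DELIMITERS})
    words = text.lower().translate(table).split()
    return Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)


def embed(text: str, dims: int = DIMENSIONS) -> list[float]:
    vector = [0.0] * dims
    for token, count in tokenize(text).items():
        weight = math.log(1.0 + count)  # log-based TF
        # Assumption: the whole-word feature receives the full weight.
        vector[stable_bucket(token, dims)] += weight
        if len(token) >= 4:
            for i in range(len(token) - 2):  # character trigrams at half weight
                vector[stable_bucket(token[i:i + 3], dims)] += weight * 0.5
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


print(cosine(embed("playing"), embed("player")) > 0)  # shared trigrams overlap
```

Because the vectors are normalized inside embed, cosine is a plain dot product, which mirrors the identity in section 7.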

By Mangesh Walimbe

Beyond Static Thresholds: Building Self-Healing Systems via Context-Aware Control Loops

Abstract Modern distributed systems rarely fail in isolation — they degrade across multiple execution steps. This article presents a control-loop-based architecture for building self-healing systems that detect anomalies early, precisely isolate failures, and automatically recover using context-aware decisions. Introduction Modern distributed systems are large-scale platforms built on service-oriented architecture. In such systems, an individual request — the unit of execution — typically flows through multiple services, including clients (request initiators), orchestrators, enrichment layers, validation or policy-evaluation systems, routing layers, downstream dependencies, state management systems, reconciliation processes, and notification systems. Each service in this chain introduces latency, retries, dependencies, and failure modes. Because of this, failures in distributed systems rarely appear as clean, isolated events. Instead, they emerge as a sequence of interacting issues that create a cascading effect across the system. For example, a downstream dependency may become slow in a specific region. This increases retries, which in turn increases queue depth. The growing queue depth puts pressure on the orchestrator, eventually causing it to fail unrelated requests due to resource saturation. What initially was a local dependency problem rapidly turned into a widespread degradation of workflow. This problem is particularly difficult in asynchronous systems, where failures are not always instantly visible. A request may not fail instantly — it may remain pending, miss its expected execution window, be delayed in execution, get stuck in an intermediate state, or lose coordination between system components. When the operator detects the issue, the impact could already be large enough. However, traditional protection mechanisms such as fixed failure thresholds, static alerts, and global circuit breakers are often too coarse-grained for these scenarios. 
A localized dependency failure should not halt the entire system. At the same time, localized issues must not be allowed to trigger retry storms or cascade into otherwise healthy execution paths.

The goal, therefore, is to build a self-healing control system that can detect anomalies at the level of individual requests, aggregate signals across execution and system dimensions, isolate only the affected scope, and recover gradually based on real-time evidence.

This post presents such a system. It is designed to provide predictive anomaly detection, hierarchical aggregation, scoped and global kill switches, adaptive leaky-bucket flow control, observability, and AI-assisted investigation and escalation.

| Feature | Static thresholds (old way) | Context-aware loops (new way) |
| --- | --- | --- |
| Detection | Static thresholding | Predictive anomaly detection |
| Containment | Global | Scoped |
| Control | Binary shutdown | Adaptive flow control |
| Recovery | Manual | Evidence-based self-healing |

Why Traditional Systems With Static Thresholds Won’t Work

Most distributed systems rely on mechanisms like retries, dead-letter queues, alerts, and circuit breakers. These are useful but not enough for complex async workflows, because they depend on static thresholds, which are context-blind by nature. A rule like “trigger an alert when failures exceed X%” cannot distinguish between fundamentally different types of failures:

- Logical failures, where a request completes but produces an incorrect result due to issues in input, configuration, or application logic
- Execution failures, where a request produces no result due to delays, retries, or loss of coordination across system components

For example, in an AI inference system, a request may return an incorrect response due to model configuration issues (logical failure), or it may be accepted but never complete due to stalled execution in downstream components (execution failure). Static thresholds treat both cases uniformly, even though they require very different responses.
As a result, systems either overreact to expected failures or miss critical anomalies such as stuck or silently failing requests.

Failure volume alone is also a weak signal. A small number of failures can be highly significant if those requests were expected to succeed. For instance, if requests following the same execution path have historically been highly reliable, even a few failures in that cohort can indicate a serious issue.

Static thresholds also lack scope awareness. A local failure, for example in requests routed through a particular execution path, dependency, or region, should not cause a global shutdown. Conversely, a pattern of small anomalies across different paths, regions, or request classes can indicate a larger systemic problem, even if no single threshold is crossed. For instance, in an inference system, requests served by a specific model variant may see increased latency or degraded outputs after recent changes to configuration or parameters, while other models and request paths continue to function normally.

These limitations are amplified in asynchronous systems, where failures are not always explicit. Coordination gaps can cause requests to get stuck, be delayed, be retried multiple times, or enter inconsistent states. This leads to higher latency, missed completion signals, or repeated retries with no progress.

These weaknesses become even more apparent during recovery. AI agents or operators have to manually inspect logs and dashboards to decide when to resume traffic, which makes recovery slow, inconsistent, and reactive.

In summary, static thresholding is not sufficient for modern distributed systems. What is needed is a system that understands request context, expected behavior, and the scope of the anomaly.
This leads to a fundamental shift in system design: Static thresholding → Predictive anomaly detection Global containment → Scoped containment Binary shutdown → Adaptive flow control Manual recovery → Evidence-based self-healing Instead of asking: Are requests failing? The system should ask: Are requests behaving as expected within their defined SLA, given their execution context and expected outcomes? System Architecture as a Control Loop The system functions as a control loop during request execution. It does not replace the execution path. Instead, it constantly monitors the system's behavior, predicts expected outcomes, identifies deviations, and makes control decisions based on real-time signals. Orchestrated Execution With Continuous Monitoring A primary orchestrator drives the system. It executes each request through a series of steps. At each step, the orchestrator calls on one or more downstream systems, either synchronously or asynchronously. These downstream systems may have their own dependencies. As the request moves forward, it carries contextual metadata like tenant class, region, request type, execution path, and routing decisions. This context defines how the request should behave at each step or at a specific point. While the orchestrator manages execution, anomaly detection serves as a continuous control layer throughout these steps. It tracks the outcome of each phase to ensure that the request moves forward as expected and that the contextual integrity remains intact. Context Preservation and Signal Collection At every step, the system captures signals such as latency, retries, routing decisions, execution status, and downstream responses. It also augments the request with derived attributes such as execution path identifiers and historical behavior patterns. 
This ensures that each request is evaluated relative to similar cohorts, and more importantly, allows the system to identify where deviations occur within the execution flow — not just whether the request ultimately fails. Success Prediction Engine Intuition: The system learns what 'normal' looks like for similar requests and uses that to estimate expected outcomes. The system estimates how likely a request is to succeed based on its context and historical behavior. For each request i, the expected success is computed as: Plain Text P_i = P(success | x_i) Where: x_i = request features (context, routing path, system state) P_i = expected probability of success This establishes what should happen at different stages of execution, allowing the system to detect deviations between expected and actual outcomes throughout the request lifecycle. Step-Level Anomaly Detection Unlike traditional systems that evaluate only final success or failure, this system continuously monitors each critical step of execution. A request may: Be accepted but delayed Be routed to an unexpected path Experience retries at a specific step Produce degraded output Fail to progress beyond a step By evaluating these signals against expected behavior for that request’s context, the system can detect anomalies early and pinpoint the exact step where deviation occurs. Inference Example (Grounding) For example, in an inference system, the orchestrator can direct a request from a certain tenant class to a summarization model in a certain subnet of a region. If that subnet/region experiences network latency, requests may still be accepted and processed, but exhibit higher latency or delayed responses. In this case, the orchestrator continues execution, but a specific step — model execution in that region — is deviating from expected behavior. Other models or regions may continue to function normally. 
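The success prediction engine above leaves the model behind P(success | x_i) unspecified. As a minimal sketch, assuming the simplest possible estimator, the historical success rate of the request's cohort with additive smoothing can serve as P_i. The RequestContext fields, the class name, and the smoothing priors are hypothetical choices for illustration, not part of the article's design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical cohort key built from the request's contextual metadata."""
    tenant_class: str
    region: str
    request_type: str
    execution_path: str


class CohortSuccessEstimator:
    """Estimates P(success | x_i) from per-cohort history with additive smoothing."""

    def __init__(self, prior_successes: float = 1.0, prior_failures: float = 1.0):
        # Assumed priors: an unseen cohort starts at P = 0.5.
        self.prior_successes = prior_successes
        self.prior_failures = prior_failures
        self._successes = defaultdict(int)
        self._totals = defaultdict(int)

    def record(self, ctx: RequestContext, succeeded: bool) -> None:
        self._totals[ctx] += 1
        if succeeded:
            self._successes[ctx] += 1

    def predict(self, ctx: RequestContext) -> float:
        successes = self._successes.get(ctx, 0)
        total = self._totals.get(ctx, 0)
        return (successes + self.prior_successes) / (
            total + self.prior_successes + self.prior_failures
        )


estimator = CohortSuccessEstimator()
ctx = RequestContext("A", "us_west", "summarize", "summarizer_model")
for i in range(100):
    estimator.record(ctx, succeeded=(i != 0))  # 99 successes, 1 failure
print(round(estimator.predict(ctx), 3))  # 0.98
```

A production system would more likely use a trained classifier over richer features; the point here is only that P_i is computed per context, so the same failure can be expected in one cohort and surprising in another.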
Hierarchical Roll-up Counters

The hierarchical roll-up model aggregates anomalies across multiple contextual dimensions. When a request deviates from expected behavior at any step, the system updates counters across relevant dimensions such as dependency, execution path, tenant class, and region.

Example roll-ups:

Plain Text
(dependency, request_type)
(dependency, request_type, tenant_class)
(dependency, region)
(execution_path, request_type)
(global)

A single anomalous request may update multiple roll-ups simultaneously. For example, a request routed to a summarization model in a latency-affected region may update:

Plain Text
(summarizer_model, tenant_class_A, region_us_west)
(summarizer_model, region_us_west)
(summarizer_model, tenant_class_A)
(global)

This multi-dimensional view allows the system to isolate issues precisely while still capturing broader systemic patterns.

Roll-Up Configuration Model

Each roll-up is independently configurable, allowing the system to adapt thresholds and behavior based on the criticality of different execution paths and request classes.

Example configuration:

JSON
{
  "rollup_id": "dependency_request_type_region",
  "dimensions": ["dependency", "request_type", "region"],
  "threshold": 25,
  "tumbling_window": "30m",
  "parent_rollup_ids": [
    "dependency_region",
    "dependency_request_type",
    "dependency",
    "global"
  ],
  "control_action": "HOLD_AND_PROBE"
}

Key Fields

- dimensions → define how the rollup key is constructed
- threshold → anomaly count required to trigger
- tumbling_window → fixed evaluation window (e.g., 30 minutes)
- parent_rollup_ids → defines relationships across rollups
- control_action → action applied when this rollup becomes the resolved scope

Hierarchical Rollup Model (DAG)

The hierarchy is modeled as a directed acyclic graph (DAG). This allows a granular rollup to contribute to multiple parent views.
For example:

Plain Text
(dependency=D1, request_type=TYPE_A, region=EU)
  → (dependency=D1, region=EU)
  → (dependency=D1, request_type=TYPE_A)
  → (dependency=D1)
  → (global)

A single anomalous request may update multiple rollups simultaneously, including both child and parent scopes.

Rollup Runtime State

At runtime, each rollup key maintains its own state within a tumbling window:

Plain Text
Rollup: (dependency, region)
Key: D1:EU
Window: 30 mins
Anomaly Count: 35
Threshold: 25
→ FIRED

Each rollup evaluates independently:

- A child rollup may fire without the parent firing
- A parent rollup may fire when anomalies are distributed across multiple children

Parent Roll-up Escalation Guard

Since parent roll-ups aggregate signals, the system must prevent escalation caused by a single noisy child. Instead of maintaining a full child-level state, each parent tracks lightweight signals:

- parent_anomaly_count
- impacted_child_count
- max_child_contribution_ratio

A parent roll-up is considered impacted only when:

Plain Text
parent_anomaly_count >= parent_threshold
AND impacted_child_count >= min_required_children
AND max_child_contribution_ratio <= max_allowed_ratio

Example: Do not escalate at the parent level if only request type TYPE_A is failing.

Plain Text
TYPE_A = 100 anomalies
TYPE_B = 0
TYPE_C = 0
Parent count = 100
Impacted children = 1
→ Keep control at child level

Example: Escalate.

Plain Text
TYPE_A = 40
TYPE_B = 35
TYPE_C = 25
Parent count = 100
Impacted children = 3
→ Escalate to parent scope

Why This Matters

This ensures:

- Localized issues remain scoped
- Distributed anomalies are escalated correctly
- Noisy signals do not trigger unnecessary global actions

Anomaly Detection Engine

The anomaly detection engine identifies unexpected deviations by comparing predicted outcomes with actual results, and propagates these signals to rollup counters.
A request is marked anomalous only if it was expected to succeed but deviates from expected behavior:

Plain Text
Anomaly_i = 1 if P_i ≥ τ AND Y_i deviates from expected outcome

Where:
P_i = predicted success probability
Y_i = observed outcome (failure, delay, degraded output, etc.)
τ = probability threshold at or above which a request is expected to succeed

Each anomalous request updates multiple rollups across dimensions such as dependency, region, request type, and tenant class.

The system evaluates all rollups that breach their thresholds and resolves the appropriate control scope. It then:

- Deduplicates overlapping signals
- Selects the highest meaningful level in the hierarchy
- Avoids redundant or conflicting controls

This ensures:

- Localized issues remain scoped
- Correlated anomalies are elevated appropriately
- Duplicate control actions are avoided

Kill Switch Controller

The kill switch controller enforces control actions at the resolved anomaly scope. Based on severity and scope, it determines whether to:

- Stop new incoming requests within the scope
- Hold in-progress requests before critical downstream steps
- Allow controlled traffic via throttling or probing

Control Actions

Plain Text
ALLOW → continue processing
HOLD → pause new and in-progress requests
THROTTLE → limit request rate
PROBE → allow controlled traffic
REROUTE → send via alternate path
ESCALATE → trigger alerts / human intervention

The controller applies actions consistently across the resolved scope, ensuring full containment without partial or conflicting behavior.

Adaptive Recovery Strategy

Once a control action is applied, the system does not immediately resume normal traffic. Instead, it gradually reintroduces traffic using a probing strategy.
For example:

Plain Text
Step 1: allow 1 request
Step 2: if successful (actual outcome == predicted outcome), allow 2
Step 3: if stable, allow 5
Step 4: gradually increase
Step 5: if failures reappear, reduce or stop

Recovery is guided by:

Plain Text
Recovery_G = Successful_G / Released_G

Where:
G = impacted roll-up scope
Released_G = probe requests released into scope G
Successful_G = released requests whose actual outcome matched the predicted outcome

This ensures:

- Safe and gradual recovery
- Avoidance of sudden failure spikes
- Validation of real system behavior

Observability and Audit Layer

The system captures all signals across execution:

- Predicted outcome
- Actual outcome
- Anomaly classification
- Impacted rollups
- Resolved scope
- Control action
- Recovery state

These signals provide visibility into:

- Anomaly trends
- Active control scopes
- Held vs released requests
- Recovery progress

This ensures full transparency, debuggability, and auditability.

AI Control Plane

The AI control plane operates outside the execution path and complements deterministic control logic.

It consumes:

- Anomaly signals
- Roll-ups
- Deployment changes
- System health
- Control decisions

It performs:

- Investigation → correlates anomalies with systems or changes
- Automated remediation → triggers safe rollback
- Escalation → notifies relevant teams
- Summarization → generates incident insights

Key Separation

Plain Text
Decision Plane → deterministic (prediction, anomaly detection, control)
AI Control Plane → intelligent (analysis, remediation, escalation)

Conclusion

Modern distributed systems cannot rely on static thresholds and reactive controls. Failures are often contextual, asynchronous, and distributed across multiple execution paths.

This architecture introduces a fundamental shift:

- From failure counting → context-aware detection
- From global shutdown → scoped containment
- From reactive response → adaptive, evidence-based recovery

By combining prediction, hierarchical rollups, scoped control, and adaptive recovery, the system can precisely isolate deviations, minimize impact, and restore stability safely.
The core idea is simple but powerful: systems should not just detect failures; they should continuously understand system behavior, localize deviations in context, and adapt in real time to maintain reliability.

What’s Next: From Architecture to Code

Designing the architecture is only the first step. In the next post, we move from the blueprint to the technical implementation, diving deep into:

- The State Machine: Managing high-cardinality counters without adding latency to, or otherwise affecting, the execution path.
- The Escalation Guard: Pseudo-code to prevent "noisy neighbor" failures.
- Adaptive Recovery: The logarithmic logic for safe traffic re-introduction.

Stay tuned for the implementation deep-dive.

Case Study: Applying the Control Loop to a Multi-Region Inference System

End-to-end Example: Inference system with scoped control and adaptive recovery

This example illustrates how anomalies propagate, how scope is resolved, and how control and recovery are applied in an inference system.

Step 1: Incoming Requests

Requests are routed by the orchestrator to model services in the DUB region:

Plain Text
(model=summarizer_v2, tenant_class=A, region=DUB)
(model=translator_v1, tenant_class=A, region=DUB)
(model=qa_model_v3, tenant_class=A, region=DUB)

Predicted success: $P_i \approx 0.95$ or higher

Step 2: Deviations → Anomalies

Due to network degradation in DUB, requests begin to show:

- increased latency
- delayed responses
- occasional degraded outputs

$Y_i \text{ deviates and } P_i \geq \tau \Rightarrow \text{Anomaly}_i = 1$
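The rule in Step 2 can be sketched in a few lines. This is an illustration under assumptions: the Outcome categories are taken from the deviation types the article lists, while the enum itself, the function name, and the default threshold of 0.9 are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Observed outcome Y_i; any value other than SUCCESS counts as a deviation."""
    SUCCESS = "success"
    FAILURE = "failure"
    DELAYED = "delayed"
    DEGRADED = "degraded"


def is_anomaly(predicted_success: float, observed: Outcome, tau: float = 0.9) -> bool:
    """Anomaly_i = 1 only when the request was expected to succeed (P_i >= tau)
    and the observed outcome deviates from success."""
    expected_to_succeed = predicted_success >= tau
    deviated = observed is not Outcome.SUCCESS
    return expected_to_succeed and deviated


print(is_anomaly(0.95, Outcome.DELAYED))  # True: a reliable cohort deviated
print(is_anomaly(0.40, Outcome.FAILURE))  # False: failure was not unexpected
print(is_anomaly(0.95, Outcome.SUCCESS))  # False: behaved as predicted
```

The second case is what separates this rule from plain failure counting: a failure in a cohort that usually fails does not add to the roll-up counters.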
Step 3: Roll-up Updates

Each anomalous request updates multiple rollups:

Plain Text
(summarizer_v2, tenant=A, DUB) → 40
(translator_v1, tenant=A, DUB) → 35
(qa_model_v3, tenant=A, DUB) → 25
(region=DUB) → 100

Step 4: Parent Escalation Guard

Plain Text
parent_count = 100
impacted_child_count = 3
max_child_ratio = 40%

Since anomalies are distributed across multiple models, not concentrated in one:

Plain Text
→ Escalate to (region=DUB)

Step 5: Impact Resolution

Fired roll-ups:

Plain Text
(summarizer_v2, tenant=A, DUB)
(translator_v1, tenant=A, DUB)
(qa_model_v3, tenant=A, DUB)
(region=DUB)

Resolved scope:

Plain Text
(region=DUB)

Child rollups are de-duplicated and consolidated under the parent scope.

Step 6: Control (Scoped Isolation + Reroute + Local Probing)

Action:

Plain Text
HOLD_AND_PROBE + REROUTE

Effect:

- Throttle or hold most requests routed to DUB
- Reroute the majority of traffic to FRA, but only after verifying that FRA has sufficient available capacity and is operating within stable limits
- Allow a small number of low-impact requests to continue via DUB as probes

These probe requests validate whether the issue is transient or persistent without exposing the system to large-scale risk.
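The parent escalation guard from Step 4 can be expressed as a small pure function. This is a sketch under assumptions, not the article's implementation: the function name, the definition of an impacted child (at least min_child_anomalies anomalies), and the threshold values in the usage lines are hypothetical. Only the three-part condition and the anomaly counts come from the text.

```python
def should_escalate(
    child_counts: dict[str, int],
    parent_threshold: int,
    min_required_children: int,
    max_allowed_ratio: float,
    min_child_anomalies: int = 1,  # assumed definition of an impacted child
) -> bool:
    """Escalate to the parent scope only when anomalies are both numerous
    and spread across several children."""
    parent_count = sum(child_counts.values())
    if parent_count == 0 or parent_count < parent_threshold:
        return False
    impacted_children = sum(
        1 for count in child_counts.values() if count >= min_child_anomalies
    )
    max_child_ratio = max(child_counts.values()) / parent_count
    return (
        impacted_children >= min_required_children
        and max_child_ratio <= max_allowed_ratio
    )


# Hypothetical guard settings for the DUB region roll-up.
guard = dict(parent_threshold=50, min_required_children=2, max_allowed_ratio=0.6)

distributed = {"summarizer_v2": 40, "translator_v1": 35, "qa_model_v3": 25}
concentrated = {"TYPE_A": 100, "TYPE_B": 0, "TYPE_C": 0}

print(should_escalate(distributed, **guard))   # True: escalate to (region=DUB)
print(should_escalate(concentrated, **guard))  # False: keep control at child level
```

The ratio check is what stops one noisy child from escalating on its own, even when its count alone exceeds the parent threshold.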
Step 7: Adaptive Recovery

Traffic is managed dynamically:

Plain Text
DUB (probe path): 1 → 2 → 5 → gradual increase
FRA (rerouted path): handles majority of traffic

Recovery signal:

$\text{Recovery}_G = \frac{\text{Successful}_G}{\text{Released}_G}$

- If probe requests via DUB succeed → gradually restore DUB traffic
- If failures persist → continue routing to FRA and reduce DUB probes

Step 8: AI Control Plane

Based on observed signals:

- Regional network issue → continue routing to FRA
- Model deployment issue → rollback model version
- Infrastructure saturation → rebalance across regions
- Transient degradation → generate summary without escalation

Key Takeaways

- Failures are localized but distributed across models
- Control is applied at the correct scope (region-level)
- System avoids global shutdown
- Recovery is validated through controlled probing
- Traffic is dynamically rerouted and restored

The system does not simply stop traffic. It isolates the impacted scope, reroutes intelligently, and verifies recovery through controlled probing before restoring normal behavior.
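The probing schedule in Step 7 can be sketched as a function that decides how many requests to release in the next round. Everything beyond the 1 → 2 → 5 sequence and the Recovery_G ratio is an assumption made for illustration: the function names, doubling after the fixed schedule, halving on failure, the required ratio of 1.0, and the ceiling.

```python
PROBE_SCHEDULE = (1, 2, 5)  # initial steps taken from the case study


def recovery_ratio(successful: int, released: int) -> float:
    """Recovery_G = Successful_G / Released_G (0.0 when nothing was released)."""
    return successful / released if released else 0.0


def next_allowance(
    current: int,
    successful: int,
    released: int,
    min_ratio: float = 1.0,  # assumed: every probe must behave as predicted
    growth_factor: int = 2,  # assumed growth after the fixed schedule
    ceiling: int = 100,      # assumed cap before normal traffic resumes
) -> int:
    """Return how many requests to release into the scope in the next round."""
    if released == 0:
        return current  # no evidence yet, keep the current allowance
    if recovery_ratio(successful, released) < min_ratio:
        return current // 2  # failures reappeared: reduce, reaching 0 stops probing
    for step in PROBE_SCHEDULE:
        if step > current:
            return step
    return min(current * growth_factor, ceiling)


allowance = 1
for successful, released in [(1, 1), (2, 2), (5, 5), (6, 10)]:
    allowance = next_allowance(allowance, successful, released)
    print(allowance)  # 2, then 5, then 10, then 5 after the failed round
```

Keeping the decision a pure function of the current allowance and the observed ratio makes each recovery step easy to log in the audit layer and to replay.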

By Darshan Botadra