Every React developer reaches a point where the sheer volume of boilerplate starts to slow them down. Prop drilling, repetitive hook patterns, component scaffolding, unit test setup: the cognitive overhead adds up fast, especially at enterprise scale. When GitHub Copilot entered my workflow, I expected a productivity boost. What I didn't expect was how much I'd have to think about using it correctly.

After integrating AI-assisted development into a React 18 codebase (spanning custom hooks, context-based state management, and accessibility-driven UI), I came away with a clear picture of where AI genuinely accelerates the work, where it quietly introduces risk, and what guardrails every team needs before they ship AI-assisted code to production.

This isn't a tutorial on setting up Copilot. It's an honest account of what changed in my day-to-day React workflow, and how I rebuilt my development process around the strengths of AI without surrendering architectural judgment.

Where AI Actually Accelerates React Development

1. Component Scaffolding

The most immediate win was generating boilerplate-heavy component shells. React functional components follow a predictable structure: imports, props interface, state declarations, effect hooks, render return. Copilot autocompletes this structure accurately and fast, especially when your file already has consistent patterns.

For example, starting a new form component with a comment like:

```text
// Controlled form component with validation and submit handler
```

... triggers a usable scaffold within seconds. In a codebase with 50+ form components, this adds up to meaningful time savings.

2. TypeScript Prop Typing

One of the most tedious parts of React 18 development is defining interface types for component props, especially for components consuming API response shapes. Copilot handles this well when the API shape is already defined elsewhere in the file or project.
It infers prop types from usage context and generates clean interfaces without much guidance.

3. Unit Test Generation

Copilot shines at generating @testing-library/react test cases for presentational components. Given a component file, it can suggest:

- Render tests
- User interaction tests (click, input change)
- Accessibility checks using getByRole

This reduced the time I spent on repetitive test scaffolding by roughly 40% for simple components.

4. Repetitive Hook Patterns

Standard hooks like useEffect with cleanup, useCallback with dependency arrays, and useMemo for expensive computations follow well-known patterns. Copilot autocompletes these reliably, and the suggestions are often correct on the first try when the surrounding context is clear.

Where AI Fails React Developers (and Why It Matters)

This is the part most AI-workflow articles skip. In my experience, Copilot introduced subtle issues in three specific areas:

1. State Management Architecture

Copilot is pattern-matching, not reasoning. When I was designing a context-based global state solution for a multi-step form flow, Copilot consistently suggested patterns that worked for isolated examples but didn't scale: it created redundant useContext calls across components that should have been wrapped in a provider, and it failed to account for re-render performance implications.

The lesson: Never accept AI suggestions for state architecture without reviewing the component tree. AI optimizes locally; architecture requires global thinking.

2. Custom Hook Dependency Arrays

Incorrect dependency arrays in useEffect and useCallback are a well-known React footgun. Copilot's suggestions here were hit-or-miss. It occasionally omitted dependencies that needed to be included, and at other times added values that did not belong and triggered unnecessary re-renders.

I started treating all AI-generated dependency arrays as drafts that required manual review against the ESLint react-hooks/exhaustive-deps rule. This step is non-negotiable.

3. Accessibility in JSX

This one is subtle. Copilot generates functional JSX, but accessible JSX requires deliberate attention to ARIA roles, focus management, and semantic HTML. AI-generated components often defaulted to div-heavy markup without the aria-* attributes or keyboard event handlers that production apps require.

For any component touching user interaction (modals, dropdowns, form controls), I reviewed AI-generated output against WCAG 2.1 AA standards before committing.

My Rebuilt Workflow: A Practical Stack

After months of iteration, here's the workflow that works:

Phase 1: Design First, Prompt Second

Before I open a new file, I sketch the component's responsibilities on paper or in a comment block:

```javascript
/**
 * UserProfileCard
 * - Displays user avatar, name, role
 * - Supports edit mode toggle
 * - Emits onSave callback with updated values
 * - Must be keyboard accessible
 */
```

This comment becomes the Copilot context. The more specific the intent, the better the scaffold.

Phase 2: Accept Scaffolding, Write Logic

I accept Copilot suggestions for:

- Component shell
- Prop interface
- State variable declarations
- JSX structure for simple layouts

I write manually:

- useEffect logic and cleanup
- Event handler implementations
- Context provider design
- Error boundaries
- Any business logic touching API data

Phase 3: Review AI-Generated Tests

Copilot generates test scaffolding well.
I review every generated test for:

- Correct use of userEvent vs fireEvent
- Accurate assertions (not just "it rendered")
- Missing edge cases (empty state, error state, loading state)

Phase 4: Accessibility Audit Pass

Every component gets a final pass against:

- Semantic HTML element usage
- aria-label / aria-describedby for interactive elements
- Keyboard navigation (tab order, focus trap for modals)
- Color contrast (handled at design system level, not component level)

A Real Before-and-After Example

Before (pre-AI workflow): A controlled input component with validation took roughly 25–30 minutes to scaffold, type, test, and review.

After (AI-augmented workflow): The same component takes 10–12 minutes, with Copilot handling the initial scaffold and test shell, and me handling the validation logic, hook dependencies, and accessibility pass.

Here's a simplified example of the kind of component where AI delivers the most value:

```typescript
import React from "react";

interface SearchInputProps {
  value: string;
  onChange: (value: string) => void;
  onSubmit: () => void;
  placeholder?: string;
  isLoading?: boolean;
}

const SearchInput: React.FC<SearchInputProps> = ({
  value,
  onChange,
  onSubmit,
  placeholder = "Search...",
  isLoading = false,
}) => {
  // Submit on Enter so the input works without reaching for the button
  const handleKeyDown = (e: React.KeyboardEvent<HTMLInputElement>) => {
    if (e.key === "Enter") onSubmit();
  };

  return (
    <div role="search">
      <input
        type="search"
        value={value}
        onChange={(e) => onChange(e.target.value)}
        onKeyDown={handleKeyDown}
        placeholder={placeholder}
        aria-label="Search"
        disabled={isLoading}
      />
      <button onClick={onSubmit} disabled={isLoading} aria-label="Submit search">
        {isLoading ? "Searching..." : "Search"}
      </button>
    </div>
  );
};
```

The scaffold, prop interface, and JSX structure above were AI-generated in under 30 seconds. The aria-label attributes, role="search", and handleKeyDown implementation were my additions: things Copilot consistently missed in initial suggestions.
Where AI Hits a Wall: Large-Scale Enterprise React Projects

Small, isolated components are where AI shines. But real enterprise codebases are rarely small or isolated. Once you're working inside a large monorepo with hundreds of components, shared design systems, domain-specific business logic, and cross-team API contracts, AI-assisted development runs into a fundamental limitation: it only sees what's in its context window.

Here's where that breaks down in practice:

1. Cross-File Dependency Awareness

In a large React application, a single component may depend on a shared context provider defined four directories away, a utility hook maintained by a different team, and a TypeScript type exported from a core domain package. Copilot's autocomplete works within the file you're editing; it doesn't have a deep understanding of the full dependency graph.

The result: AI-generated code that compiles locally but breaks at integration because it assumes a prop shape, import path, or context value that doesn't match what actually exists in the broader system. I've seen this surface most often with shared form validation schemas and API response types that live outside the component's immediate file tree.

2. Institutional Knowledge and Business Logic

Enterprise React codebases carry years of intentional decisions that aren't documented anywhere in the code. They live in the heads of the team. Why is this particular component wrapped in a custom error boundary? Why does this dropdown use a local state copy instead of reading directly from context? Why is this API called twice?

Copilot has no way of knowing. When it generates code in these areas, it produces something that looks reasonable but violates the implicit contract the team has built over time. Catching these violations requires a senior developer who understands the why behind the existing patterns. AI cannot substitute for that.

3. Design System Consistency at Scale

Large teams typically maintain a shared component library, such as an internal fork of Material UI or a custom design system. AI tools don't know which internal components to reach for. Copilot frequently suggests raw HTML elements or third-party components where the project has established internal equivalents: a raw <button> where your design system provides <Button>, or a raw <input> where your library provides <TextInput>.

At scale, this creates design debt fast. Every AI-generated component that uses a raw HTML element instead of the design system equivalent is a component that diverges from your visual and behavioral standards, and accumulates technical debt that's expensive to audit later.

4. Performance Optimization in Complex Component Trees

React 18 introduced useDeferredValue, useTransition, and concurrent rendering features specifically to handle performance in large, deeply nested component trees. These are nuanced APIs: their correct usage depends on understanding the rendering priority of specific subtrees, which operations are expensive, and what the user experience should be during transitions.

Copilot-generated code in this area is almost always naive. It doesn't know that a particular list component renders 500+ items and needs virtualization. It doesn't know that a specific state update should be wrapped in startTransition to keep the UI responsive. Optimizing a large React application for performance remains deeply human work.

5. Multi-Team Merge Conflicts and Shared State

In enterprise projects with multiple teams contributing to the same React codebase, shared state management becomes politically and technically complex. Redux slices, Zustand stores, or React Query caches span team boundaries. AI tools can suggest changes to these shared structures without awareness of how other teams depend on them, leading to breakages that only surface in integration environments.
The practical takeaway: the larger and more interconnected the codebase, the more you need to treat AI as a localized assistant, not a system-aware collaborator. Use it to accelerate work on leaf-node components and isolated utilities. Treat any AI suggestion that touches shared state, cross-team APIs, or core infrastructure with the same scrutiny you'd give an external contributor who just joined the project.

If you're introducing AI-assisted development into a React team, here are the non-negotiables:

1. Never merge AI-generated code without lint and type checks passing. Run eslint, tsc --noEmit, and your test suite before treating any AI-generated file as complete.

2. Establish a "no AI for architecture" rule. Component tree design, context structure, routing decisions, and data fetching strategy should be human-driven. AI is a code accelerator, not an architect.

3. Code review AI-generated PRs with extra scrutiny. Reviewers should specifically look for: missing hook dependencies, over-broad useEffect triggers, missing accessibility attributes, and logic that "looks right" but doesn't account for edge cases.

4. Document what AI touched. Some teams are beginning to tag AI-assisted code in commit messages or comments. This creates accountability and helps reviewers calibrate their scrutiny.

5. Keep your feedback loop active. When Copilot generates something wrong, reject it explicitly rather than accepting and editing. This helps calibrate your own pattern recognition for what AI does and doesn't handle well.

What's Coming Next: Agentic React Workflows

The current state of AI in React development is assistive: it completes what you start. The next wave is agentic: AI agents that can take a design spec or Figma export, scaffold an entire component hierarchy, wire up state, and generate test coverage, with a human reviewing the output rather than writing it line by line.
Early tools like Cursor's Composer mode and the experimental GitHub Copilot Workspace are beginning to move in this direction. For React developers, the implication is a shift in the skill that matters most: from writing components quickly to reviewing and evaluating AI-generated component systems critically.

The developers who will thrive in this environment are those who deeply understand React's rendering model, state management tradeoffs, and accessibility requirements. Not because they're writing every line, but because they're the final judgment layer on what ships.

Conclusion

AI-augmented development isn't about replacing React expertise. It's about redirecting it. The hours saved on scaffolding and boilerplate are hours you can reinvest in architecture, performance, accessibility, and code quality.

The key insight from rebuilding my workflow around GitHub Copilot is this: AI is a force multiplier for what you already know well. If you understand React deeply, it makes you faster. If you're still learning React's mental model, it can quietly introduce patterns that seem right but aren't.

Used with clear guardrails and deliberate review habits, AI turns a good React developer into a significantly more productive one, without sacrificing the code quality that enterprise applications demand.
Modern applications deal with massive amounts of text: support tickets, CRM notes, blog posts, meeting transcripts, and internal documentation. The problem isn’t access to information anymore. It’s how quickly users can understand it.

In our CRM system, we allow publishing long-form articles to a blog. However, users rarely want to read everything up front. To solve this, we introduced AI-powered summarization to generate short, readable previews. This improves:

- Content scannability
- User engagement
- Time-to-information

This article covers:

- What text summarization is
- Why AI summarization is powerful
- Setting up OpenAI in Rails
- Implementing a summarization service
- Building a controller endpoint
- Handling long documents
- Background processing with Sidekiq
- Real-world use cases

What Is Text Summarization?

Text summarization is the process of condensing a large body of text into a shorter version while preserving its key information. There are two main approaches:

1. Extractive Summarization

This selects the most important sentences or phrases directly from the original text.

Example:

Original: Ruby on Rails is a powerful web framework designed to make programming easier by favoring convention over configuration.

Summary: Ruby on Rails is a powerful web framework designed to make programming easier.

2. Abstractive Summarization

This generates new sentences that capture the meaning of the text. This is where large language models, such as OpenAI's GPT models, shine.

Why Use LLMs for Summarization?

Traditional NLP methods struggle with context and nuance. OpenAI models can provide:

- Contextual understanding
- Multi-paragraph reasoning
- Domain adaptability
- Natural-sounding summaries

This makes them well suited to projects that handle many kinds of content:

- Blog posts
- Documents
- Meeting transcripts
- Customer feedback
- Knowledge bases

Setting Up OpenAI in a Rails Project

1. Install the OpenAI Ruby gem. Add it to your Gemfile:

```ruby
gem "ruby-openai"
```

2. Configure the API key.
Add your API key to environment variables:

```shell
export OPENAI_API_KEY="your_api_key"
```

3. Example initializer:

```ruby
OpenAI.configure do |config|
  config.access_token = ENV["OPENAI_API_KEY"]
end
```

Creating a Summarization Service

In Rails, the best practice is to encapsulate OpenAI logic in a separate service object. Simple example:

```ruby
module Openai
  class Summarizer
    def initialize(text)
      @text = text
      @client = OpenAI::Client.new
    end

    def call
      response = @client.chat(
        parameters: {
          model: "gpt-4.1-mini", # or select another model
          messages: [
            {
              role: "system",
              # A more detailed prompt can be defined here
              content: "You are a helpful assistant that summarizes text concisely."
            },
            {
              role: "user",
              # The expected response can also be described in more detail here
              content: "Summarize the following text:\n\n#{@text}"
            }
          ],
          temperature: 0.3
        }
      )

      response.dig("choices", 0, "message", "content")
    end
  end
end
```

Possible roles:

```text
Role        Purpose
system      instructions for the model
user        input from the user
assistant   previous AI responses
```

Usage:

```ruby
summary = Openai::Summarizer.new(article.content).call
```

Example: Before and After

Input (CRM article excerpt): Our platform allows teams to manage projects, track time, and generate reports across multiple departments...

Output (AI summary):

- Centralized platform for project and time tracking
- Supports multi-department workflows
- Provides reporting and analytics tools

This summary can be shown in:

- Blog preview cards
- Tooltips
- Search results

Example Controller Endpoint

Now we expose this functionality via an API endpoint. Controller example:

```ruby
class Api::SummariesController < ApplicationController
  def create
    text = params.require(:text) # responds with 400 if the text parameter is missing
    summary = Openai::Summarizer.new(text).call

    render json: { summary: summary }
  end
end
```

In the service above, temperature controls randomness in the output.

```text
0.0 → deterministic
1.0 → very creative
```

Additional Useful Parameters

A few other parameters can also improve summarization.

1. max_tokens limits the size of the generated response. Example:

```ruby
max_tokens: 200 # Prevents extremely long outputs
```

2. top_p is an alternative randomness control. Example:

```ruby
top_p: 0.9 # Usually you adjust temperature or top_p, not both
```

3. frequency_penalty discourages repeated phrases. Example:

```ruby
frequency_penalty: 0.2 # Useful when summaries become repetitive
```

4. presence_penalty encourages introducing new ideas. Example:

```ruby
presence_penalty: 0.1 # Not usually necessary for summarization, but can be used in specific tasks
```

Prompt Engineering for Better Summaries and Why It Is Important

Prompt design significantly impacts output quality. Instead of a generic prompt:

```text
"Summarize this text"
```

Use structured instructions:

```text
"Summarize the following text in 3 bullet points. Focus on the key ideas and avoid unnecessary details."
```

This simple change improves clarity, consistency, and usefulness of the generated summary. In practice, prompt design becomes even more important when working with different types of content, such as technical documentation or CRM notes. I cover prompt engineering with practical examples in more detail in another article.

Example of a more structured prompt:

```ruby
{
  role: "user",
  content: <<~PROMPT
    Summarize the following article in 5 bullet points.

    #{@text}
  PROMPT
}
```

Handling Very Long Documents

LLMs have token limits, so large texts must be processed in chunks. A typical approach looks like this:

1. Split text into chunks
2. Summarize each chunk
3. Combine summaries
4. Generate a final summary

Avoid Naive Chunking

This is not ideal:

```ruby
text.scan(/.{1,3000}/m)
```

It may cut sentences in half.
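A quick way to see the problem is to run the same pattern with a tiny limit. The sketch below uses 20 characters in place of 3000 purely so the effect is visible; the sample text is made up for illustration.

```ruby
# Illustration only: a 20-character limit stands in for 3000.
text = "Rails favors convention. Sidekiq runs jobs in the background."

# Fixed-width slicing ignores word and sentence boundaries.
chunks = text.scan(/.{1,20}/m)

# The first chunk ends at "convent", in the middle of a word.
chunks.each { |chunk| puts chunk.inspect }
```

No text is lost (joining the chunks gives back the original), but each chunk reaches the model without the start or end of its sentences, which tends to hurt the partial summaries.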
Prefer:

- Splitting by paragraphs
- Splitting by sentence boundaries

Text Chunking

```ruby
class TextChunker
  # Splits on blank lines, so each chunk is one paragraph
  def self.chunk(text)
    text.split("\n\n")
  end
end
```

Chunk Summarization

```ruby
def summarize_long_text(text)
  chunks = TextChunker.chunk(text)

  # Summarize each chunk on its own
  partial_summaries = chunks.map do |chunk|
    Openai::Summarizer.new(chunk).call
  end

  # Summarize the combined partial summaries into a final summary
  Openai::Summarizer.new(partial_summaries.join("\n")).call
end
```

Using Background Jobs for Summarization

Summarizing large text can take some time, so it’s better to process it asynchronously. Example of our service usage with Sidekiq:

Generate Summary Job

```ruby
class GenerateSummaryJob
  include Sidekiq::Job

  def perform(article_id)
    article = Article.find(article_id)
    summary = Openai::Summarizer.new(article.content).call

    article.update!(summary: summary)
  end
end
```

Error Handling

Always assume external APIs can fail.

```ruby
# summarize_safely is an illustrative wrapper name; fallback_summary is assumed to be defined elsewhere
def summarize_safely(text)
  Openai::Summarizer.new(text).call
rescue StandardError => e
  Rails.logger.error(e.message)
  fallback_summary
end
```

Also consider:

- Retries
- Timeouts
- Monitoring

Cost Optimization

When you use AI features in production, cost management becomes critical. Cost depends primarily on token usage, meaning the more text you send and receive, the more you pay. Some tips for cost optimization:

1. Limit Input Size

The most effective optimization is reducing the amount of text sent to the AI model. Instead of summarizing an entire document, you can:

- Extract relevant sections
- Summarize those sections only

Example filtering before sending to OpenAI:

```ruby
class TextPreprocessor
  MAX_LENGTH = 5000

  # Trims surrounding whitespace and keeps at most MAX_LENGTH characters
  def self.clean(text)
    text.strip[0...MAX_LENGTH]
  end
end
```

Usage:

```ruby
clean_text = TextPreprocessor.clean(article.content)
summary = Openai::Summarizer.new(clean_text).call
```

This ensures you never send extremely large inputs.

2. Choose the Right Model

Not every task requires the most powerful model. For summarization, smaller models often perform well.
Example:

```ruby
model: "gpt-4.1-mini"
```

Advantages:

- Much cheaper
- Faster responses
- Good summarization quality

So, use larger models only for complex reasoning tasks.

3. Token Counting Before Requests

Sometimes the text is larger than expected. Using a token estimation step helps prevent sending oversized prompts. Example:

```ruby
# Character count is only a rough proxy for token count
def too_large?(text)
  text.length > 12000
end
```

If too large:

- Split the text into smaller chunks
- Summarize it in parts

Conclusion

Throughout this article, we built a summarization pipeline in Rails using a clean service-oriented approach:

- Simple summarization service
- Prompt optimization
- Chunking for large documents
- Background processing with Sidekiq
- Cost and reliability improvements
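As a closing sketch, the chunking step from Handling Very Long Documents could be made sentence-aware along the following lines. The class name SentenceChunker, the max_chars parameter, and its 3000-character default are assumptions for illustration, not part of the service shown earlier. Sentence detection here is a simple punctuation heuristic and will mis-split abbreviations such as "e.g.".

```ruby
# Hypothetical helper: packs whole sentences into chunks of roughly max_chars characters.
# A single sentence longer than max_chars is kept whole rather than cut.
class SentenceChunker
  # Naive heuristic: a sentence ends at ., ! or ? followed by whitespace.
  SENTENCE_BOUNDARY = /(?<=[.!?])\s+/

  def self.chunk(text, max_chars: 3000)
    chunks = []
    current = +""

    text.split(SENTENCE_BOUNDARY).each do |sentence|
      sentence = sentence.strip
      next if sentence.empty?

      # Start a new chunk when adding this sentence would exceed the limit.
      if !current.empty? && current.length + 1 + sentence.length > max_chars
        chunks << current
        current = +""
      end

      current << " " unless current.empty?
      current << sentence
    end

    chunks << current unless current.empty?
    chunks
  end
end

chunks = SentenceChunker.chunk(
  "First sentence here. Second one follows! Is this the third? Yes.",
  max_chars: 45
)
chunks.each { |chunk| puts chunk }
```

Each chunk could then be passed to Openai::Summarizer in place of the paragraph split from TextChunker. Because the limit is measured in characters, it remains only an approximation of the model's token limit.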
Most engineers learn these laws the hard way. A rewrite doesn’t deliver. A project is already late, and adding engineers to the team just makes it fail faster. A metric is introduced to measure progress, and the whole team starts trying to manipulate it. Then, six months later, someone mentions a 1975 law that addresses exactly what happened.

I paid a price to learn this, too: I spent half my career learning these lessons the hard way, as many others probably did. The twenty laws listed below are the ones I refer to most often, although there are more (more on this later).

Software development laws explain what is happening, what is about to happen, and what will not work no matter how hard you try. Some of these laws are sixty years old. They still apply to software development in 2026, and they will still apply in 2036 because they are not really about software. They are about people working together to build things under time pressure (basically, a lot of them are just laws of human nature).

These laws are not rules that tell you what to do. They tell you what is already happening, but you still have to make the decisions. These laws just help you understand what is going on.

Each of these laws made the list because I have experienced them myself. My book covers all fifty-six laws. If you only have time to remember twenty software development laws, these are the ones that I think are important.

In particular, we will talk about the following laws:

- Gall’s Law: A complex system that works is always built from a simple system that worked first.
- KISS: Keep it simple. Anything beyond that is overhead.
- Conway’s Law: Organizations design systems that mirror their communication structure.
- Hyrum’s Law: With enough users, every observable behavior of your API becomes someone’s dependency, no matter what the contract says.
- CAP Theorem: A distributed system can guarantee only two of: consistency, availability, and partition tolerance.
- Zawinski’s Law: Every program expands until it can read mail. The ones that cannot are replaced by ones that can.
- Brooks’s Law: Adding people to a late software project makes it later.
- Ringelmann Effect: Individual output drops as team size goes up.
- Price’s Law: Half the work is done by the square root of the people.
- Dunning-Kruger Effect: The less you know about something, the more confident you tend to be.
- Hofstadter’s Law: It always takes longer than you expect, even when you account for Hofstadter’s Law.
- Parkinson’s Law: Work expands to fill the time available.
- Goodhart’s Law: When a measure becomes a target, it stops being a good measure.
- Gilb’s Law: Anything you need to quantify can be measured in some way that beats not measuring it.
- Knuth’s Optimization Principle: Premature optimization is the root of all evil.
- Amdahl’s Law: The speedup from parallelism is limited by the sequential part.
- Murphy’s Law: Anything that can go wrong will go wrong.
- Postel’s Law: Be conservative in what you send, liberal in what you accept.
- Sturgeon’s Law: 90% of everything is crap.
- Cunningham’s Law: The fastest way to get the right answer online is to post the wrong one.

So, let’s dive in.

How Systems Get Built

1. Gall’s Law

A complex system that works is always built from a simple system that worked first.

Systems do not work as well in real life as they do on paper because many problems do not surface until they hit the real world. These problems only appear when real users interact with systems, and by then, the systems either work or they do not. Every complex system that works got that way one step at a time.
The systems that try to be perfect from the start usually fail. This is why most new versions of systems rewritten from scratch do not work out: teams keep all the features they had before, but lose the simple things that made the old systems good.

Examples. Take Instagram. It did not start as a picture-sharing platform. The app was called Burbn, and it combined check-ins, gaming, and photo sharing, all stuck together. Then the founders cut everything except photo sharing, and the stripped-down core became the product.

Google Wave went the other way. It launched with chat, email, a forum, and a document editor, all at once. Nobody could tell you what it was for, and it was dead in 15 months.

2. KISS (Keep It Simple, Stupid)

Keep it simple. Anything beyond that is overhead.

The KISS principle is a reminder that simplicity should be our key goal. If you can solve a problem with a 50-line script instead of a complex 500-line solution, KISS favors the simpler solution because each line of code has the potential to cause an error.

Why is simplicity so important? Software, in general, is complex to build and must be understood by humans. A simple design is much easier to maintain: new team members can get up to speed faster, bugs are easier to localize, and modifications cause fewer ripple effects.

The KISS principle encourages developers to resist “clever” code that does too much at once, and to avoid architecting solutions that address future problems at the cost of current complexity.

Example. Say a startup needs a feature-flag system and decides to build a custom solution. They build it as a separate microservice with its own database, cache, admin UI, WebSocket notifications, and A/B testing support. It introduces a lot of complexity, takes a long time to build, and creates a lot of trouble when something goes wrong. What they needed was a JSON config file. This would have taken an afternoon.

3. Conway’s Law

Organizations design systems that mirror their communication structure.

Your app architecture is already defined, and it is essentially the same as your organization chart. For example, if you have four teams working on a project, you will probably end up with an app that has four parts. If the teams that work on the frontend, the backend, and the data do not communicate, your application will have three parts that do not work well together. If you rewrite your system without changing how your company is organized, you will still have the same system, just written in a different language.

The other way around works too. You can pick the architecture you want and then create teams that would naturally produce that kind of system. Amazon did this back in the 2000s. They broke their system down into smaller services managed by small teams, which changed how the system and the company worked together. This is called the Inverse Conway Maneuver.

Examples. Many modern AI organizations split research from application engineering. Research then optimizes benchmarks, while product ships apps against real users. The output is a model that scores well and a product that doesn’t work, because each side is optimizing for its own communication boundary.

The pattern shows up at a small scale, too. A three-person team almost always ships a monolith because the cost of breaking it up is higher than the cost of keeping it together.

4. Hyrum’s Law

With enough users, every observable behavior of your API becomes someone’s dependency, no matter what the contract says.

The interface contract you wrote is not the real contract. The real one is what your system actually does, including the parts you never expected to be important: timing, error message text, key order in JSON responses, the exact bytes of a hash. Someone, somewhere, is depending on all of it. This is why backward compatibility costs so much in mature systems.
This means that you don’t actually maintain the API you designed, but the accidental one.

Examples. A good example is the SimCity game. I remember well that it had a use-after-free bug that worked fine on Windows 3.x because memory was never actually reclaimed. Then, Windows 95 reclaimed it, and SimCity crashed. Microsoft shipped Windows 95 with a special memory-allocator mode that was activated only when SimCity was running, so the game would continue to work despite the bug.

Browsers do this at internet scale. Every quirk that web developers came to rely on effectively becomes part of the platform. The browser can’t change the quirk without breaking half the web.

5. CAP Theorem

A distributed system can guarantee only two of the following: Consistency, Availability, and Partition tolerance.

Networks fail. In a distributed system, that's not something you design around. It's something you accept. Once a partition happens, you have to pick: block writes to keep data consistent, or keep serving traffic and let replicas drift.

Every distributed database makes this call. Most just don't tell you which one. They hide behind labels like "eventually consistent" or "highly available" and leave you to find out during an incident.

Examples. MongoDB favors consistency, meaning that when a partition problem occurs, some MongoDB replicas will not accept any data until the entire system is working properly again. On the other hand, Cassandra will keep answering queries even when the replicas do not agree, and it will later fix the inconsistencies. Neither MongoDB nor Cassandra is wrong. They are just making choices about what your system can afford to lose.

6. Zawinski’s Law

Every program expands until it can read mail. The ones that cannot are replaced by ones that can.

Feature creep is not something that happens during the process. It is actually the process itself. When a tool is good at what it does, and people like it, they start using it all the time.
The people in charge of the product want to keep users engaged and on the platform, so the tool begins to take on adjacent tasks. Over time, it becomes slow and bloated with features few people need. Then a new competitor comes along with a simpler version that does just the core job. As that app's popularity grows, it picks up unnecessary features of its own, and the cycle repeats.

Examples. A famous example is Netscape, which started as a browser and ended as a suite with email, news, and a web editor. Firefox arrived as the fix: it stripped the suite back down to a browser, got popular, and then added plugins and a developer toolchain. We also remember Slack, which was launched to kill email and now has voice, video, bots, and an app directory. This is what happens when a product lacks the right north star metric.

How Teams Lose Speed

7. Brooks’s Law

Adding people to a late software project makes it later.

Software work is not easy to split among team members. When you bring someone new onto the project, it takes them a while to get up to speed, which means your experienced people have to stop what they are doing to help the new person learn. If your project is already behind schedule, adding more people won't make it go faster. It will just make things worse. Frederick P. Brooks put it well: nine women cannot deliver a baby in one month. Software work is like that, too. It does not get done faster just because you have more people working on it.

Example. Once, I was the team lead of eight people, and we were always behind schedule. My first thought was to hire two engineers to help us catch up. But while we were searching for new people, two people left us. Suddenly everything seemed to work better: communication was easier, and we managed to do more than before. In our case, the solution was to make the team smaller, not bigger.

8. Ringelmann Effect

As teams grow, output per person falls.
When many people pull on the rope, each person does not pull as hard. Some of this is coordination loss, because it is hard to pull in sync, and some of it is because people assume someone else will do their part. Either way, the pattern is real, and it is more extreme than most people think.

Examples. A large GitHub study measured this directly. Developers on teams of 2-5 people averaged around 1,850 lines of code a month, while a team of 10 dropped to 1,200. At 50 or more, it was 450. Output per person fell by roughly 75%. This is why small teams ship faster than big ones, and why Amazon’s two-pizza rule holds true. It’s a defense against Ringelmann. This is especially true in today's AI-driven world, where productive teams have fewer members than before, as AI is driving up personal and team productivity.

9. Price’s Law

Half the work is done by the square root of the people.

In a group of 100 people, about 10 people actually do half of the work that matters. If you have a group of 16 people, it is likely that 4 people do half of the work. The pattern shows up in many creative fields. The people who do most of the work are really important, but the others are important too, because they do what needs to be done to support everyone else. They make sure everything runs properly (sometimes called glue work). So we need both groups, but the problem is that if the top people in your group leave, the group will lose a lot of its ability to get things done.

Example. We all know that when Musk took over Twitter, the company cut its staff by roughly 50%, and the site kept running. Price’s Law predicted that. What the law did not predict was what the layoffs removed: depth in trust and safety, SRE coverage, and incident response. The top performers kept the lights on. The organization lost the ability to handle the next hard problem, and Twitter quietly asked some laid-off people to come back.

Why Plans Drift

10. Hofstadter’s Law

It always takes longer than you expect, even when you account for Hofstadter’s Law.

Let’s say you need to estimate how long something will take. You think four weeks is a reasonable estimate, but then you remember that your guesses are usually too optimistic, so you double it to eight weeks, just to be sure. In the end, it takes sixteen weeks. Now you think you will do better next time, won't you? You estimate sixteen weeks because that's what it took last time. No, it now takes thirty-two weeks, because things you don’t know about surprise you, such as unplanned integration issues or requirement changes. In practice, Hofstadter’s Law explains why techniques like padding estimates, awareness of Parkinson’s Law, and the use of historical data are essential, yet surprises still occur.

Example. A good example of Hofstadter’s Law is the Berlin Brandenburg Airport project. The software integration process took much longer than expected, as it involved 75,000 sensors and 50,000 light fittings. The plan was to finish in 18 months, but the team later realized this was not possible and extended the timeline to 30 months. In the end, it took 7 years to complete, with a final cost of €7 billion. This was 2.5x higher than planned, and the airport opened 9 years late.

11. Dunning-Kruger Effect

The less you know about something, the more confident you tend to be.

Here is the uncomfortable part. The skill you need to do something is the same skill you need to judge how well you did it, and this is the problem. People who are not very good at something cannot see what they are doing wrong, so they think they are better at it than they really are. People who are good at it, on the other hand, see all the things they are still getting wrong, so they think they are not as good at it as they really are.

Examples.
When asked when something will be done, new developers often give confident, precise estimates, while experienced developers give ranges (the famous “it depends” answer). The juniors aren’t wrong to be convinced. They simply don’t yet know what they don’t know (unknown unknowns). People usually get really excited about a new technology at first, precisely because they have not used it much yet. We are seeing this happen with artificial intelligence now. The people who say AI can do anything are usually the ones who do not use it every day, like managers.

12. Parkinson’s Law

Work expands to fill the time available.

If you give a developer two weeks to do a task that can be done in two days, it will take two weeks to finish. This does not mean the developer is lazy or puts things off. People tend to fill up the time they have. Over the two weeks, the developer will likely spend time making plans, trying things, and adding extra work that does not need to be done (gold-plating). But if there were a deadline to have this done in a day, it would probably be done on that day. So, teams should set clear and realistic time limits (aka deadline-driven development). However, managers must use this judiciously, combining Parkinson’s insight with realistic scheduling. If you compress timelines too much, you risk running into Hofstadter’s Law, which reminds us that work often still takes longer than expected, even with buffers.

Examples. A developer given two months for a one-week task will spend a month prototyping alternatives, another week on architecture debates, and the last three weeks polishing details nobody asked for. Give the same task a clear one-week deadline, and it will be shipped in one week.

How Metrics Distort Work

13. Goodhart’s Law

When a measure becomes a target, it stops being a good measure.

We can measure our work in many different ways, e.g., number of bugs closed, number of incidents, test coverage, or team velocity. When we start judging people's performance by these numbers, they will focus on making the numbers look good instead of actually doing good work. The numbers will go up, but the work will not get any better. When we give people incentives, they will do what gets them the reward, not what we really want. When we measure the wrong thing, people will do the wrong thing to get ahead.

Examples. I watched a team get rewarded for lines of code written in the early 2000s, and for the number of PRs created some years later. Developers started copy-pasting instead of extracting shared logic. Some created PRs for almost every commit they made. The modern version is AI tokens consumed per engineer (called tokenmaxxing), where more tokens are treated as a sign of productivity.

14. Gilb’s Law

Anything you need to quantify can be measured in some way that beats not measuring it at all.

Gilb's Law is the other side of the coin to Goodhart’s Law. Looking at Goodhart’s Law, you could conclude that having metrics is bad, but that is not true. Having no metrics is even worse. If something is important to you, you should try to find a way to measure it, because we cannot improve what we don’t measure (a line often attributed to Peter Drucker).

Example. Developer productivity is hard to measure, and it always has been. We have had many bad metrics, from lines of code to token consumption. But deployment frequency and change lead time (two of the DORA metrics for DevOps) give you a usable signal as a proxy.

What Breaks Under Load

15. Knuth’s Optimization Principle

Premature optimization is the root of all evil.

Most performance work happens too early and in the wrong place.
Teams optimize code paths that never become hot, introduce complexity they never need, and burn time solving a scale problem they may never earn. So the best way is to write code that works, then measure its performance. If there is a problem, a profiler will show you where it is. If not, just move on.

Examples. I worked at a startup once, where we spent a lot of time setting up Kubernetes. We did it to handle millions of users, and we didn’t even have 10 users yet. We were making our infrastructure ready for a load that didn’t exist, while our product features were not even finished. One of my colleagues said that we should make sure 100 people even want our product before we worry about handling millions of users. He was right. We still launched late.

16. Amdahl’s Law

The speedup from parallelism is limited by the sequential part.

If 10% of your work has to be done sequentially, the work can go at most 10x faster, no matter how many computers you use. If 50% of the work has to be done one step at a time, it can go at most twice as fast. The same thing happens with people. If one group has to approve every decision about how something is built, that limits how fast your team can work, no matter how many engineers you have. If you add engineers, but they all have to wait for the same group to say yes, the queue just gets longer. Your engineers will only go as fast as the group making the decisions.

Examples. Scaling web traffic by adding more app servers helps until every request hits one shared database or authentication service. After that, more horizontal scaling doesn’t help. The conversation about AI productivity is now hitting the same ceiling.
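The 10x and 2x ceilings above follow from the usual statement of Amdahl's Law: with a sequential fraction s and N workers, the speedup is 1 / (s + (1 - s) / N). Here is a minimal Python sketch of that arithmetic. The function names are illustrative, not from any library.

```python
def amdahl_speedup(sequential_fraction: float, workers: int) -> float:
    """Speedup with `workers` parallel workers when part of the work stays sequential."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    # Only the parallel share shrinks as workers are added.
    return 1.0 / (sequential_fraction + parallel_fraction / workers)


def amdahl_limit(sequential_fraction: float) -> float:
    """Ceiling on the speedup as the number of workers grows without bound."""
    if sequential_fraction == 0.0:
        return float("inf")
    return 1.0 / sequential_fraction


if __name__ == "__main__":
    for s in (0.10, 0.50):
        for n in (2, 8, 64, 1024):
            print(f"sequential={s:.0%} workers={n:>4} speedup={amdahl_speedup(s, n):.2f}x")
        print(f"sequential={s:.0%} ceiling={amdahl_limit(s):.0f}x")
```

Running it shows the speedup flattening well before the worker count gets large, which is the point of the law: past a certain size, the sequential share dominates.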
AI makes coding faster, but you still have to think, review, fix errors, and coordinate with other people, and those steps can’t be parallelized away. This sets the limit on how much you can gain in the end. That’s why some engineers see their work speed up by 10 times, and others see a 1.2 times increase.

17. Murphy’s Law

Anything that can go wrong will go wrong.

In software, Murphy’s Law is often mentioned to explain bugs and production incidents: whatever can go wrong in code (a null pointer, a race condition, a network outage) will eventually manifest, especially with large user bases or at the worst possible time (Friday evening). In practice, this law encourages developers to write more defensive code. This means checking for nulls, handling exceptions, validating inputs, and failing gracefully when errors occur. It also reminds DevOps teams to anticipate failures by implementing monitoring, enabling rollbacks, and maintaining contingency plans.

Example. On July 19, 2024, CrowdStrike pushed a faulty update to its Falcon Sensor configuration. The update triggered a memory error on Windows machines, and about 8.5 million of them crashed to a blue screen. To fix the problem, someone had to log in to each machine and apply the fix by hand, because those machines could not start up. This could not be done remotely. And it happened early on a Friday morning, before many IT teams were at their desks. It caused problems for airlines, hospitals, and banks. Everything that could go wrong did go wrong that day, just like Murphy’s Law says.

18. Postel’s Law

Be conservative in what you send, liberal in what you accept.

This law says that if your server sends HTTP responses, it should format headers exactly per spec. But if it receives an HTTP request with an uncommon header order or an unusual format, it should still process the request rather than drop the connection, as long as it can interpret it safely. Browsers do this at scale.
Most of the HTML on the web is not written correctly, but modern browsers still render it. If they were strict, half the internet would not render. But there is one thing to consider. Being too liberal has a cost: if everyone accepts anything, problems never get corrected, and the mess just grows. In security-sensitive code, tolerating malformed input can give attackers more room to find exploits. So the basic idea still holds, but you need to use judgment: being tolerant is not the same as accepting everything.

Example. In APIs, say your service expects a timestamp. If it receives a timestamp without a time zone, instead of rejecting it, you might assume UTC or try to parse it anyway, being liberal in what you accept. But when your service returns data, you always include the time zone, so the output is conservative and precise.

How to Judge Better

19. Sturgeon’s Law

90% of everything is crap.

Most things we make will go unused, and most of the code we write is not good. Most projects we start do not deliver the value that we thought they would. This is not a bad thing per se. This is how things are when we are trying to create something new. If we pretend everything is great, we will treat every project the same, which will make things too complicated. The projects that really matter are maybe 10% of the total. Finding these projects and getting rid of all the others is what really takes skill.

Example. WordPress has roughly 57,000 plugins in its directory. Over 34,000 haven’t been updated in the past 2 years, and nearly 19% have zero active installs. A small number of well-maintained plugins powers 40%+ of the public web. That distribution is Sturgeon’s Law in one screenshot.

20. Cunningham’s Law

The fastest way to get the right answer online is to post the wrong one.

When you ask a question on an online forum, you often get no response. If you post something that is clearly incorrect, people will jump in to correct you.
They might just walk past a question, but they cannot help themselves when they see something that is wrong. You can use this to your advantage. If you are having trouble with something, do not ask how you should do it. Instead, propose a solution you know is not very good, or share a draft, and see what happens. The right answer might come to you without you even asking for it. Note that this trick only works when the people around you know what they are talking about. If you are in a group where everyone is just as confused as you are, a wrong answer can cause more harm than good, because it can turn into something people start to believe.

Example. The whole bet of wikis, and later Wikipedia, runs on this insight. People correct errors faster than they write articles from scratch. The bet paid off at civilization scale.

Conclusion

In this article, I shared some of the most impactful laws I have seen in my career. You do not have to memorize all of them. The top five or six laws will help you solve most of your issues. The rest are there for when a new problem arises. What is more important is knowing when a law applies and when it does not. These twenty laws often conflict with each other. Knuth says do not optimize early. Amdahl says find and fix the part of your project that is slowing everything down. Both are correct at times. The key is to know which one to use now.

Also, this list is my list. Your list will be different. The laws that have caused you problems will be more important to you than the ones that have not. Over time, you will add your own laws. Write them down when you notice them. One line per project, incident, or rewrite. Which law helped you? Which law gave you advice? What changed? Your personal list will be more helpful to you than any list I can give you.

Frameworks, platforms, and deployment models have changed since Brooks wrote his book in 1975. These laws have not changed.
They describe the one thing that has not changed: humans building things together under constraints they do not yet fully understand. That is why they are worth learning before the project, not after it causes problems.
Why Fine-Tune on Databricks?

General-purpose LLMs like Llama 3, Mistral, or Falcon are impressive out of the box, but they underperform on domain-specific tasks: medical coding, legal clause extraction, internal support ticket classification, and financial report summarization. Fine-tuning adapts a pre-trained model's weights to your domain using your proprietary labeled data.

Doing this at scale introduces real engineering challenges:

- Training data lives in Delta Lake across dozens of tables
- GPU clusters need to be orchestrated, not hand-managed
- Experiment tracking must be reproducible and auditable
- Models need a promotion workflow before they touch production traffic

Databricks solves all of this in one platform:

- Apache Spark for large-scale data preparation
- MLflow (built-in) for experiment tracking, model registry, and lineage
- Databricks Model Serving for one-click deployment with auto-scaling
- Unity Catalog for governed model and data access

The ML Lifecycle Architecture

Training Pipeline: End-to-End Flow

The flow below shows how a single training run moves through the system, from a triggered job to a promoted model alias.
Environment Setup

```python
# Databricks Runtime ML 14.x+ recommended (ships CUDA, PyTorch, Transformers)
# Install additional packages in your cluster init script or notebook.
# The 4-bit loading used later also needs bitsandbytes; add it here if your runtime does not ship it.
%pip install \
    transformers==4.40.0 \
    peft==0.10.0 \
    trl==0.8.6 \
    accelerate==0.29.3 \
    horovod[spark]==0.28.1 \
    datasets==2.19.0 \
    evaluate==0.4.1 \
    --quiet
dbutils.library.restartPython()

# Run the rest in a new cell, after the Python restart
import os
import mlflow
import mlflow.transformers
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from pyspark.sql import functions as F
from datasets import Dataset

# ── MLflow setup ──────────────────────────────────────────────────────────────
# On Databricks, MLflow tracking URI is pre-configured to the workspace
# mlflow.set_tracking_uri("databricks")  # uncomment for external clusters
# Replace <your-email> with your workspace user folder
EXPERIMENT_NAME = "/Users/<your-email>/llm-finetuning/support-classifier"
mlflow.set_experiment(EXPERIMENT_NAME)

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
CATALOG = "prod"
GOLD_DB = f"{CATALOG}.gold"
MODEL_NAME = f"{CATALOG}.ml.support_intent_classifier"  # Unity Catalog model path

print(f"GPU available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
```

Preparing Training Data With Spark

Spark handles the heavy lifting before training: filtering noisy records, formatting prompt-response pairs, and splitting the dataset. This stage runs on the CPU cluster. GPU nodes only spin up for the actual training job.

```python
# ── Spark Data Preparation ────────────────────────────────────────────────────
def build_prompt(message, intent_label):
    """
    Format a support conversation into an instruction-following prompt.
    Uses the Mistral instruct template: [INST] ... [/INST]
    """
    return f"[INST] Classify the intent of this support message:\n\n{message} [/INST] {intent_label}"

# Load from Delta Gold table
raw_df = (
    spark.table(f"{GOLD_DB}.support_conversations")
    .filter(F.col("quality_score") >= 0.85)    # keep high-quality labels only
    .filter(F.col("intent_label").isNotNull())
    .filter(F.length("message") > 20)          # filter empty/stub messages
    .filter(F.length("message") < 2048)        # filter messages too long to tokenize
    .dropDuplicates(["message_hash"])          # remove exact duplicates
    .select("message", "intent_label", "created_date")
    .limit(500_000)                            # cap for this training run
)
print(f"Training candidates: {raw_df.count():,}")

# Build prompt strings using Spark, parallelized across all workers
prompt_udf = F.udf(build_prompt, returnType="string")

prepared_df = (
    raw_df
    .withColumn("prompt", prompt_udf(F.col("message"), F.col("intent_label")))
    .withColumn("token_count", F.size(F.split(F.col("prompt"), r"\s+")))  # rough word count proxy, not real tokens
    .filter(F.col("token_count") < 512)        # stay within the tokenizer max_length used below
    .select("prompt", "token_count", "created_date")
)

# Random 80/10/10 split (seeded). Note: randomSplit is not stratified by intent_label.
train_df, val_df, test_df = prepared_df.randomSplit([0.80, 0.10, 0.10], seed=42)

# Persist splits to Delta for lineage + reproducibility
train_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_train_split")
val_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_val_split")
test_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_test_split")
print(f"Train: {train_df.count():,} | Val: {val_df.count():,} | Test: {test_df.count():,}")
```

Fine-Tuning With Hugging Face + MLflow Tracking

We use LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that freezes the base model and trains only a small set of adapter matrices. The code below also loads the frozen base model in 4-bit precision (QLoRA). This cuts GPU memory requirements by ~70% compared to full fine-tuning, making 7B parameter models trainable on a single A100.

```python
# ── LoRA Fine-Tuning with MLflow Tracking ────────────────────────────────────
# Convert Spark DataFrame to Hugging Face Dataset
train_pd = spark.table(f"{GOLD_DB}.llm_train_split").select("prompt").toPandas()
val_pd = spark.table(f"{GOLD_DB}.llm_val_split").select("prompt").toPandas()
hf_train = Dataset.from_pandas(train_pd)
hf_val = Dataset.from_pandas(val_pd)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(
        batch["prompt"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

hf_train_tok = hf_train.map(tokenize, batched=True, remove_columns=["prompt"])
hf_val_tok = hf_val.map(tokenize, batched=True, remove_columns=["prompt"])

# Load base model in 4-bit quantization (QLoRA)
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Apply LoRA adapter config
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank: higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention layers to adapt
    bias="none",
)
model = get_peft_model(base_model, lora_config)
# Prints trainable vs. total parameter counts; the adapters are well under 1% of the weights.
# Exact numbers depend on the rank, target modules, and library versions.
model.print_trainable_parameters()

# Training arguments
training_args = TrainingArguments(
    output_dir="/dbfs/tmp/llm-finetune/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,         # effective batch size = 32
    warmup_ratio=0.03,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                             # use bfloat16 on A100/H100
    logging_steps=50,
    evaluation_strategy="steps",           # named eval_strategy from transformers 4.41 onward
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="mlflow",                    # pipe all metrics to MLflow automatically
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# ── MLflow Run ────────────────────────────────────────────────────────────────
with mlflow.start_run(run_name="mistral-7b-lora-v1") as run:
    # Log hyperparameters manually for full auditability
    mlflow.log_params({
        "base_model": BASE_MODEL,
        "lora_rank": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "lora_dropout": lora_config.lora_dropout,
        "target_modules": str(lora_config.target_modules),
        "quantization": "4-bit QLoRA (nf4)",
        "train_samples": len(hf_train_tok),
        "val_samples": len(hf_val_tok),
        "epochs": training_args.num_train_epochs,
        "effective_batch": training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps,
        "learning_rate": training_args.learning_rate,
    })

    # Train: metrics are auto-logged to MLflow via report_to="mlflow"
    trainer.train()

    # Log final eval metrics explicitly
    eval_results = trainer.evaluate()
    mlflow.log_metrics({
        "final_eval_loss": eval_results["eval_loss"],
        "final_eval_perplexity": torch.exp(torch.tensor(eval_results["eval_loss"])).item(),
    })

    # Log the model + tokenizer as a single MLflow artifact
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        artifact_path="model",
        task="text-generation",
        registered_model_name=MODEL_NAME,   # auto-registers to Unity Catalog
        metadata={"base_model": BASE_MODEL, "finetuning": "QLoRA"},
    )

run_id = run.info.run_id
print(f"Run ID: {run_id}")
print(f"Eval Loss: {eval_results['eval_loss']:.4f}")
```
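To make "a small set of adapter matrices" concrete, here is a back-of-the-envelope count of LoRA trainable parameters. It is a sketch under stated assumptions, not output from the training run: the helper names are hypothetical, and the shapes assume the published Mistral-7B configuration (hidden size 4096, 32 layers, and grouped-query attention with 8 key/value heads of dimension 128, so `v_proj` maps 4096 to 1024).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Input/output width of one linear projection that receives an adapter."""
    in_features: int
    out_features: int


def lora_params_per_module(shape: LinearShape, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank) beside the frozen weight.
    return rank * (shape.in_features + shape.out_features)


def lora_trainable_params(modules: list[LinearShape], rank: int, num_layers: int) -> int:
    # Every transformer layer gets one adapter pair per targeted module.
    per_layer = sum(lora_params_per_module(m, rank) for m in modules)
    return per_layer * num_layers


# Assumed Mistral-7B attention shapes (see the lead-in); adjust for other models.
Q_PROJ = LinearShape(in_features=4096, out_features=4096)
V_PROJ = LinearShape(in_features=4096, out_features=1024)
NUM_LAYERS = 32

if __name__ == "__main__":
    for rank in (8, 16, 32):
        total = lora_trainable_params([Q_PROJ, V_PROJ], rank, NUM_LAYERS)
        print(f"rank={rank:>2}: {total:,} trainable adapter parameters")
```

Under these assumptions, the count grows linearly with the rank and with the number of targeted modules, which is why widening `target_modules` or raising `r` is the main lever for adapter capacity and memory.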
Distributed Training With Horovod on Spark

For datasets beyond a few million tokens, or when single-node training takes too long, you can spread the work across several GPUs. Horovod distributes training across multiple GPU workers using ring-allreduce: each worker holds a full model replica, and gradients are averaged across workers after every backward pass. Because every worker holds a full replica, this reduces wall-clock time rather than per-GPU memory, so the model still has to fit on a single GPU.

```python
# ── Distributed Fine-Tuning with Horovod on Spark ────────────────────────────
# Best for: datasets > 5M tokens, or when you need to reduce wall-clock
# training time below a business SLA.
import horovod.torch as hvd
from sparkdl import HorovodRunner

def train_fn(hparams):
    """
    Training function executed on each Horovod worker.
    Each worker trains on a data shard; gradients are averaged across workers.
    """
    import mlflow
    import torch
    import horovod.torch as hvd
    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
    from datasets import load_from_disk

    hvd.init()

    # local_rank selects the GPU on this node; the global rank selects the data shard
    local_rank = hvd.local_rank()
    torch.cuda.set_device(local_rank)

    # Load the dataset shard for this worker. Shards are keyed by global rank,
    # because local ranks repeat on every node.
    dataset = load_from_disk(f"/dbfs/tmp/llm-finetune/train_shards/shard_{hvd.rank()}")

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.bfloat16,
    ).to(f"cuda:{local_rank}")

    # Wrap optimizer with Horovod DistributedOptimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=hparams["lr"])
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        compression=hvd.Compression.fp16,   # compress gradient communication
    )

    # Broadcast initial model weights from rank 0 to all workers
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    training_args = TrainingArguments(
        output_dir="/dbfs/tmp/llm-finetune/hvd_output",
        num_train_epochs=hparams["epochs"],
        per_device_train_batch_size=hparams["batch_size"],
        bf16=True,
        dataloader_num_workers=2,
        # Only rank 0 logs and saves, which avoids duplicated artifacts
        report_to="mlflow" if hvd.rank() == 0 else "none",
        save_strategy="epoch" if hvd.rank() == 0 else "no",
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        optimizers=(optimizer, None),
    )
    trainer.train()

    # Only rank 0 registers the model
    if hvd.rank() == 0:
        mlflow.transformers.log_model(
            transformers_model={"model": model, "tokenizer": tokenizer},
            artifact_path="model",
            registered_model_name=MODEL_NAME,
        )

# Launch distributed training across N GPU workers
# np = number of processes = number of GPUs across all nodes
hr = HorovodRunner(np=8, driver_log_verbosity="all")  # 8 GPUs (e.g., 2 × 4-GPU nodes)
hr.run(train_fn, hparams={
    "lr": 2e-5,
    "epochs": 3,
    "batch_size": 2,   # per GPU; effective = 2 × 8 = 16
})
```

MLflow Model Registry and Promotion

Once a run completes, models land in the MLflow Model Registry. With Unity Catalog, Databricks uses model aliases (here candidate, staging, and champion) instead of the legacy stage-based workflow.
```python
# ── Model Registry Promotion Workflow ─────────────────────────────────────────
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get the newest registered version. latest_versions relies on legacy stages,
# which Unity Catalog does not support, so search the versions and take the highest.
versions = client.search_model_versions(f"name='{MODEL_NAME}'")
latest_version = str(max(int(mv.version) for mv in versions))

# Tag the new version as a candidate for review
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="candidate",
    version=latest_version,
)
client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="fine_tuned_on",
    value="gold.support_conversations",
)
client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="eval_loss",
    value=str(round(eval_results["eval_loss"], 4)),
)

# After human review / automated eval gates pass → promote to staging
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="staging",
    version=latest_version,
)

# After integration tests pass → promote to champion (production)
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="champion",
    version=latest_version,
)

# Load model by alias, which decouples code from version numbers
champion_model = mlflow.transformers.load_model(f"models:/{MODEL_NAME}@champion")
```

Serving With Databricks Model Serving

```python
# ── Deploy to Databricks Model Serving ────────────────────────────────────────
# Can also be done via the UI: Models > Serving > Create Endpoint
import json
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = dbutils.secrets.get("prod-scope", "databricks-token")

endpoint_config = {
    "name": "support-intent-classifier",
    "config": {
        "served_models": [
            {
                "name": "mistral-7b-lora-champion",
                "model_name": MODEL_NAME,
                "model_version": latest_version,
                "workload_size": "Small",        # concurrency tier
                "scale_to_zero_enabled": True,
                "workload_type": "GPU_LARGE",    # GPU class; the underlying GPU depends on your cloud
            }
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "mistral-7b-lora-champion", "traffic_percentage": 100}
            ]
        },
        "auto_capture_config": {
            "catalog_name": CATALOG,
            "schema_name": "ml",
            "table_name": "support_classifier_inference_log",
            "enabled": True,   # log all requests/responses to Delta
        },
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    data=json.dumps(endpoint_config),
)
response.raise_for_status()   # fail loudly if the endpoint was not created
print(response.json())

# ── Query the endpoint ────────────────────────────────────────────────────────
def classify_intent(message: str) -> str:
    payload = {
        "inputs": {"prompt": f"[INST] Classify the intent of this support message:\n\n{message} [/INST]"},
        "params": {"max_new_tokens": 50, "temperature": 0.1},
    }
    resp = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/support-intent-classifier/invocations",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    resp.raise_for_status()
    return resp.json()["predictions"][0]

print(classify_intent("My order hasn't arrived and it's been 10 days"))
# → e.g. "shipping_delay"
```

Comparing Fine-Tuning Strategies

| Strategy | GPU Memory | Training Time | Quality vs Full FT | When to Use |
|---|---|---|---|---|
| Full Fine-Tuning | Very High (80GB+) | Slowest | Baseline (100%) | Max quality, large budget |
| LoRA | Medium (24–40GB) | Fast | ~95% | Best general-purpose choice |
| QLoRA (4-bit + LoRA) | Low (10–16GB) | Medium | ~90–93% | Single GPU, cost-sensitive |
| Prefix Tuning | Low | Very Fast | ~80–85% | Minimal compute, quick iteration |
| Prompt Tuning | Very Low | Fastest | ~70–80% | Inference-only, no weight change |
| RLHF / DPO | High | Slowest | Best alignment | Instruction-following quality |
| Distillation | Medium (teacher) | Medium | Varies | Smaller, faster inference model |

Rule of thumb: Start with QLoRA on a single GPU. If eval loss stagnates or quality gates fail, move to LoRA on multi-GPU. Full fine-tuning is only warranted when you have >1M high-quality labeled examples and a measurable business case for the incremental quality gain.
Key Takeaways

- Spark handles data at scale before training even begins: filtering, tokenization, and splitting across millions of records in minutes.
- QLoRA + LoRA makes fine-tuning 7B–13B models accessible on a single A100, reducing memory footprint by ~70% with minimal quality loss.
- MLflow report_to="mlflow" gives you automatic experiment tracking with zero extra code: every loss curve, gradient norm, and learning rate schedule is captured.
- Unity Catalog model aliases (candidate → staging → champion) replace brittle version-number references in deployment code, making promotions and rollbacks a one-liner.
- Auto Capture on Databricks Model Serving logs every inference request and response to a Delta table, giving you a feedback loop to build your next training dataset.
- Horovod on Spark is the right tool when single-node training exceeds your SLA; it leverages your existing Spark cluster without a separate orchestration layer.

References

- Databricks: LLM Fine-Tuning on Databricks
- MLflow: Transformers Flavor Documentation
- Hugging Face PEFT: LoRA & QLoRA
- QLoRA Paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- LoRA Paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- Databricks: Model Serving (Foundation Model APIs)
- Horovod on Spark: Official Documentation
- Databricks: HorovodRunner API
- Databricks: Inference Tables (Auto Capture)
- InstructGPT / RLHF: "Training language models to follow instructions with human feedback" (OpenAI, 2022)
She had everything on the list. Eight years of experience. Strong systems design. Distributed architecture under her belt. The panel interview went well — one of the hiring managers later described it as the best technical conversation they'd had with a candidate all quarter. The team passed on her. Two weeks later, during a casual conversation with that hiring manager, the reason came out. It wasn't her architectural skills or her communication. It was a question someone had slipped in near the end: "Walk us through how you'd set up an AI-assisted code review pipeline for a team that ships twelve microservices." She described doing it manually. The other finalist described standing up an orchestration layer with context-aware models, configuring fallback thresholds, and building observable feedback loops that trained the team's prompt library over time. Same job title. Completely different mental model of what the job now involves. That story isn't unique. It captures something that's been happening gradually over the past eighteen months and then very suddenly in the last six: the senior developer role has quietly split into two jobs. One of them is the job we all trained for. The other is the job that a meaningful portion of your working week now actually requires. And the gap between developers who've accepted that and developers who haven't is becoming very hard to explain away in performance conversations. The Split That Happened Without a Memo Let's be specific about what the "AI Systems Architect" half of the role actually means, because people either over-mystify it or undersell it. It doesn't mean you become a data scientist. It doesn't mean you're fine-tuning models or writing PyTorch. Those are real jobs — they're just different jobs. 
What it means is something more operational and less glamorous: you are now responsible for designing, maintaining, and improving the systems of AI assistance that your team works inside of, not just the code that the team produces. That sounds abstract until you break it into daily decisions. Which tasks should be fully AI-generated versus AI-assisted versus AI-reviewed only? Where are your model's blind spots for your specific codebase, and how do you account for them in code review? When a junior developer on your team gets a plausible-but-wrong architectural suggestion from an AI assistant, what's the escalation path? How do you measure the quality of your team's prompting over time? These aren't rhetorical questions — they're operational ones that live teams are answering right now, often badly, because no one assigned anyone to own them. Senior developers are getting assigned to own them. Not officially. Not with updated job descriptions. Just through the ordinary mechanism of "this problem needs solving, and you're the most experienced technical person in the room." What "AI Systems Architect" Actually Means Day to Day The phrase sounds bigger than the practice. What it actually breaks down to is four interconnected responsibilities that are now landing on senior developers, whether they want them or not. First: workflow design. Someone has to decide which parts of the development cycle use AI assistance, at what level of autonomy, and with what human checkpoints. At most companies, this currently happens by accident — everyone develops their own habits, and nobody compares notes. The developers who are stepping into the architect half of the role are the ones making that deliberate, rather than emergent. Second: model selection and configuration. Not fine-tuning, but product-level decisions: which models for which tasks, what context window strategy, how to handle codebases that exceed context limits, what fallback behavior looks like. 
These are practical engineering decisions that live in the space between "developer tool choice" and "infrastructure decision." They belong to senior engineers. Third: quality governance. AI-generated code introduces a new failure mode: plausible-looking outputs that are subtly wrong. The patterns of wrongness are specific and learnable. Senior developers who have mapped the failure modes of their AI tooling — the kinds of edge cases it consistently misses, the naming convention assumptions it gets backward, the security patterns it handles confidently and incorrectly — are providing a form of institutional knowledge that is genuinely hard to replace. Fourth: team prompting culture. This is the one nobody talks about at conferences yet, but engineering managers across the industry have been mentioning it consistently over the past six months: the quality variance in how different team members prompt their AI tools is enormous, and it compounds. Senior developers who build and maintain shared prompt libraries, who do prompt review the way they do code review, who can diagnose why a colleague got a bad output — those developers are operating as a force multiplier for the entire team, not just themselves. The Job Description Before and After: A Concrete Comparison This is worth making explicit. Analysis of actual senior engineer job postings — anonymized, from companies between 80 and 1,200 employees — shows a clear shift when comparing what the role requirements looked like in early 2023 versus what's being written now. The change is real and measurable. The pattern across all of it: the what of the role hasn't changed so much as the how and the governance around it. Senior developers are still responsible for the same categories of work. They're now also responsible for the design of the AI-assisted systems that help a team do that work, and for the failure modes those systems introduce. 
The New Core Competency Stack Here's what the competency model looks like in practice when you lay it out. The traditional side should feel familiar. The AI architecture side probably contains a few items you haven't formally owned yet — but if you've been doing this job for more than two years and paying attention, you've been building these skills without realizing it. The Salary Premium Is Already Real Compensation data lags reality by about eighteen months, so take specific numbers here with appropriate skepticism. What industry reporting suggests is that a clear pattern is emerging: developers who can demonstrably operate in both halves of the new role — not just use AI tools personally, but architect AI-assisted workflows for a team — are commanding a premium that's running somewhere between 18% and 31% above their single-track counterparts at the same years-of-experience mark. That range is wide. The premium is highest in companies that have recently invested in AI transformation initiatives and learned, the hard way, that "everyone uses Copilot" is not the same as "we have a coherent AI engineering strategy." Those companies are specifically recruiting for systems architect skills because they've already paid for the gap. How to Build the Second Half of the Job Nobody teaches this in a course yet. There are some good books and a growing number of blog posts, but the skills are mostly developed through deliberate practice and iteration. Based on teams that have successfully made this transition, here's what works. The starting point is mapping your team's current AI-assisted work honestly. Not aspirationally — honestly. Which tasks are you and your team currently doing with AI assistance? Where does the output go without sufficient review? What are the categories of error you've caught, and what categories might you be missing? This audit, done once and updated quarterly, is the foundation of a governance practice. 
From there, the most leveraged thing most senior developers can do is build a shared prompt library for their most common task types. Not a personal one — a shared one, with a versioning and review practice attached. The discipline of reviewing a colleague's prompt and explaining why it produced a wrong output is one of the fastest ways to build the mental model you need for the governance half of the role.
On March 28, 2024, a Microsoft engineer named Andres Freund noticed something almost nobody would have bothered chasing: SSH logins on a system he was benchmarking were taking 500 milliseconds instead of the usual 100. He ran a memory profiler out of irritation more than suspicion, traced the slowdown to liblzma, the compression library bundled with xz-utils, and within a day had uncovered a backdoor planted by a maintainer who'd spent roughly two years earning the trust required to slip it in. The resulting CVE, 2024-3094, drew a perfect CVSS score of 10.0. It also handed the software security world an uncomfortable case study, one I still bring up whenever someone tells me their SBOM program has supply-chain risk handled. Here's why it's uncomfortable: an SBOM generated against the compromised xz-utils 5.6.1 release would have listed exactly that — xz-utils, version 5.6.1 — and it would have been completely accurate. The component was real, the version was real, and the entry would have sailed through every automated check looking for known-bad packages, because nobody knew it was bad yet. The malicious code wasn't an undisclosed dependency. It was hidden inside the build instructions of a package everyone already trusted, smuggled in through doctored upstream release tarballs rather than the public git history reviewers were actually watching. The ingredient list was correct. The ingredient was poisoned. Those are different problems, and conflating them is how organizations end up with a false sense of coverage. What the List Actually Buys You I don't want to undersell SBOMs here, because the underlying idea is sound and the win is real when an incident actually hits. When Log4Shell detonated in December 2021, the organizations that recovered fastest weren't necessarily the most sophisticated — they were the ones who could answer "where does Log4j live in our environment" in minutes instead of weeks, because someone had already built the inventory. 
That's the entire value proposition in one sentence: an SBOM turns "do we use this component, and where" from an open-ended archaeology project into a query. That value is now backed by regulatory teeth on both sides of the Atlantic. U.S. Executive Order 14028 pushed federal software vendors toward SBOM delivery starting in 2021, and the EU's Cyber Resilience Act has since raised the stakes for anyone selling software with digital elements into the European market: vulnerability and incident reporting obligations begin September 11, 2026, and the full SBOM and secure-by-design requirements land on December 11, 2027, backed by fines that can reach €15 million or 2.5 percent of global turnover. Compliance teams I talk to are treating this less as a paperwork exercise and more as a forcing function, which is the right instinct. But forcing functions only produce good outcomes if people understand what the artifact actually does — and what it was never built to do. When the Ingredient List Becomes the Worm If xz-utils illustrates a poisoned ingredient sitting still inside a static list, the npm ecosystem spent the back half of 2025 demonstrating what happens when the poison starts moving on its own. On September 15, security researchers identified a self-replicating piece of malware that came to be called Shai-Hulud, which spread by stealing developer credentials and npm publishing tokens, then using those tokens to inject itself into every other package the compromised maintainer had access to — silently republishing trojanized versions across the registry. It traced back to an account-takeover incident from late August known as the s1ngularity/Nx compromise, and by the time researchers had mapped it, more than 500 packages had been touched, including infrastructure used by CrowdStrike. 
Unit 42 later assessed, with moderate confidence, that the malicious shell script itself had been drafted with the help of an LLM — based on the comments and emoji left in the code, which is the kind of detail that makes this beat simultaneously fascinating and exhausting to cover. The worm didn't stay down. A second wave — Shai-Hulud 2.0 — surfaced in late November 2025, this time executing during the pre-install phase rather than post-install, which widened its reach into CI/CD pipelines well before any human reviewed the package contents. By the time defenders had a handle on it, the campaign had touched more than 25,000 GitHub repositories across roughly 350 accounts. Sonatype's 2026 State of the Software Supply Chain report puts the broader trend in context: more than 454,000 newly identified malicious packages in 2025 alone, pushing the cumulative known total past 1.2 million across npm, PyPI, and similar registries — a haul that reportedly even included output from North Korea's Lazarus Group, which alone published several hundred trojanized npm packages over the year. This is where the metaphor in this piece's title stops being a metaphor. An SBOM is a snapshot taken at build time. A self-propagating worm doesn't wait for your next build. By the time your inventory catches up to what's actually running in production, the compromised version may already have spread three hops further than the document describing it. Why Signing and Provenance Close Part of the Gap The honest fix isn't a better SBOM. It's pairing the SBOM with proof of where the artifact actually came from, which is what the Sigstore project and the SLSA framework exist to provide. 
Sigstore's components do three specific jobs: Fulcio issues short-lived signing certificates tied to a developer or CI identity via OIDC, instead of the long-lived private keys that inevitably end up mismanaged; Cosign signs and verifies the resulting artifacts; and Rekor records every signing event in a public, append-only transparency log, so a substituted artifact leaves a visible gap rather than a silent one. SLSA layers maturity levels on top of that: Level 2 is now realistic to reach in an afternoon on GitHub Actions, largely because GitHub's native attestation support has matured since 2024, and the Linux Foundation pushed out SLSA 1.2 in late 2025 with more granular tracking for both build and source provenance. Run the GhostAction incident from earlier in 2025 through that lens, and the gap becomes obvious. Attackers compromised a widely used third-party GitHub Action and modified its workflow code to exfiltrate secrets, and because downstream repositories had pinned that action by a mutable version tag rather than an immutable commit SHA, every project referencing @v1 automatically pulled the poisoned update with zero additional effort from the attacker. Signed provenance tied to a specific, verified commit wouldn't have stopped someone from compromising the upstream repository — but it would have made the substitution detectable the moment a consuming pipeline tried to verify what it was actually pulling, instead of trusting a tag that anyone with write access could quietly repoint. What a Mature Pipeline Actually Refuses to Run The pattern I'd point any engineering leader toward right now isn't exotic, it's just rarely implemented end to end: nothing gets promoted unless it clears a gate that checks signature, provenance, and SBOM together, not any one of the three in isolation. 
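A minimal sketch of that combined gate, in Python, may make the "all three together" rule concrete. Everything here is hypothetical: the field names, the approved-builder identity, and the assumption that signature verification and vulnerability scanning have already run upstream and handed their results in. It is not the API of Cosign, SLSA tooling, or any admission controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactEvidence:
    """Results of upstream checks; every field name is an illustrative assumption."""
    signature_valid: bool             # outcome of a signature verification step
    provenance_builder: str           # builder identity claimed in the provenance
    provenance_digest: str            # artifact digest recorded in the provenance
    artifact_digest: str              # digest of the artifact about to be deployed
    flagged_components: tuple = ()    # SBOM components flagged by a fresh scan


# Hypothetical allow-list of build systems trusted to produce releases
APPROVED_BUILDERS = frozenset({"ci.example.internal/release-builder"})


def deploy_gate(ev: ArtifactEvidence) -> tuple[bool, list[str]]:
    """Pass only if signature, provenance, and SBOM checks all succeed."""
    reasons = []
    if not ev.signature_valid:
        reasons.append("signature invalid or missing")
    if ev.provenance_digest != ev.artifact_digest:
        reasons.append("provenance does not describe this artifact")
    if ev.provenance_builder not in APPROVED_BUILDERS:
        reasons.append("provenance names an unapproved builder")
    if ev.flagged_components:
        reasons.append("SBOM lists flagged components: " + ", ".join(ev.flagged_components))
    return (not reasons, reasons)


good = ArtifactEvidence(True, "ci.example.internal/release-builder", "sha256:abc", "sha256:abc")
swapped = ArtifactEvidence(True, "ci.example.internal/release-builder", "sha256:abc", "sha256:def")

print(deploy_gate(good))     # (True, [])
print(deploy_gate(swapped))  # (False, ['provenance does not describe this artifact'])
```

The second case is the interesting one: the signature check alone passes, and only the comparison against provenance exposes the substituted artifact.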
Plain Text
Source Commit
      |
      v
Build System ----generate----> SBOM (CycloneDX/SPDX)
      |
      |--sign via Cosign---> Signature + SLSA Provenance (Rekor log)
      |
      v
Deploy Gate <----checks all three----> [Signature valid? Provenance matches? SBOM clean of known CVEs?]
      |
    PASS --------> Production
      |
    FAIL --------> Blocked, alert raised, artifact quarantined

Notice what that gate is actually doing: it isn't asking "do we have an SBOM," which is a yes/no compliance question. It's asking whether the artifact about to run matches the provenance it claims, whether that provenance traces to an approved build system, and whether the components it declares are still considered safe as of right now rather than as of whenever the document was generated. Kubernetes admission controllers and policy-as-code tools can enforce exactly this today, refusing to schedule any image lacking a valid signature, with human review reserved for the exceptions the policy can't resolve automatically.

The Part Nobody Wants to Hear

SolarWinds remains the cautionary tale everyone reaches for, and fairly so: the absence of meaningful supply-chain visibility let that compromise propagate to roughly 18,000 customers before anyone outside the attackers understood the scope. But I'd argue the more instructive lesson of the past two years is the opposite kind of failure: organizations that have an SBOM, dutifully generated at every release, sitting in a compliance folder nobody has reopened since. Cloudsmith's research into current practice keeps surfacing the same pattern: SBOMs produced once at build time and then never looked at again, which makes them a point-in-time artifact masquerading as an ongoing control.
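The difference between a point-in-time artifact and an ongoing control can be shown with a small sketch: the same stored SBOM, re-evaluated against whatever is known today. The example assumes a CycloneDX-style JSON document with a top-level components array; the advisory sets are hypothetical stand-ins for a real vulnerability feed, and the function name is illustrative.

```python
import json


def find_flagged_components(sbom: dict, advisories: set) -> list:
    """Return identifiers of SBOM components whose (name, version) is in the advisory set."""
    flagged = []
    for component in sbom.get("components", []):
        key = (component.get("name"), component.get("version"))
        if key in advisories:
            # Prefer the package URL when the SBOM provides one
            flagged.append(component.get("purl") or f"{key[0]}@{key[1]}")
    return flagged


# A stored SBOM, generated once at build time
SBOM_JSON = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "xz-utils", "version": "5.6.1",
     "purl": "pkg:generic/xz-utils@5.6.1"},
    {"type": "library", "name": "zlib", "version": "1.3.1"}
  ]
}
"""
sbom = json.loads(SBOM_JSON)

advisories_at_build = set()                  # nothing was known to be bad yet
advisories_today = {("xz-utils", "5.6.1")}   # hypothetical feed after disclosure

print(find_flagged_components(sbom, advisories_at_build))  # []
print(find_flagged_components(sbom, advisories_today))     # ['pkg:generic/xz-utils@5.6.1']
```

The SBOM never changed between the two calls. Only the question asked of it did, which is why the check has to run on a schedule and at the deploy gate, not once at build time.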
My honest prediction for the next eighteen months: the EU's reporting deadline this September is going to force more genuine automation into supply-chain pipelines than three years of SBOM evangelism managed on its own, simply because a 24-hour reporting clock doesn't tolerate a quarterly spreadsheet review. Regulation rarely produces elegant security architecture. It does, reliably, produce urgency — and on this particular problem, urgency has been in short supply for exactly the wrong reason: the list looked complete, so everyone assumed the kitchen was safe.
A multi-SLM platform creates value only when specialization does not introduce a new latency tier. Small language models are inexpensive enough to dedicate to focused work such as extraction, code handling, safety filtering, or short-form reasoning, but that advantage disappears if model selection itself becomes expensive. Research on LLM routing shows that query difficulty varies enough for model choice to materially affect efficiency and quality, and modern serving stacks expose enough control over routing, batching, and cache locality to turn that insight into an operational design rather than an academic one. In practice, the routing layer has to behave like a tiny data-plane decision engine, not like another inference hop. Why Multiple SLMs Need Routing A single small model rarely gives the best latency-quality trade-off for every prompt type. Short structured requests, such as JSON extraction and classification, differ sharply from code repair, and both differ again from prompts that need broader reasoning. RouteLLM describes routing as assigning simpler queries to weaker models and reserving stronger models for harder cases, while FrugalGPT reports that a learned cascade can preserve strong-model quality with very large cost reductions. Although those papers evaluate broader LLM portfolios, the underlying lesson transfers cleanly to a fleet of small specialized models: heterogeneity in request shape makes heterogeneity in model choice economically and operationally rational. That conclusion rules out a router that behaves like another generative model call. RouteLLM explicitly treats effective routing as a pre-decision that minimizes cost and latency relative to broader multi-model execution, which means the dominant path should remain inside in-memory feature extraction and lookup. 
Prompt length, requested output shape, language, code markers, safety category, session identity, and prior cache affinity are all signals that can be computed before any model is invoked. A practical design target is to keep that first decision under a millisecond, so its cost remains far below prefill and decode work. The moment the main path depends on an additional model inference, the latency budget starts competing with the very SLM call it is supposed to optimize.

Keeping the Decision Path Short

The cleanest design is a two-stage router. The first stage is deterministic and resolves obvious cases immediately. A short request demanding strict JSON can go to an extraction model. A prompt containing fenced code, compiler errors, or repository paths can go to a code model. A safety-sensitive request can be pinned to a policy model. Only when simple predicates fail to produce a confident mapping should the second stage run, and that second stage should be a lightweight complexity scorer rather than another generator. Ray Serve’s request-routing API is built around this kind of custom replica selection, and its FIFO mixin is specifically intended for algorithms that can route requests as soon as they arrive without waiting for content-heavy processing. That is the right shape for an ultra-low-latency router: deterministic fast path first, optional scorer second.

A routing metadata object makes that design practical because it compresses request interpretation into cheap primitives:

Java
record RoutingContext(
    int tokenCount,
    boolean codeRequest,
    boolean structuredOutput,
    String language,
    boolean repeatedPrefix,
    double complexityScore
) {}

This record is deliberately plain. Primitive fields are cheap to serialize, cheap to log, and easy to replay during debugging. That choice aligns with PyTorch and vLLM production notes on disaggregated serving, where complex metadata objects in scheduler paths increased serialization cost and hurt inter-token behavior, and it fits the general shape of request routers that repeatedly rank candidate replicas under load. The complexityScore field should therefore come from a compact classifier or calibrated heuristic trained offline on task outcomes, escalation rates, or preference labels, not from a runtime SLM call. The router’s intelligence belongs in the thresholds and features, not in an extra generation step.

The routing function should then read like admission control rather than orchestration:

Java
ModelTarget route(RoutingContext ctx) {
    if (ctx.structuredOutput() && ctx.tokenCount() < 800) return ModelTarget.EXTRACTION_SLM;
    if (ctx.codeRequest()) return ModelTarget.CODE_SLM;
    if (ctx.complexityScore() > 0.72) return ModelTarget.REASONING_SLM;
    if (ctx.repeatedPrefix()) return ModelTarget.GENERAL_SLM_CACHE_HOT;
    return ModelTarget.GENERAL_SLM;
}

The important detail is ordering. The cheapest predicates run first, the optional scorer appears only after clear task signals have been checked, and cache affinity refines the generic path instead of overriding obvious specialization. That mirrors how high-performance request routers rank candidates and then filter out replicas that are already saturated. Thresholds should be calibrated from observed latency and task-success data, but the architectural rule is stable: most traffic should leave the router with a decision produced entirely from fields already in memory.

Making Selection Cache-Aware

Cache-aware selection is where routing often starts to produce visible latency gains.
vLLM’s automatic prefix caching reuses KV cache from earlier queries when a new request shares the same prefix, allowing shared prompt computation to be skipped, and its design notes describe prefix caching as close to a free lunch because it avoids redundant work without changing outputs. SGLang reaches a similar result with RadixAttention, which keeps reusable KV state in a radix tree, adds LRU eviction, and applies cache-aware scheduling to improve hit rate while introducing only negligible overhead when no cache hit occurs. That combination matters because a fast model on a warm prefix can easily outperform a nominally better model on a cold path. Routing without cache awareness, therefore, leaves substantial latency savings on the table. That is why a field such as repeatedPrefix, promptFamilyId, or session hash belongs in the routing context. Ray Serve exposes locality-aware and multiplex-aware helpers so that requests can prefer nearby replicas or replicas that already hold the relevant model, and Meta’s PyTorch and vLLM production write-up reports that sticky routing of the same session to the same prefill host significantly boosts prefix-cache hit rate, reaching 40% to 50% hit rate in the described deployment. The practical lesson is broader than that specific architecture. Similar prompt families should be steered toward the same warm replicas whenever possible, even if a purely load-balanced policy would have spread them evenly. Equal distribution is not the same thing as minimal latency once KV reuse becomes available. Keeping the System Fast in Production Once the routing logic is correct, the queueing policy and replica shape become the next sources of latency. Triton documents that dynamic batching combines requests to maximize throughput and allows bounded queue delay, while concurrent model execution and instance groups allow multiple copies of the same model to run in parallel on selected devices. 
That argues for selective rather than universal batching. Short extraction or moderation SLMs often benefit from aggressive batching because their service time is small and predictable, while interactive reasoning models need tighter queue-delay bounds to prevent batching from inflating p95 latency. Replica placement matters as well. Heavy or frequently chosen models deserve more parallel instances, and cold-start penalties should be reduced through explicit warmup, since Triton notes that model warmup can prevent the slow initial inferences seen before a model is fully initialized. Backpressure and observability complete the design. Ray Serve supports bounded queues and load shedding through max_queued_requests, and its autoscaling guidance ties lower ongoing-request targets to tighter latency objectives. Ray Serve LLM also exposes request latency, throughput, TTFT, and TPOT, while Triton exposes Prometheus metrics for GPU and request behavior. Those signals should be segmented by routed model, decision path, cache-hit class, and warm versus cold replica so that routing regressions become visible before they surface as user-facing tail latency. Without route-level telemetry, an apparently accurate router can quietly push traffic onto cold replicas, oversized queues, or cache-miss-heavy paths. In a low-latency SLM system, observability is not just for debugging. It is the only reliable way to keep routing policy aligned with actual serving behavior. Conclusion An ultra-low-latency routing layer for multiple SLMs is best treated as a serving primitive rather than as a separate intelligence feature. The strongest design keeps most requests on a deterministic first stage, invokes a lightweight complexity scorer only for ambiguous prompts, represents route state with compact metadata, and treats prefix locality as a first-class selection signal. 
Around that core, warm replicas, selective batching, bounded queues, and route-level observability determine whether specialization actually improves latency or merely rearranges it. When routing is cheaper than a single token step and cache locality is preserved instead of ignored, a multi-SLM system stops looking like a collection of models and starts behaving like a disciplined low-latency inference fabric.
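As a closing illustration of treating prefix locality as a selection signal, the sketch below pins each session or prompt family to a stable replica using rendezvous (highest-random-weight) hashing. It is written in Python for brevity, is not how Ray Serve, vLLM, or SGLang implement affinity, and ignores load entirely; the function and replica names are hypothetical. A production router would combine a preference like this with the saturation checks described earlier.

```python
import hashlib


def pick_replica(affinity_key: str, replicas: list) -> str:
    """Rendezvous hashing: score every replica for this key and take the highest.

    The same key keeps mapping to the same replica for as long as that replica
    stays in the list, so its warm KV cache keeps being reused.
    """
    def weight(replica: str) -> int:
        digest = hashlib.sha256(f"{affinity_key}|{replica}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=weight)


replicas = ["general-slm-0", "general-slm-1", "general-slm-2"]
session = "session:4f2a"  # could equally be a prompt-family id or prefix hash

first = pick_replica(session, replicas)
again = pick_replica(session, replicas)
print(first == again)  # True: repeated requests land on the same warm replica

# Removing a replica the session was not using does not move the session
others = [r for r in replicas if r != first]
survivors = [first, others[0]]
print(pick_replica(session, survivors) == first)  # True
```

The second property is the reason for preferring rendezvous hashing over a simple modulo: scaling the replica set only reshuffles the keys that were assigned to the replicas that changed.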
When we demo an agentic product, it always looks clean and clear: the agent pauses, the human approves or rejects, and execution continues. But what happens when the human actually says no?

Human-in-the-loop (HITL) sounds like a single feature. In practice, it covers a wide design space:

- Do you pause mid-execution or notify asynchronously?
- Is the human a peer agent or an external approver?
- Can the human edit the action, or only approve or reject it?
- Does the framework resume execution exactly where it paused, or does it resume some other way?

The major agent frameworks answer these questions differently, and those answers have very real production consequences. I assumed that all frameworks would converge on a single pattern for HITL design, but I found them to be very different. This article compares six frameworks and their implementations of HITL.

What You Will Learn

By the end of this article, you will be able to:

- Distinguish the three fundamental HITL patterns (durable graph interrupt, message-loop injection, and blocking gate) and know which framework implements each.
- Read working code for all six frameworks and understand exactly where each one pauses execution and how it resumes.
- Pick the right framework for your use case.

The Fundamental Divide

The three distinct HITL patterns are:

- Durable graph interrupt: The framework serializes the entire graph state and suspends at the exact node where approval was needed. Nothing happens until a decision is made. Because the state is saved in an external checkpointer, the run can resume from the point of suspension even if the process exits.
- Message loop injection: There is no suspension as such. The human acts as a first-class participant in a multi-agent conversation, contributing a reply like any other agent. The loop runs continuously, and the human response is just another round.
- Blocking gate/run-termination: The framework blocks or ends the run cleanly at a designated point, either blocking in process until the caller responds or terminating and returning an approval-pending object that the human needs to resolve before resuming. Resuming the run is the human's responsibility.

framework | pattern | true suspension | human can edit action | resumable after process restart
--- | --- | --- | --- | ---
deepagents | Graph interrupt (LangGraph) | ✓ | ✓ approve / edit / reject | ✓
Agno | HumanReview on Step/Loop | ✓ | Partial | ✗
AutoGen | UserProxy agent (message loop) | ✗ | ✓ via messages | ✗
OpenAI Agents SDK | needs_approval interrupt | Partial | ✗ | Partial
CrewAI | step_callback + human_input on Task | ✗ | ✗ | ✗
Pydantic AI | Deferred tools (requires_approval) | Partial | ✗ | ✗

deepagents + LangGraph: graph-level interrupt

Installation:

Shell
pip install deepagents langgraph
# Python >=3.10 required
# Docs: https://docs.langchain.com/oss/python/deepagents/human-in-the-loop

[Figure: deepagents resume/interrupt sequence]

deepagents uses LangGraph's interrupt() primitive. When the model produces a tool call that requires approval, execution suspends at that exact graph node. The serialized state is stored via a LangGraph checkpointer; the process can exit entirely and resume hours later.
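Before the framework-specific wiring, the durable-interrupt idea itself can be shown with a toy, framework-free sketch. This is not LangGraph's implementation or API: the step list, the JSON checkpoint file, and the run function are all invented for illustration. The point is only that the state is written to durable storage at the pause, so a later call (potentially in a different process) can pick up exactly where the first one stopped.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical plan: one step requires human approval
STEPS = [
    {"name": "list_files", "needs_approval": False},
    {"name": "delete_file", "needs_approval": True},
    {"name": "report", "needs_approval": False},
]


def run(checkpoint: Path, decision: Optional[str] = None) -> dict:
    """Run steps in order; persist state and stop at a step awaiting approval."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())  # resume from saved state
    else:
        state = {"next": 0, "log": []}
    while state["next"] < len(STEPS):
        step = STEPS[state["next"]]
        outcome = "ran"
        if step["needs_approval"]:
            if decision is None:
                # Suspend: save everything needed to resume, then return
                state["status"] = "awaiting_approval"
                checkpoint.write_text(json.dumps(state))
                return state
            outcome = "ran" if decision == "approve" else "skipped"
            decision = None  # a decision applies to one pending step only
        state["log"].append(f"{step['name']}:{outcome}")
        state["next"] += 1
    state["status"] = "done"
    checkpoint.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "thread-session-123.json"
    paused = run(ckpt)                       # stops before delete_file
    resumed = run(ckpt, decision="reject")   # reloads state, applies the decision

print(paused["status"], paused["log"])
# awaiting_approval ['list_files:ran']
print(resumed["status"], resumed["log"])
# done ['list_files:ran', 'delete_file:skipped', 'report:ran']
```

Note that the second call never re-runs list_files: the checkpoint, not the caller, remembers how far the run got. That is the property the "resumable after process restart" column in the table is describing.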
Wiring Up the Middleware

Python
from deepagents import create_deep_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, InterruptOnConfig

hitl = HumanInTheLoopMiddleware(
    interrupt_on={
        # True = approve / edit / reject all allowed
        "delete_file": True,
        # Restrict to approve/reject only, with static description
        "run_bash": InterruptOnConfig(
            allowed_decisions=["approve", "reject"],
            description="Review this shell command before execution",
        ),
        # Dynamic description generated from the tool call at runtime
        "send_email": InterruptOnConfig(
            allowed_decisions=["approve", "edit", "reject"],
            description=lambda tool_call, state, runtime: (
                f"Approve sending email to: {tool_call['args'].get('to')}"
            ),
        ),
    }
)

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[hitl],
)

What the Reviewer Sees (HITLRequest Structure)

Python
# Surfaced to the reviewer when delete_file is triggered
{
    "action_requests": [
        {
            "name": "delete_file",
            "args": {"path": "/workspace/output.log"},
            "description": "Tool execution requires approval\n\nTool: delete_file\nArgs: ..."
        }
    ],
    "review_configs": [
        {
            "action_name": "delete_file",
            "allowed_decisions": ["approve", "edit", "reject"]
        }
    ]
}

The Three Decision Types

Python
from langgraph.types import Command

# `agent` is the compiled graph returned by create_deep_agent above.
# The thread_id must match the one used for the interrupted run.

# Approve: run as-is
agent.invoke(
    Command(resume={"decisions": [{"type": "approve"}]}),
    config={"configurable": {"thread_id": "session-123"}},
)

# Edit: change args before running
agent.invoke(
    Command(resume={
        "decisions": [{
            "type": "edit",
            "edited_action": {
                "name": "delete_file",
                "args": {"path": "/workspace/old-backup.log"}
            }
        }]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

# Reject: agent receives explanation and stops retrying
agent.invoke(
    Command(resume={
        "decisions": [{
            "type": "reject",
            "message": "Do not delete production logs. Archive instead."
        }]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

Multi-Tool Batching

If the model calls two tools in the same response, deepagents batches them into a single HITLRequest. One round-trip handles both:

Python
# Single interrupt, two pending actions resolved together (one decision per action, in order)
agent.invoke(
    Command(resume={
        "decisions": [
            {"type": "approve"},
            {"type": "reject", "message": "rm -rf is too broad. Use a specific path."}
        ]
    }),
    config={"configurable": {"thread_id": "session-123"}},
)

AutoGen v0.4: the UserProxy pattern

Installation:

Shell
pip install autogen-agentchat autogen-ext
# Docs: https://microsoft.github.io/autogen/stable/

AutoGen models the human as a UserProxyAgent, a peer participant in a multi-agent conversation. There is no suspension. The loop runs continuously, and the human's turn comes when the proxy injects a message.

[Figure: AutoGen message-loop HITL]

Python
import asyncio

from autogen_agentchat.agents import AssistantAgent, UserProxyAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

assistant = AssistantAgent(
    "assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o"),
    system_message=(
        "You are a helpful agent. Always describe what you are about to do "
        "and ask for confirmation before executing file operations."
    ),
)

# input_func is called when the proxy needs human input
# Replace `input` with an async queue for web applications
user_proxy = UserProxyAgent("human", input_func=input)

team = RoundRobinGroupChat(
    participants=[assistant, user_proxy],
    termination_condition=TextMentionTermination("DONE"),
)


async def main() -> None:
    # team.run is a coroutine, so it needs an event loop outside a notebook
    await team.run(task="Clean up old log files in /tmp")


asyncio.run(main())

Limitation: The conversation loop never truly suspends. If the process exits mid-conversation, the state is lost. For async web UIs, you'd need a background thread and an asyncio queue to bridge human input. There's no built-in checkpointing.
Agno: HumanReview on Steps
Installation:
Shell
pip install agno
# Docs: https://docs.agno.com/reference/workflows/step

Agno's HITL uses a HumanReview config object attached to workflow steps. It supports confirmation gates before execution, user input collection, and post-execution output review:
Python
from agno.workflow import Workflow, Step
from agno.workflow.types import HumanReview
from agno.agent import Agent
from agno.models.anthropic import Claude

extract_agent = Agent(name="Extractor", model=Claude(id="claude-haiku-4-5"), ...)
transform_agent = Agent(name="Transformer", model=Claude(id="claude-haiku-4-5"), ...)
load_agent = Agent(name="Loader", model=Claude(id="claude-sonnet-4-6"), ...)

workflow = Workflow(
    name="DataPipeline",
    steps=[
        Step(name="extract", agent=extract_agent),
        Step(name="transform", agent=transform_agent),
        Step(
            name="load",
            agent=load_agent,
            # Pause and require human confirmation before this step runs
            human_review=HumanReview(requires_confirmation=True),
        ),
        Step(
            name="verify",
            agent=load_agent,
            # Pause after execution for a human to review the output
            human_review=HumanReview(requires_output_review=True),
        ),
    ],
)

HumanReview fields:

field | scope | what it does
requires_confirmation | Step, Loop, Router, Condition | Pause before the step executes
confirmation_message | Step, Loop, Router, Condition | Custom prompt shown to the reviewer
requires_user_input | Step, Router | Collect freeform user input before continuing
requires_output_review | Step, Router | Pause after execution; accepts bool or Callable[[StepOutput], bool] for conditional review
requires_iteration_review | Loop only | Review after each loop iteration
on_reject | All | OnReject.skip (default), cancel, or retry (re-run the step with human feedback)
on_error | All | OnError.pause triggers HITL on step failure - human decides retry or skip
timeout / on_timeout | All | Timeout in seconds; on_timeout is cancel (default), skip, or approve

Resumability: Agno has no workflow-level checkpoint equivalent to LangGraph's checkpointer.
If the process exits while a step is awaiting human input, the workflow state is lost. Resumability requires external session storage wired by the caller. OpenAI Agents SDK: needs_approval interrupt Installation: Shell pip install openai-agents # Docs: https://openai.github.io/openai-agents-python/ The OpenAI Agents SDK uses a needs_approval parameter on function_tool. When set, the run loop pauses and surfaces a ToolApprovalItem that the caller approves or rejects via RunState: Python from agents import Agent, function_tool, Runner @function_tool(needs_approval=True) def delete_file(path: str) -> str: """Delete a file at the given path.""" import os os.remove(path) return f"Deleted {path}" # needs_approval can also be a callable for conditional approval @function_tool( needs_approval=lambda ctx, args, call_id: args.get("path", "").startswith("/prod") ) def write_file(path: str, content: str) -> str: """Write content to a file.""" with open(path, "w") as f: f.write(content) return f"Wrote {path}" agent = Agent( name="FileAgent", instructions="Help the user manage files.", tools=[delete_file, write_file], ) async def run_with_approval(): result = await Runner.run(agent, "Delete the old backup file") if result.interruptions: # Convert result to a resumable state, then resolve each pending approval state = result.to_state() for item in result.interruptions: print(f"Approve {item.raw_item.name}({item.raw_item.arguments})? [y/N]: ", end="") if input().strip().lower() == "y": state.approve(item) else: state.reject(item, rejection_message="User rejected this action") # Resume: pass the mutated state back to Runner result = await Runner.run(agent, state=state) print(result.final_output) Limitation: The approval flow is approve-or-reject only. There's no structured "edit" decision type. Humans cannot modify tool arguments through the SDK's approval mechanism. 
Partial cross-restart resumability is available via state.to_string() / RunState.from_string() and the human is responsible for persisting and restoring the serialized state externally. CrewAI: step_callback + human_input Installation: Shell pip install crewai # Docs: https://docs.crewai.com/en/concepts/crews CrewAI has two distinct mechanisms with very different semantics. step_callback — Observational Only step_callback fires after each agent step and receives an AgentAction | AgentFinish object. It cannot block or modify the next step: Python from crewai import Agent, Crew, Task from crewai.agents.crew_agent_executor import AgentAction, AgentFinish def review_step(step: AgentAction | AgentFinish) -> None: if isinstance(step, AgentAction): print(f"Tool used: {step.tool}, input: {step.tool_input}") elif isinstance(step, AgentFinish): print(f"Agent finished: {step.return_values}") researcher = Agent( role="Researcher", goal="Research the topic", backstory="An expert researcher.", verbose=True, ) task = Task(description="Research quantum computing trends", agent=researcher) crew = Crew( agents=[researcher], tasks=[task], step_callback=review_step, ) crew.kickoff() human_input — Blocking Task-Output Review Setting human_input=True on a Task does produce a real synchronous pause. 
After the agent finishes its work for that task, execution blocks on input() , and the human can provide free-form feedback before the output is finalized: Python from crewai import Agent, Crew, Task researcher = Agent( role="Researcher", goal="Research the topic", backstory="An expert researcher.", verbose=True, ) task = Task( description="Research quantum computing trends and summarize findings.", expected_output="A summary of the latest quantum computing developments.", agent=researcher, human_input=True, # blocks after agent finishes, before output is accepted ) crew = Crew(agents=[researcher], tasks=[task]) crew.kickoff() # Agent completes its work, then execution pauses: # > Please provide feedback on the agent's output (or press Enter to accept): Key distinction from tool-call-level gates: human_input fires after the agent has already finished the task and all tool calls have already executed. You are reviewing the output, not approving individual actions before they run. The human provides free-form text feedback. There is no structured approve/edit/reject schema, no async queue support, and no state serialization. Because it calls input() directly, it blocks the calling thread, and it is incompatible with async web servers (FastAPI, Starlette) without bridging to a separate thread and queue. Pydantic AI: Deferred Tools Installation: Shell pip install pydantic-ai # Docs: https://pydantic.dev/docs/ai/tools-toolsets/deferred-tools/ Pydantic AI has a first-class HITL primitive called Deferred Tools. Mark a tool with requires_approval=True (or raise ApprovalRequired conditionally) and the agent run terminates with a DeferredToolRequests object instead of a final answer. The caller resolves approvals and resumes with the original message history. 
Pydantic AI deferred-tool approval sequence Declaring Tools That Require Approval Python from pydantic_ai import Agent from pydantic_ai.exceptions import ApprovalRequired agent = Agent("anthropic:claude-sonnet-4-6") # Always requires approval @agent.tool(requires_approval=True) async def delete_file(ctx, path: str) -> str: import os os.remove(path) return f"Deleted {path}" # Conditional: only requires approval for destructive commands @agent.tool async def run_bash(ctx, command: str) -> str: import subprocess risky = ["rm", "drop", "truncate"] if any(r in command for r in risky): raise ApprovalRequired(metadata={"reason": "destructive command detected"}) return subprocess.check_output(command, shell=True).decode() Handling DeferredToolRequests and Resuming Python from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolDenied async def run_with_approval(): result = await agent.run("Delete the old backup file") if isinstance(result.output, DeferredToolRequests): approvals = {} for tool_call in result.output.approvals: print(f"Approve {tool_call.tool_name}({tool_call.args})? [y/N]: ", end="") if input().strip().lower() == "y": approvals[tool_call.tool_call_id] = True else: # ToolDenied lets you pass a custom message back to the model approvals[tool_call.tool_call_id] = ToolDenied( message="User rejected this action - do not retry." 
) # Resume: pass original message history + approval decisions result = await agent.run( message_history=result.all_messages(), deferred_tool_results=DeferredToolResults(approvals=approvals), ) print(result.output) Inline resolution with HandleDeferredToolCalls For cases where you want to resolve approvals within the same run (e.g., a CLI prompt that doesn't need to persist state), use the HandleDeferredToolCalls capability: Python from pydantic_ai.capabilities import HandleDeferredToolCalls from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolDenied async def interactive_approver(ctx, requests: DeferredToolRequests) -> DeferredToolResults: approvals = {} for tool_call in requests.approvals: print(f"Approve {tool_call.tool_name}({tool_call.args})? [y/N]: ", end="") if input().strip().lower() == "y": approvals[tool_call.tool_call_id] = True else: approvals[tool_call.tool_call_id] = ToolDenied(message="Rejected by user.") return DeferredToolResults(approvals=approvals) agent = Agent( "anthropic:claude-sonnet-4-6", tools=[delete_file, run_bash], capabilities=[HandleDeferredToolCalls(interactive_approver)], ) Limitation: There is no durable state serialization. If the process exits between the first run (which returns DeferredToolRequests) and the resume, the run cannot be recovered. The caller(human) must persist result.all_messages() and the pending tool call IDs externally. There is also no structured "edit" decision type; the human can approve or deny, but cannot modify tool arguments through the SDK. 
Choosing the Right Pattern

use case | best fit
Long-running agent, async human reviewer with durable resume | deepagents only
Human needs to edit tool args before execution | deepagents only
Step-level gates with on_reject retry loops, no durable resume | Agno HumanReview
Conversational co-pilot, real-time back-and-forth | AutoGen
Approve/reject specific tools, run stays in-process | OpenAI Agents SDK
Approve/reject specific tools, run terminates for async handling | Pydantic AI Deferred Tools
Audit logging, no blocking needed | CrewAI step_callback
Review task output after agent finishes (not tool-call level) | CrewAI human_input=True

Conclusion
There is no universal answer to HITL in agent frameworks. The right choice depends on three questions to answer before choosing a framework: at what granularity a human needs to intervene (tool call, step, or task output), whether the reviewer responds in real time or hours later, and whether the process must survive a restart between the interrupt and the resume. If the reviewer responds hours later or the process must survive a restart, deepagents with a LangGraph checkpointer is the only framework covered here that handles both today. For everything else, the landscape is richer than it first appears: Pydantic AI's Deferred Tools give you structured tool-call-level approval without a graph runtime; Agno gives you powerful step-level gates with retry semantics; and OpenAI Agents SDK gives you the simplest possible approve/reject path when you control the process lifecycle. The mistake most teams make is treating HITL as an afterthought. The primitives each framework exposes are not interchangeable, and switching from an observational callback to a durable interrupt requires rearchitecting the execution model, not just swapping a parameter. The table above is meant to surface that choice before it becomes expensive to undo.
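To make the difference between an observational callback and a durable interrupt concrete, here is a minimal, framework-agnostic sketch of the run-termination pattern. It is not any library's API: PendingApproval, request_approval, resume, and the edited_args key are hypothetical names invented for illustration. The point is only that the pending action is serialized to plain JSON, so the process can exit between the interrupt and the resume.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PendingApproval:
    """Hypothetical record returned when a run stops for human review."""
    thread_id: str
    tool_name: str
    tool_args: dict


def delete_file(path: str) -> str:
    # Stand-in tool: reports the action instead of touching the filesystem.
    return f"deleted {path}"


TOOLS = {"delete_file": delete_file}


def request_approval(thread_id: str, tool_name: str, tool_args: dict) -> str:
    """End the run and return a JSON blob the caller can persist anywhere."""
    return json.dumps(asdict(PendingApproval(thread_id, tool_name, tool_args)))


def resume(serialized: str, decision: dict) -> str:
    """Resume from persisted state with an approve / edit / reject decision."""
    pending = PendingApproval(**json.loads(serialized))
    kind = decision["type"]
    if kind == "reject":
        return f"rejected: {decision.get('message', 'no reason given')}"
    if kind == "edit":
        args = decision["edited_args"]  # hypothetical key for the edited arguments
    elif kind == "approve":
        args = pending.tool_args
    else:
        raise ValueError(f"unknown decision type: {kind}")
    return TOOLS[pending.tool_name](**args)


# The blob could be written to a database and read back after a restart.
blob = request_approval("session-123", "delete_file", {"path": "/workspace/output.log"})
print(resume(blob, {"type": "approve"}))
```

An observational callback has no equivalent of the blob: by the time it fires, the action has already run, which is why moving between the two models changes the execution flow and not just a parameter.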
Modern semantic search, retrieval-augmented generation (RAG) pipelines, and large-scale recommendation models rely heavily on embeddings: transformations of natural language text into dense numeric representations called vectors. These embeddings position semantically related text in nearby regions of vector space, which enables similarity computation through distance metrics such as cosine similarity or Euclidean distance. Cloud-hosted models such as OpenAI's text-embedding-ada-002 provide high-quality vector encodings, but they come with API keys, network latency, and per-token usage costs. In contrast, LocalEmbeddingService does all the computation within the host process: no GPUs, no outbound requests, no model files to manage. The method it uses is called the hashing trick (or feature hashing). The same technique is implemented in scikit-learn’s HashingVectorizer.

1. Contract: IEmbeddingService
C#
public class LocalEmbeddingService : IEmbeddingService
{
    public int Dimensions => 512;

The service creates 512-dimensional float vectors. This is intentional. It is large enough to capture a document's vocabulary yet small enough for in-memory dot-product similarity searches across millions of vectors. The dimension count can be increased to 1024 or 2048, at the cost of additional CPU and memory usage.

2. Stop Words
C#
private static readonly HashSet<string> StopWords =
    new(StopAnalyzer.ENGLISH_STOP_WORDS_SET, StringComparer.OrdinalIgnoreCase);

Stop words are common high-frequency words like “and”, “the”, “is”, and “while”. They carry little or no semantic information, but can heavily influence the vectorized output if they are not filtered. Instead of hardcoding a list, the code above uses the Lucene.NET NuGet package, which ships a predefined set, StopAnalyzer.ENGLISH_STOP_WORDS_SET. It is well curated and validated. The set is wrapped in a HashSet<string> with OrdinalIgnoreCase, which provides fast case-insensitive lookup without any extra allocation at query time.

3. Text Cleaning: Tokenization
C#
private static Dictionary<string, int> Tokenize(string text)
{
    var freq = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    var tokens = text
        .ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\n', '\r', ',', '.', '!', '?', ';', ':', '"', '\'', '(', ')', '[', ']', '{', '}', '-', '_', '/', '\\' },
               StringSplitOptions.RemoveEmptyEntries)
        .Where(t => t.Length > 2 && !StopWords.Contains(t));
    foreach (var token in tokens)
        freq[token] = freq.GetValueOrDefault(token) + 1;
    return freq;
}

Tokenization is the first step of text cleaning, and every input passes through it. It does three things:
Lowercasing: All words are converted to lower case, so “System” and “system” are treated as the same term.
Split based on delimiter/punctuation: Each delimiter or punctuation character is treated as a word boundary. “top-of-the-line” becomes [“top”, “line”] after splitting and removing stop words and short tokens.
Filtering: Tokens shorter than 3 characters are dropped, along with stop words.
After tokenization, the result is a term-frequency map like { "compute": 2, "learn": 3, "embedding": 1, … }.

4. Hashing Trick/Feature Hashing
The core challenge here is the size of real-world vocabularies. There are millions of distinct terms, which makes it almost impossible to allocate a separate vector dimension per term. The hashing trick solves this problem by mapping tokens directly into a bounded index range via a hash function, which eliminates the need to store a vocabulary.
C#
private static int StableBucket(string token, int size)
{
    unchecked
    {
        uint hash = 2166136261u; // FNV offset basis
        foreach (char c in token)
        {
            hash ^= (byte)c;     // low 8 bits of each UTF-16 code unit
            hash *= 16777619u;   // FNV prime
        }
        return (int)(hash % (uint)size);
    }
}

Here the FNV-1a (Fowler–Noll–Vo) hash function is used. It is a lightweight, non-cryptographic hash well suited to short strings, with good bit distribution. It uses two canonical constants.
FNV offset basis: Decimal: 2166136261, Hex: 0x811C9DC5
FNV prime: Decimal: 16777619, Hex: 0x01000193
Each character is processed by XOR-ing the current hash with the character’s byte value; the result is then multiplied by the FNV prime. The XOR-then-multiply order spreads each byte's influence across the 32-bit hash, improving avalanche behavior for short tokens like English words. .NET’s string.GetHashCode() is not suitable here because it is randomized per process run as a defense against hash-flooding attacks. StableBucket must return the same bucket index for a given token on every run, so it computes a deterministic 32-bit hash itself. The unchecked block disables overflow checking, so the multiplication wraps around with 32-bit unsigned semantics instead of throwing.

5. Log-Based TF Normalization
C#
float weight = MathF.Log(1f + count);

Term frequency does not scale linearly with semantic importance. For example, a term that appears 10 times in a document is not 10 times more important than a term that appears once. Applying log(1 + count) compresses the raw frequency. The table below shows how this log-based weighting works. This ensures that no single repeated term disproportionately shapes the embedding, the same reasoning behind TF-IDF in traditional information retrieval systems.

6. Trigram Features for Morphology Capture
C#
if (token.Length >= 4)
{
    for (int i = 0; i <= token.Length - 3; i++)
    {
        string trigram = token[i..(i + 3)];
        int trigramBucket = StableBucket(trigram, Dimensions);
        vector[trigramBucket] += weight * 0.5f;
    }
}

Whole-word hashing produces hard edges for related terms like “play”, “player”, and “playing”. These terms are treated as separate features and land in different buckets. Trigrams help to reconnect them and smooth out these gaps. Here are the trigrams for “playing” and “player”.
Plain Text
playing - pla, lay, ayi, yin, ing
player - pla, lay, aye, yer

Here, common trigrams like pla and lay cause both terms to accumulate weight in some of the same hashed buckets, which pulls their vectors closer in embedding space.
The half weight (0.5f) ensures that trigram features do not dominate the whole-word signal.

7. L2 Vector Normalization: Cosine Similarity via Direct Dot Products
C#
private static void NormalizeL2(float[] vector)
{
    float magnitude = 0f;
    foreach (float v in vector)
        magnitude += v * v;
    magnitude = MathF.Sqrt(magnitude);
    if (magnitude > 0f)
        for (int i = 0; i < vector.Length; i++)
            vector[i] /= magnitude;
}

Once all token and trigram weights have been applied, the resulting vector is normalized so that its Euclidean length equals 1. This normalization enables a key mathematical identity:
Plain Text
cosine_similarity(a, b) = a · b when ‖a‖ = ‖b‖ = 1

When vectors are already L2-normalized, cosine similarity reduces to the raw dot product, eliminating the need for any division at query time.

8. Utility: GetTopTokenWeights
C#
public Dictionary<string, float> GetTopTokenWeights(string text, int topN = 10)
{
    var tokenFreq = Tokenize(text);
    return tokenFreq
        .Select(kv => new { Token = kv.Key, Weight = MathF.Log(1f + kv.Value) })
        .OrderByDescending(x => x.Weight)
        .Take(topN)
        .ToDictionary(x => x.Token, x => x.Weight);
}

This diagnostic method highlights the tokens that contribute most to the final embedding. It provides insight into why two documents achieve high similarity scores and confirms that stop word removal and tokenization are working as expected.

Limitations and Production Enhancements
This service is fully deterministic, fast, and requires zero supporting infrastructure. It performs well for vocabulary-driven similarity, that is, cases where documents share the same vocabulary. It does not encode semantic relationships. For example, “car” and “sedan” will end up in separate buckets and will not be scored as similar. For production-grade semantic search, LocalEmbeddingService can be replaced with either an OpenAI embedding model or a local ONNX sentence transformer.
Because both implementations share the IEmbeddingService interface, no code change is required in other components such as API controllers, the vector index, or retrieval logic. Project repository: TextEmbeddingService
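The eight steps above can be condensed into a short Python sketch for readers who want to experiment without a .NET toolchain. It is an illustrative port, not the project's code: the stop-word set is a tiny stand-in for Lucene.NET's list, the regular expression treats all whitespace as a delimiter, and the whole-word accumulation line is an assumption, since the article shows only the trigram update.

```python
import math
import re

DIMENSIONS = 512
# Tiny illustrative stop-word set; the article uses Lucene.NET's curated list.
STOP_WORDS = {"and", "the", "is", "while", "for", "with"}
DELIMITERS = re.compile(r"[\s,.!?;:\"'()\[\]{}\-_/\\]+")


def stable_bucket(token: str, size: int = DIMENSIONS) -> int:
    """FNV-1a over the low byte of each character, reduced modulo size."""
    h = 2166136261                       # FNV offset basis
    for ch in token:
        h ^= ord(ch) & 0xFF              # mirrors the C# (byte)c cast
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return h % size


def tokenize(text: str) -> dict:
    """Lowercase, split on delimiters, drop short tokens and stop words."""
    freq = {}
    for tok in DELIMITERS.split(text.lower()):
        if len(tok) > 2 and tok not in STOP_WORDS:
            freq[tok] = freq.get(tok, 0) + 1
    return freq


def embed(text: str) -> list:
    vec = [0.0] * DIMENSIONS
    for token, count in tokenize(text).items():
        weight = math.log(1 + count)      # log-based TF weight
        vec[stable_bucket(token)] += weight  # assumed whole-word update
        if len(token) >= 4:
            for i in range(len(token) - 2):  # every 3-character window
                vec[stable_bucket(token[i:i + 3])] += weight * 0.5
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(a: list, b: list) -> float:
    # Both inputs are L2-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


print(tokenize("top-of-the-line"))
print(cosine(embed("playing"), embed("player")))
```

Because every weight is non-negative, two texts that share any token or trigram bucket always get a strictly positive score, which is how "playing" and "player" end up related even though their whole-word buckets differ.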
Abstract Modern distributed systems rarely fail in isolation — they degrade across multiple execution steps. This article presents a control-loop-based architecture for building self-healing systems that detect anomalies early, precisely isolate failures, and automatically recover using context-aware decisions. Introduction Modern distributed systems are large-scale platforms built on service-oriented architecture. In such systems, an individual request — the unit of execution — typically flows through multiple services, including clients (request initiators), orchestrators, enrichment layers, validation or policy-evaluation systems, routing layers, downstream dependencies, state management systems, reconciliation processes, and notification systems. Each service in this chain introduces latency, retries, dependencies, and failure modes. Because of this, failures in distributed systems rarely appear as clean, isolated events. Instead, they emerge as a sequence of interacting issues that create a cascading effect across the system. For example, a downstream dependency may become slow in a specific region. This increases retries, which in turn increases queue depth. The growing queue depth puts pressure on the orchestrator, eventually causing it to fail unrelated requests due to resource saturation. What initially was a local dependency problem rapidly turned into a widespread degradation of workflow. This problem is particularly difficult in asynchronous systems, where failures are not always instantly visible. A request may not fail instantly — it may remain pending, miss its expected execution window, be delayed in execution, get stuck in an intermediate state, or lose coordination between system components. When the operator detects the issue, the impact could already be large enough. However, traditional protection mechanisms such as fixed failure thresholds, static alerts, and global circuit breakers are often too coarse-grained for these scenarios. 
A localized dependency failure should not halt the entire system. At the same time, localized issues must not be allowed to trigger retry storms or cascade into otherwise healthy execution paths. The goal, therefore, is to build a self-healing control system that can detect anomalies at the level of individual requests, aggregate signals across execution and system dimensions, isolate only the affected scope, and recover gradually based on real-time evidence. This post presents such a system. It is designed to provide predictive anomaly detection, hierarchical aggregation, scoped and global kill switches, adaptive leaky-bucket flow control, observability, and AI-assisted investigation and escalation.

feature | static thresholds (old way) | context-aware loops (new way)
Detection | Static Thresholding | Predictive Anomaly Detection
Containment | Global | Scoped
Control | Binary Shutdown | Adaptive Flow Control
Recovery | Manual | Evidence-Based Self-Healing

Why Traditional Systems With Static Thresholds Won’t Work
Most distributed systems rely on mechanisms like retries, dead-letter queues, alerts, and circuit breakers. These are useful but not enough for complex async workflows, because they depend on static thresholds, which are context-blind by nature. A rule like “trigger an alert when failures exceed X%” cannot distinguish between fundamentally different types of failures:
Logical failures, where a request completes but produces an incorrect result due to issues in input, configuration, or application logic
Execution failures, where a request produces no result due to delays, retries, or loss of coordination across system components
For example, in an AI inference system, a request may return an incorrect response due to model configuration issues (logical failure), or it may be accepted but never complete due to stalled execution in downstream components (execution failure). Static thresholds treat both cases uniformly, even though they require very different responses.
As a result, systems either overreact to expected failures or miss critical anomalies such as stuck or silently failing requests. Failure volume alone is also a weak signal. A small number of failures can be highly significant if those requests were expected to succeed. For instance, if requests following the same execution path have historically been highly reliable, even a few failures in that cohort can indicate a serious issue. Static thresholds also lack scope awareness. A local failure, for example one affecting only requests routed through a particular execution path, dependency, or region, should not cause a global shutdown. However, a pattern of small anomalies across different paths, regions, or request classes could indicate a larger systemic problem, even if no single threshold is crossed. For instance, in an inference system, requests served by a specific model variant may observe increased latency or degraded outputs due to recent changes to configurations or parameters, while other models and request paths continue to function normally. These limitations are amplified in asynchronous systems, where failures are not always explicit. Coordination gaps can cause requests to be stuck, delayed, retried multiple times, or left in inconsistent states. This leads to higher latency, missed completion signals, or repeated retries with no progress. These weaknesses become even more apparent during recovery. Operators or AI agents have to manually inspect logs and dashboards to determine when to resume traffic, resulting in slow, inconsistent, and reactive recovery. In summary, these challenges demonstrate that static thresholding is not sufficient for modern distributed systems. What is needed is a system that understands request context, expected behavior, and the scope of the anomaly.
This leads to a fundamental shift in system design: Static thresholding → Predictive anomaly detection Global containment → Scoped containment Binary shutdown → Adaptive flow control Manual recovery → Evidence-based self-healing Instead of asking: Are requests failing? The system should ask: Are requests behaving as expected within their defined SLA, given their execution context and expected outcomes? System Architecture as a Control Loop The system functions as a control loop during request execution. It does not replace the execution path. Instead, it constantly monitors the system's behavior, predicts expected outcomes, identifies deviations, and makes control decisions based on real-time signals. Orchestrated Execution With Continuous Monitoring A primary orchestrator drives the system. It executes each request through a series of steps. At each step, the orchestrator calls on one or more downstream systems, either synchronously or asynchronously. These downstream systems may have their own dependencies. As the request moves forward, it carries contextual metadata like tenant class, region, request type, execution path, and routing decisions. This context defines how the request should behave at each step or at a specific point. While the orchestrator manages execution, anomaly detection serves as a continuous control layer throughout these steps. It tracks the outcome of each phase to ensure that the request moves forward as expected and that the contextual integrity remains intact. Context Preservation and Signal Collection At every step, the system captures signals such as latency, retries, routing decisions, execution status, and downstream responses. It also augments the request with derived attributes such as execution path identifiers and historical behavior patterns. 
This ensures that each request is evaluated relative to similar cohorts, and more importantly, allows the system to identify where deviations occur within the execution flow — not just whether the request ultimately fails. Success Prediction Engine Intuition: The system learns what 'normal' looks like for similar requests and uses that to estimate expected outcomes. The system estimates how likely a request is to succeed based on its context and historical behavior. For each request i, the expected success is computed as: Plain Text P_i = P(success | x_i) Where: x_i = request features (context, routing path, system state) P_i = expected probability of success This establishes what should happen at different stages of execution, allowing the system to detect deviations between expected and actual outcomes throughout the request lifecycle. Step-Level Anomaly Detection Unlike traditional systems that evaluate only final success or failure, this system continuously monitors each critical step of execution. A request may: Be accepted but delayed Be routed to an unexpected path Experience retries at a specific step Produce degraded output Fail to progress beyond a step By evaluating these signals against expected behavior for that request’s context, the system can detect anomalies early and pinpoint the exact step where deviation occurs. Inference Example (Grounding) For example, in an inference system, the orchestrator can direct a request from a certain tenant class to a summarization model in a certain subnet of a region. If that subnet/region experiences network latency, requests may still be accepted and processed, but exhibit higher latency or delayed responses. In this case, the orchestrator continues execution, but a specific step — model execution in that region — is deviating from expected behavior. Other models or regions may continue to function normally. 
Hierarchical Roll-up Counters
The hierarchical roll-up model aggregates anomalies across multiple contextual dimensions. When a request deviates from expected behavior at any step, the system updates counters across relevant dimensions such as dependency, execution path, tenant class, and region. Example roll-ups:
Plain Text
(dependency, request_type)
(dependency, request_type, tenant_class)
(dependency, region)
(execution_path, request_type)
(global)

A single anomalous request may update multiple roll-ups simultaneously. For example, a request routed to a summarization model in a latency-affected region may update:
Plain Text
(summarizer_model, tenant_class_A, region_us_west)
(summarizer_model, region_us_west)
(summarizer_model, tenant_class_A)
(global)

This multi-dimensional view allows the system to isolate issues precisely while still capturing broader systemic patterns.

Roll-Up Configuration Model
Each roll-up is independently configurable, allowing the system to adapt thresholds and behavior based on the criticality of different execution paths and request classes. Example configuration:
JSON
{
  "rollup_id": "dependency_request_type_region",
  "dimensions": ["dependency", "request_type", "region"],
  "threshold": 25,
  "tumbling_window": "30m",
  "parent_rollup_ids": [
    "dependency_region",
    "dependency_request_type",
    "dependency",
    "global"
  ],
  "control_action": "HOLD_AND_PROBE"
}

Key Fields
dimensions → define how the rollup key is constructed
threshold → anomaly count required to trigger
tumbling_window → fixed evaluation window (e.g., 30 minutes)
parent_rollup_ids → defines relationships across rollups
control_action → action applied when this rollup becomes the resolved scope

Hierarchical Rollup Model (DAG)
The hierarchy is modeled as a directed acyclic graph (DAG). This allows a granular rollup to contribute to multiple parent views.
For example, a single granular roll-up key contributes to the following parent views:

Plain Text
(dependency=D1, request_type=TYPE_A, region=EU)
  → (dependency=D1, region=EU)
  → (dependency=D1, request_type=TYPE_A)
  → (dependency=D1)
  → (global)

A single anomalous request may update multiple roll-ups simultaneously, including both child and parent scopes.

Roll-up Runtime State

At runtime, each roll-up key maintains its own state within a tumbling window:

Plain Text
Rollup: (dependency, region)
Key: D1:EU
Window: 30 mins
Anomaly Count: 35
Threshold: 25
→ FIRED

Each roll-up evaluates independently:
- A child roll-up may fire without the parent firing
- A parent roll-up may fire when anomalies are distributed across multiple children

Parent Roll-up Escalation Guard

Since parent roll-ups aggregate signals, the system must prevent escalation caused by a single noisy child. Instead of maintaining full child-level state, each parent tracks three lightweight signals:
- parent_anomaly_count
- impacted_child_count
- max_child_contribution_ratio

A parent roll-up is considered impacted only when:

Plain Text
parent_anomaly_count >= parent_threshold
AND impacted_child_count >= min_required_children
AND max_child_contribution_ratio <= max_allowed_ratio

Example: Do not escalate at the parent level if only request type TYPE_A is failing.

Plain Text
TYPE_A = 100 anomalies
TYPE_B = 0
TYPE_C = 0
Parent count = 100
Impacted children = 1
→ Keep control at child level

Example: Escalate.

Plain Text
TYPE_A = 40
TYPE_B = 35
TYPE_C = 25
Parent count = 100
Impacted children = 3
→ Escalate to parent scope

Why This Matters

This ensures:
- Localized issues remain scoped
- Distributed anomalies are escalated correctly
- Noisy signals do not trigger unnecessary global actions

Anomaly Detection Engine

The anomaly detection engine identifies unexpected deviations by comparing predicted outcomes with actual results, and propagates these signals to the roll-up counters.
A request is marked anomalous only if it was expected to succeed but deviates from expected behavior:

Plain Text
Anomaly_i = 1 if P_i ≥ τ AND Y_i deviates from expected outcome

Where:
- P_i = predicted success probability
- Y_i = observed outcome (failure, delay, degraded output, etc.)
- τ = the probability threshold above which a request is expected to succeed

Each anomalous request updates multiple roll-ups across dimensions such as dependency, region, request type, and tenant class.

The system evaluates all roll-ups that breach their thresholds and resolves the appropriate control scope. It then:
- Deduplicates overlapping signals
- Selects the highest meaningful level in the hierarchy
- Avoids redundant or conflicting controls

This ensures:
- Localized issues remain scoped
- Correlated anomalies are elevated appropriately
- Duplicate control actions are avoided

Kill Switch Controller

The kill switch controller enforces control actions at the resolved anomaly scope. Based on severity and scope, it determines whether to:
- Stop new incoming requests within the scope
- Hold in-progress requests before critical downstream steps
- Allow controlled traffic via throttling or probing

Control Actions

Plain Text
ALLOW    → continue processing
HOLD     → pause new and in-progress requests
THROTTLE → limit request rate
PROBE    → allow controlled traffic
REROUTE  → send via alternate path
ESCALATE → trigger alerts / human intervention

The controller applies actions consistently across the resolved scope, ensuring full containment without partial or conflicting behavior.

Adaptive Recovery Strategy

Once a control action is applied, the system does not immediately resume normal traffic. Instead, it gradually reintroduces traffic using a probing strategy.
For example:

Plain Text
Step 1: allow 1 request
Step 2: if successful (actual outcome == predicted outcome), allow 2
Step 3: if stable, allow 5
Step 4: gradually increase
Step 5: if failures reappear, reduce or stop

Recovery is guided by:

Plain Text
Recovery_G = Successful_G / Released_G

Where:
- G = impacted roll-up scope
- Released_G = requests released through scope G during recovery
- Successful_G = released requests whose actual outcome matched the predicted outcome

This ensures:
- Safe and gradual recovery
- Avoidance of sudden failure spikes
- Validation of real system behavior

Observability and Audit Layer

The system captures all signals across execution:
- Predicted outcome
- Actual outcome
- Anomaly classification
- Impacted roll-ups
- Resolved scope
- Control action
- Recovery state

These signals provide visibility into:
- Anomaly trends
- Active control scopes
- Held vs. released requests
- Recovery progress

This ensures full transparency, debuggability, and auditability.

AI Control Plane

The AI control plane operates outside the execution path and complements deterministic control logic.

It consumes:
- Anomaly signals
- Roll-ups
- Deployment changes
- System health
- Control decisions

It performs:
- Investigation → correlates anomalies with systems or changes
- Automated remediation → triggers safe rollback
- Escalation → notifies relevant teams
- Summarization → generates incident insights

Key Separation

Plain Text
Decision Plane   → deterministic (prediction, anomaly detection, control)
AI Control Plane → intelligent (analysis, remediation, escalation)

Conclusion

Modern distributed systems cannot rely on static thresholds and reactive controls. Failures are often contextual, asynchronous, and distributed across multiple execution paths. This architecture introduces a fundamental shift:
- From failure counting → context-aware detection
- From global shutdown → scoped containment
- From reactive response → adaptive, evidence-based recovery

By combining prediction, hierarchical roll-ups, scoped control, and adaptive recovery, the system can precisely isolate deviations, minimize impact, and restore stability safely.
The core idea is simple but powerful: systems should not just detect failures; they should continuously understand system behavior, localize deviations in context, and adapt in real time to maintain reliability.

What's Next: From Architecture to Code

Designing the architecture is only the first step. In the next post, we move from the blueprint to the technical implementation, diving deep into:
- The State Machine: Managing high-cardinality counters without adding latency to, or otherwise affecting, the execution path.
- The Escalation Guard: Pseudo-code to prevent "noisy neighbor" failures.
- Adaptive Recovery: The logarithmic logic for safe traffic re-introduction.

Stay tuned for the implementation deep-dive.

Case Study: Applying the Control Loop to a Multi-Region Inference System

End-to-end Example: Inference system with scoped control and adaptive recovery

This example illustrates how anomalies propagate, how scope is resolved, and how control and recovery are applied in an inference system.

Step 1: Incoming Requests

Requests are routed by the orchestrator to model services in the DUB region:

Plain Text
(model=summarizer_v2, tenant_class=A, region=DUB)
(model=translator_v1, tenant_class=A, region=DUB)
(model=qa_model_v3, tenant_class=A, region=DUB)

Predicted success: P_i ≈ 0.95 or higher

Step 2: Deviations → Anomalies

Due to network degradation in DUB, requests begin to show:
- increased latency
- delayed responses
- occasional degraded outputs

Because these requests were expected to succeed, each deviation is counted as an anomaly:

Y_i deviates and P_i ≥ τ ⇒ Anomaly_i = 1
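The Step 2 rule can be sketched as a small predicate. The threshold value `TAU = 0.9`, the `Observation` fields, and the latency budget are assumptions chosen for illustration; the article does not specify them.

```python
from dataclasses import dataclass

TAU = 0.9  # assumed value for the threshold τ


@dataclass(frozen=True)
class Observation:
    """Observed outcome Y_i for one request (hypothetical fields)."""
    succeeded: bool
    latency_ms: float
    degraded_output: bool = False


def deviates(obs, latency_budget_ms):
    """True if the outcome is a failure, a delay, or a degraded response."""
    return (not obs.succeeded) or obs.latency_ms > latency_budget_ms or obs.degraded_output


def is_anomaly(p_success, obs, latency_budget_ms, tau=TAU):
    """Anomaly_i = 1 only when success was expected (P_i >= tau) and Y_i deviates."""
    return p_success >= tau and deviates(obs, latency_budget_ms)
```

Under these assumptions, a DUB request with P_i = 0.95 that completes in 1800 ms against a 500 ms budget counts as an anomaly, while a request that was already expected to fail (say P_i = 0.4) does not, even if it fails.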
Step 3: Roll-up Updates

Each anomalous request updates multiple roll-ups:

Plain Text
(summarizer_v2, tenant=A, DUB) → 40
(translator_v1, tenant=A, DUB) → 35
(qa_model_v3, tenant=A, DUB) → 25
(region=DUB) → 100

Step 4: Parent Escalation Guard

Plain Text
parent_anomaly_count = 100
impacted_child_count = 3
max_child_contribution_ratio = 40%

Since anomalies are distributed across multiple models, not concentrated in one:

Plain Text
→ Escalate to (region=DUB)

Step 5: Impact Resolution

Fired roll-ups:

Plain Text
(summarizer_v2, tenant=A, DUB)
(translator_v1, tenant=A, DUB)
(qa_model_v3, tenant=A, DUB)
(region=DUB)

Resolved scope:

Plain Text
(region=DUB)

Child roll-ups are de-duplicated and consolidated under the parent scope.

Step 6: Control (Scoped Isolation + Reroute + Local Probing)

Action:

Plain Text
HOLD_AND_PROBE + REROUTE

Effect:
- Throttle or hold most requests routed to DUB
- Reroute the majority of traffic to FRA, but only after verifying that FRA has sufficient available capacity and is operating within stable limits
- Allow a small number of low-impact requests to continue via DUB as probes

These probe requests validate whether the issue is transient or persistent without exposing the system to large-scale risk.
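The Step 4 guard check can be reproduced with a short sketch. The function name and the specific settings used in the usage note (`parent_threshold=50`, `min_required_children=2`, `max_allowed_ratio=0.6`) are assumptions; the article gives the three conditions but not their values.

```python
def should_escalate(child_counts, parent_threshold, min_required_children,
                    max_allowed_ratio, child_impact_floor=1):
    """Apply the parent escalation guard to per-child anomaly counts.

    child_counts maps a child roll-up key to its anomaly count in the window.
    child_impact_floor (assumed) is the minimum count for a child to be "impacted".
    """
    parent_anomaly_count = sum(child_counts.values())
    if parent_anomaly_count == 0:
        return False

    impacted_child_count = sum(1 for c in child_counts.values() if c >= child_impact_floor)
    max_child_contribution_ratio = max(child_counts.values()) / parent_anomaly_count

    return (
        parent_anomaly_count >= parent_threshold
        and impacted_child_count >= min_required_children
        and max_child_contribution_ratio <= max_allowed_ratio
    )
```

With those assumed settings, the DUB counts `{"summarizer_v2": 40, "translator_v1": 35, "qa_model_v3": 25}` escalate (three impacted children, largest share 40%), whereas a single noisy child such as `{"TYPE_A": 100, "TYPE_B": 0, "TYPE_C": 0}` stays at child scope because only one child is impacted and its share is 100%.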
Step 7: Adaptive Recovery

Traffic is managed dynamically:

Plain Text
DUB (probe path): 1 → 2 → 5 → gradual increase
FRA (rerouted path): handles majority of traffic

Recovery signal:

Plain Text
Recovery_G = Successful_G / Released_G

- If probe requests via DUB succeed → gradually restore DUB traffic
- If failures persist → continue routing to FRA and reduce DUB probes

Step 8: AI Control Plane

Based on observed signals:
- Regional network issue → continue routing to FRA
- Model deployment issue → rollback model version
- Infrastructure saturation → rebalance across regions
- Transient degradation → generate summary without escalation

Key Takeaways
- Failures are localized but distributed across models
- Control is applied at the correct scope (region-level)
- System avoids global shutdown
- Recovery is validated through controlled probing
- Traffic is dynamically rerouted and restored

The system does not simply stop traffic. It isolates the impacted scope, reroutes intelligently, and verifies recovery through controlled probing before restoring normal behavior.
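As a closing sketch, the Step 7 probe ramp can be modeled as a small state machine driven by Recovery_G. The batch sizes beyond 5, the `min_recovery` cutoff, and the choice to step back one level on failure are assumptions; the article only specifies "1 → 2 → 5 → gradual increase" and "reduce or stop".

```python
PROBE_STEPS = [1, 2, 5, 10, 20, 50]  # sizes after 5 are assumed for illustration


class ProbeRamp:
    """Tracks how many probe requests to release to an impacted scope G."""

    def __init__(self, steps=PROBE_STEPS, min_recovery=1.0):
        self.steps = list(steps)
        self.min_recovery = min_recovery  # assumed: every probe in a batch must succeed
        self.index = 0

    @property
    def batch_size(self):
        """Number of requests to release in the next probe batch."""
        return self.steps[self.index]

    def record_batch(self, released, successful):
        """Update the ramp from one batch and return Recovery_G for that batch."""
        recovery = successful / released if released else 0.0
        if recovery >= self.min_recovery:
            # Stable: move to the next, larger batch (capped at the last step).
            self.index = min(self.index + 1, len(self.steps) - 1)
        else:
            # Failures reappeared: reduce the probe volume by one step.
            self.index = max(self.index - 1, 0)
        return recovery
```

Starting from a batch of 1, two fully successful batches move the ramp to 5 probes; if only 3 of those 5 succeed (Recovery_G = 0.6), the ramp drops back to 2 while the rerouted path keeps carrying the remaining traffic.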
Beyond Static Thresholds: Building Self-Healing Systems via Context-Aware Control Loops
June 29, 2026 by