Most AI failures in products do not happen because the model is weak. They happen because the model is guessing in the dark.

A large language model can write code, summarize meetings, draft emails, generate reports, and answer customer questions. But when it does not know which customer, which contract, which policy, which ticket, which version of the truth, or which permission boundary applies, it will still produce a confident answer. That answer may look polished. It may also be wrong.

This is why the next serious conversation in AI product development is not only about better models. It is about better context engineering.

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." (Source: Andrej Karpathy, quoted by Simon Willison)

For teams integrating AI into real products and workflows, this is the shift that matters: stop treating context as prompt decoration. Start treating it as production infrastructure.

The Real Problem Is Not Intelligence, But Relevance

Imagine a sales manager asks an AI assistant: "Prepare me for tomorrow’s customer renewal meeting." A generic chatbot may produce a nice checklist: agenda, talking points, objections, next steps. Useful? Maybe.

But a context-aware AI system would know much more. It would know the customer has three open support tickets. It would see that the renewal is due next month. It would find the customer’s last complaint about onboarding delays. It would pull the latest usage trend from analytics. It would avoid showing internal pricing notes if the user does not have access.

Same model. Very different outcome. The difference is not magic. It is context.

In enterprise AI, the question is no longer "Can the model answer?" The question is "Can the system supply the right facts, relationships, constraints, and permissions before the model answers?" That is context engineering.
RAG Was the First Bridge

Retrieval-augmented generation, or RAG, became popular because it solved a painful problem: models do not know your private business data. The basic idea is simple. You take documents, split them into chunks, convert them into embeddings, store them in a vector database, and retrieve similar chunks when a user asks a question.

This works well for direct lookup tasks. Ask: "What is our refund policy for enterprise customers?" The system retrieves the most relevant policy text. The model writes the answer. That is a big improvement over asking the model to rely on training data or memory.

"We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG), models which combine pre-trained parametric and non-parametric memory for language generation."
Source: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"

But basic RAG has limits. It often retrieves text that is semantically similar but not operationally correct. It may miss relationships across systems. It may return stale documents. It may retrieve too much. It may retrieve data the user should not see. It may not know that "ACME Ltd.", "ACME EMEA", and "ACME renewal account" are part of the same customer story. That is why many RAG prototypes look impressive in demos but struggle in production.

Better Context Is Not More Context

A common mistake is to solve bad AI output by adding more documents to the prompt. More policy documents. More tickets. More meeting notes. More wiki pages. More logs. This feels reasonable, but it often makes the system worse.

Large context windows are useful, but they are not a replacement for relevance. Models can still miss important information when the input is long, noisy, or poorly ordered.

"We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts."
Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts"

This matters for product teams. If an AI assistant needs to answer a compliance question, dumping ten policy PDFs into the prompt is not context engineering. It is context flooding. Context engineering means selecting the minimum useful context that helps the model complete the task safely and accurately.

Think of it like preparing a senior architect for a design review. You would not hand over every Slack thread, every ticket, every diagram, and every log file. You would prepare the right system diagram, the latest decision record, known risks, performance constraints, and the current business goal. AI needs the same discipline.

GraphRAG Helps AI Understand Relationships

Basic RAG is good at finding similar text. GraphRAG is better when the answer depends on relationships.

For example, consider a customer success AI assistant. A user asks: "Why is this enterprise customer at risk?" Basic RAG may retrieve documents mentioning "risk", "renewal", or the customer name. GraphRAG can go further. It can connect the customer to products, incidents, support tickets, account owners, contract dates, usage patterns, regions, and unresolved escalations. That graph creates structure. Vector search then fills in the details.

"Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities."
Source: Edge et al., "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

This is especially useful for questions that are not simple lookups:

"What are the main themes across customer complaints?"
"Which suppliers affect this delayed order?"
"What engineering risks are linked to this migration?"
"Which product issues are impacting renewals?"

These are not just search questions. They are sense-making questions.
GraphRAG gives AI a map, not just a pile of pages.

Agentic RAG Makes Retrieval Iterative

In simple RAG, retrieval usually happens once. User asks. System retrieves. Model answers. But real work is rarely that clean.

An analyst may ask for a market brief. The AI retrieves three documents, notices that pricing is missing, searches again, checks a recent customer thread, compares it with the CRM record, and then writes the brief. That is agentic RAG. The agent does not only answer. It decides what context it still needs.

This pattern is powerful, but it also raises the bar for governance. The more an AI system can search, call tools, and combine sources, the more important access control, audit logs, policy checks, and response validation become. In other words, agentic AI without runtime governance is not an assistant. It is a risk multiplier.

The 4 Pillars of Production Context Engineering

A strong AI system needs four capabilities.

First, connected access. The AI must reach the right systems: databases, document stores, APIs, SaaS tools, data warehouses, and event streams. In modern enterprises, useful context rarely lives in one place.

Second, a knowledge layer. Raw data is not enough. The system needs entities, relationships, hierarchies, definitions, ownership, and institutional memory. This is where knowledge graphs, metadata, taxonomies, and business rules become valuable.

Third, precision retrieval. The system must retrieve by intent, role, time, freshness, source quality, and task. The goal is not the biggest prompt. The goal is the cleanest signal.

Fourth, runtime governance. Access control must apply when data is retrieved and when the answer is produced. A model should not leak restricted information simply because it was available somewhere in the pipeline.

This is where enterprise architecture experience matters. AI performance is not just a model selection problem. It is a systems design problem.
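The third and fourth pillars can be made concrete with a small sketch. The snippet below is a minimal illustration, not a reference design: it assumes an upstream retriever has already scored each chunk, and the `Chunk` fields, the `select_context` function, the role names, and the sample data are all hypothetical. It filters by role and freshness before ranking, so restricted or stale material never reaches the prompt, and it caps the result at a small budget instead of sending everything.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float              # relevance score from an upstream retriever (assumed given)
    allowed_roles: frozenset  # roles permitted to see this chunk
    updated: date             # last time the source was refreshed


def select_context(chunks, role, today, max_age_days=365, budget=3):
    """Drop chunks the caller may not see or that are stale, then keep the top few."""
    visible = [
        c for c in chunks
        if role in c.allowed_roles and (today - c.updated).days <= max_age_days
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:budget]


# Hypothetical data echoing the renewal-meeting example.
chunks = [
    Chunk("Renewal is due next month.", 0.91, frozenset({"sales", "finance"}), date(2024, 5, 1)),
    Chunk("Internal pricing notes.", 0.95, frozenset({"finance"}), date(2024, 5, 2)),
    Chunk("Onboarding complaint from 2019.", 0.80, frozenset({"sales"}), date(2019, 1, 10)),
    Chunk("Three open support tickets.", 0.85, frozenset({"sales", "support"}), date(2024, 4, 20)),
]

selected = select_context(chunks, role="sales", today=date(2024, 6, 1))
for chunk in selected:
    print(chunk.text)
```

Note that the highest-scoring chunk (the pricing notes) is excluded for a sales user: in this sketch, permission is checked before relevance, not after.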
A Simple Product Example

Suppose you are building an AI assistant for a product data platform. A user asks: "Create an enriched product description for this laptop."

A weak system sends the product title to a model and asks for a description. A better RAG system retrieves product specs and category guidelines. A GraphRAG system understands that this laptop belongs to a series, has related accessories, uses a specific processor family, maps to a category taxonomy, and must follow marketplace-specific rules. A context-engineered system does all of that, then checks language requirements, brand rules, region constraints, content quality thresholds, and user permissions.

That is the difference between a cool demo and a production-grade AI workflow.

The AI Moat Is Moving

Models are becoming more capable and more accessible, which means the durable advantage will come from how well a company connects its AI to trusted context. The winners will not be the teams with the longest prompts. They will be the teams with the best context pipelines, clean retrieval strategies, strong knowledge layers, and live governance.

RAG helps models remember your data. GraphRAG helps them understand relationships. Agentic RAG helps them search iteratively. Context engineering brings all of it together into a system that can be trusted in real workflows.

The future of AI performance will not be won by asking, "Which model should we use?" It will be won by asking, "What does the model need to know, where should that knowledge come from, and is it allowed to use it?"

For teams modernizing platforms, adding GenAI to products, or scaling AI-first workflows, this is the work that matters now. The next time an AI feature gives a generic, wrong, or risky answer, do not only blame the model. Look at the context. That is probably where the real architectural problem begins.
For more practical thinking on AI-first enterprise architecture, legacy modernization, event-driven platforms, and GenAI systems, readers can connect with Faisal Feroz on LinkedIn or explore his writing on his blog.
Retrieval-augmented generation (RAG) is now the default pattern for grounding large language models in private or domain-specific knowledge. Yet most RAG systems still hallucinate, and the cause is rarely the model itself. It is the retrieval step. A language model can only reason over the passages it is handed; when retrieval returns an incomplete or disconnected set of passages, the model quietly fills the gaps with plausible-sounding but unsupported text. The retrieval layer, in other words, is where trustworthiness is won or lost.

This article examines a specific architectural idea, relationship-aware retrieval, and how it addresses the retrieval weaknesses that lead to hallucination. The reference implementation is RudraDB-Opin, a free, relationship-aware vector database. RudraDB-Opin is the free edition built for learning, prototyping, and real projects: it supports up to 100,000 vectors and 500,000 relationships, ample room to model a substantial knowledge base and demonstrate every retrieval pattern discussed here.

The Weak Link in RAG Is Similarity-Only Retrieval

A conventional RAG pipeline embeds each document chunk into a vector, stores the vectors, and at query time returns the k chunks whose embeddings are closest to the query embedding. This works well when the answer lives inside a single passage that happens to be lexically and semantically similar to the question. It breaks down in a common and important case: when the information needed to answer correctly is connected to the matched passage rather than being similar to it.

Consider a troubleshooting knowledge base. A user asks why a service intermittently returns timeout errors. Similarity search retrieves the passage describing the timeout symptom. But the actual remediation lives in a separate passage about a connection-pool setting, a passage that shares almost no vocabulary with the question and therefore ranks low.
The cause-and-effect link between the symptom and the fix is real, but a similarity index has no representation of it. The model receives the symptom without the cause, and a hallucinated remedy is the predictable result. This is the structural blind spot of similarity-only retrieval: it can find passages that look alike, but it cannot find passages that belong together.

Adding the Missing Dimension: Relationships Between Items

Relationship-aware retrieval keeps similarity search and adds an explicit model of how items relate to one another. In RudraDB-Opin, every stored vector can carry typed, directional relationships to other vectors, and search can traverse those relationships to assemble context that similarity alone would never surface.

RudraDB-Opin defines five relationship types, each mapping to a connection pattern that appears repeatedly in real knowledge bases:

| Relationship type | What it captures | Typical RAG use |
| --- | --- | --- |
| semantic | Topical or meaning-based connection | Related articles, alternative explanations of the same concept |
| hierarchical | Parent–child or part-of structure | A concept and its prerequisites; a section and its subsections |
| temporal | Order or time-based progression | Sequential steps, course, or workflow order |
| causal | Cause-and-effect or problem–solution | Symptom-to-fix, question-to-answer, trigger-to-outcome |
| associative | General, looser association | Cross-references, "see also," recommendations |

The important property is that these relationships are first-class data, not a side effect of vector proximity. A causal link between a symptom passage and its fix exists in the graph, whether or not the two embeddings are close. Retrieval can therefore follow the link directly.

How Relationship-Aware Retrieval Works in Practice

RudraDB-Opin is intentionally zero-configuration. It detects the embedding dimension from the first vector you add, so it works with any model, such as Sentence Transformers (384-D), HuggingFace models (768-D), or OpenAI embeddings (1536-D), without any setup change.
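Before turning to the library itself, the idea of typed, directional edges with bounded traversal can be illustrated without any dependencies. The following is a toy sketch only: it does not use or reproduce RudraDB-Opin's internals, and the `EDGES` list, the `connected` function, and the `capacity_planning` node are hypothetical names introduced for illustration. It walks a small graph breadth-first and stops expanding once the hop limit is reached.

```python
from collections import deque

# Hypothetical toy graph: (source, target, relationship type, strength)
EDGES = [
    ("timeout_symptom", "pool_setting", "causal", 0.9),
    ("pool_setting", "pool_fix", "causal", 0.9),
    ("timeout_symptom", "load_testing", "associative", 0.6),
    ("pool_fix", "capacity_planning", "associative", 0.5),
]


def connected(start, max_hops, types=None):
    """Breadth-first walk along directed edges; returns {node: hops} within max_hops."""
    adjacency = {}
    for src, dst, rel_type, _strength in EDGES:
        if types is None or rel_type in types:
            adjacency.setdefault(src, []).append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


all_types = connected("timeout_symptom", max_hops=2)
causal_only = connected("timeout_symptom", max_hops=2, types={"causal"})
print(all_types)
print(causal_only)
```

With a two-hop limit the fix is reachable from the symptom even though nothing about their text was compared, while the node three hops away is left out. Restricting the walk to causal edges narrows the result to the symptom, its cause, and its fix.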
Start by installing the package and loading your chunks:

```shell
pip install rudradb-opin
```

```python
import rudradb
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
db = rudradb.RudraDB()  # dimension auto-detected on first add

docs = {
    "timeout_symptom": "The service intermittently returns HTTP 504 timeout errors under load.",
    "pool_setting": "Connection pool exhaustion causes requests to queue until they time out.",
    "pool_fix": "Raise the maximum pool size and reduce idle connection lifetime.",
    "load_testing": "Reproduce timeouts by driving concurrent requests above steady-state traffic.",
}

for doc_id, text in docs.items():
    embedding = model.encode(text).astype(np.float32)
    db.add_vector(doc_id, embedding, {"text": text})

print(f"Embedding dimension auto-detected: {db.dimension()}D")
```

Next, model the connections you already understand about your own content. This is the step that separates relationship-aware retrieval from everything else: you encode the structure of the knowledge, not just its surface text.

```python
# The symptom is caused by pool exhaustion (cause-effect)
db.add_relationship("timeout_symptom", "pool_setting", "causal", 0.9)

# Pool exhaustion is resolved by the fix (problem-solution)
db.add_relationship("pool_setting", "pool_fix", "causal", 0.9)

# Load testing is the procedure used to reproduce the symptom (related procedure)
db.add_relationship("timeout_symptom", "load_testing", "associative", 0.6)
```

Now run a relationship-aware search. Enabling include_relationships tells the engine to rank and expand results using both vector similarity and the relationships you modeled, while max_hops bounds how far it will traverse:

```python
query = "Why does the service keep timing out and how do I fix it?"
q_emb = model.encode(query).astype(np.float32)

results = db.search(q_emb, rudradb.SearchParams(
    top_k=5,
    include_relationships=True,  # traverse modeled relationships, not just cosine distance
    max_hops=2,                  # reach context up to two relationships away
    relationship_weight=0.3      # blend similarity score with relationship strength
))
```

The query is most similar to the symptom passage, so similarity alone would return that passage and stop. Relationship-aware search instead follows the causal edges outward (symptom → cause → fix) and brings the remediation into the result set even though it is lexically distant from the question.

You can inspect exactly how a passage connects to the rest of the graph, which is invaluable when debugging retrieval quality:

```python
for vector, hops in db.get_connected_vectors("timeout_symptom", max_hops=2):
    reach = "direct match" if hops == 0 else f"{hops}-hop connection"
    print(f"{vector['id']:18} {reach}")
```

The retrieved passages (the matched chunk plus the chunks reached through its relationships) are then passed into the LLM prompt as grounded context, exactly as in any RAG pipeline. The difference is that the context is now complete with respect to the question.

Why Relationships Reduce Hallucination

Relationship-aware retrieval attacks hallucination at its source (incomplete or incoherent context) in three concrete ways.

It closes context gaps. Hallucination most often occurs when an answer requires a fact that retrieval failed to supply, forcing the model to invent it. By following hierarchical and causal relationships, the retriever delivers the prerequisite definition, the upstream cause, or the corresponding solution alongside the matched passage. The model is no longer asked to bridge a gap; the bridge was retrieved with the rest of the evidence.

It grounds answers in explicit, traceable connections. A similarity score is a statistical proximity, not a statement of fact.
A modeled relationship is an explicit assertion (this symptom is caused by that setting) authored against your real knowledge. When the answer follows a chain of declared relationships, every step of the supporting context can be traced back to a connection that someone deliberately encoded, rather than to incidental vector closeness. That traceability is what makes the retrieved context auditable.

It keeps the retrieved context coherent. Temporal and hierarchical relationships preserve order and structure. When a question concerns a multi-step process, sequential relationships ensure the steps arrive together and in order, instead of as a scattered set of independently top-ranked fragments. Coherent context produces coherent answers and removes a frequent trigger for the model to "smooth over" missing or out-of-order steps.

Why It Fetches More Relevant Information

There is a difference between similar and relevant. Similarity measures resemblance; relevance measures whether a passage actually helps answer the question. The two overlap, but they are not the same, and similarity-only retrieval optimizes for the wrong one whenever the truly useful passage does not resemble the query.

Relationship-aware retrieval recovers relevance that similarity ranking discards. A prerequisite concept, a downstream consequence, or a paired solution is often highly relevant while being only loosely similar. Because RudraDB-Opin can reach those passages through relationships (and blends relationship strength with similarity via relationship_weight), it surfaces context that a pure vector ranking would push far down the list or omit entirely. In practice, this means fewer "the answer was in the knowledge base, but the model never saw it" failures.

RudraDB-Opin vs. Traditional and Hybrid Vector Databases

It is worth being precise about what relationship-aware retrieval adds relative to the two retrieval architectures most teams already use.
| Capability | Traditional Vector Database | Hybrid Traditional Vector Database | RudraDB-Opin |
| --- | --- | --- | --- |
| Core retrieval primitive | Dense vector similarity (e.g., cosine) | Dense similarity + lexical/keyword (e.g., BM25) + metadata filters | Dense similarity + typed relationships + bounded multi-hop traversal |
| Model of connections between items | None; items are independent | None; items are still scored independently | First-class: 5 relationship types (semantic, hierarchical, temporal, causal, associative) |
| Retrieving related-but-dissimilar context | Missed | Caught only when vocabulary overlaps | Reached by following modeled relationships |
| Multi-hop context | Not supported | Not supported | Supported (up to 2 hops in the Opin edition) |
| Typical effect on RAG | Context gaps the model may fill by guessing | Fewer lexical-mismatch gaps; connection gaps remain | Prerequisite, causal, and sequential context retrieved together |
| Setup | Specify dimension, build index | Dimension + index + analyzers/weights | Zero-config; embedding dimension auto-detected |

A Traditional Vector Database ranks every chunk independently by embedding distance. A Hybrid Traditional Vector Database improves recall on vocabulary mismatch by adding keyword search and metadata filtering on top of similarity, a genuine improvement but one that still scores each item in isolation. Neither architecture has any notion that one chunk depends on, causes, or precedes another. RudraDB-Opin adds exactly that missing layer: an explicit, traversable model of the connections between items, which is precisely the structure RAG needs to retrieve complete and coherent context.

Where RudraDB-Opin Fits

RudraDB-Opin is the free edition, distributed under an MIT license and built for learning, tutorials, hackathons, and proof-of-concept work. Its 100,000-vector, 500,000-relationship capacity is generous enough to prototype a relationship-aware retriever, validate the pattern against your own content, and benchmark it against a similarity-only baseline.
It works on Windows, macOS, and Linux, requires only Python 3.8+ and NumPy, and integrates with the embedding stacks teams already use, including OpenAI, HuggingFace, Sentence Transformers, and LangChain. When a prototype outgrows that envelope, the data and the API carry forward to the full RudraDB for production scale, so the modeling work done at the prototype stage is not thrown away.

Conclusion

RAG quality is a retrieval problem before it is a model problem, and similarity-only retrieval has a structural blind spot: it cannot represent how pieces of knowledge connect. Relationship-aware retrieval closes that gap by treating connections (semantic, hierarchical, temporal, causal, and associative) as first-class, traversable data. The result is context that is more complete, more coherent, and more genuinely relevant, which is the most direct lever available for reducing hallucination in a grounded system. RudraDB-Opin makes the pattern tangible in a few lines of code, with the capacity to back a real prototype.

Key takeaways:

Most RAG hallucinations originate in retrieval, not generation; incomplete context forces the model to guess.
Similarity finds passages that look alike; relationships find passages that belong together, and both matter.
Modeling typed relationships lets retrieval follow causal, hierarchical, and sequential links to context that similarity ranking misses.
Traditional and hybrid vector databases score items independently; relationship-aware retrieval adds the connection layer RAG needs.

RudraDB-Opin: learn more, read the documentation, and install the free package at https://www.rudradb.com.
When I first started building AI applications, I kept hearing the same words everywhere: workflows, agents, and multi-agent systems. At first, they all sounded like different labels for the same thing. After all, in every case, you are still calling an LLM, sending some context, and getting something back. That assumption turns out to be one of the easiest ways to design the wrong system.

Once you start building real projects, the difference becomes very obvious. Some systems need strict control. Some need flexibility. Some need multiple specialized roles. If you choose the wrong model, you usually pay for it in cost, reliability, debugging pain, or unnecessary complexity.

This is the explanation I wish I had when I started. I want to keep it beginner-friendly, but also useful enough that you can apply it in real projects without walking away with the usual “everything is an agent” confusion.

Workflow vs Agent vs Multi-Agent System

The simplest way to understand the whole topic is this:

A workflow is when you decide the steps in advance.
An agent is a model that decides what to do next.
A multi-agent system is one in which multiple agents, usually with different roles, coordinate to solve a larger problem.

That core distinction aligns closely with external references: workflows follow predefined code paths, while agents dynamically direct their own tool usage and execution flow.

That sounds simple, but it becomes much clearer with a relatable example. Imagine you are ordering pizza.

In a workflow, the restaurant follows a script. They ask for size, toppings, crust, and address in a fixed sequence. It is fast, reliable, and predictable.

In an agent-style system, you might say, “I’m hungry, and I want something good for movie night,” and the system figures out whether you usually order vegetarian, whether you want something quick, whether it should ask a follow-up question, and what option best fits your past behavior.
In a multi-agent setup, one specialist handles the order, another checks ingredient availability, and another optimizes delivery timing. Each one does a narrower job, but together they solve a broader problem.

That is the real difference. The question is not whether all three use AI. The question is who is controlling the process.

What a Workflow Really Is

A workflow is the most structured option. You define the steps, the order, and often the failure points. The model may still do useful work inside the system, but the system itself is not making open-ended decisions about how to proceed.

Think of it like a recipe. Step one happens first. Step two happens second. If something goes wrong, you usually know where it happened.

A simple example is a blog post generator that deliberately separates outline generation, introduction writing, body drafting, and final assembly.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// Content blocks are a union type, so narrow to a text block before reading .text.
function firstText(content: Array<{ type: string; text?: string }>): string {
  const block = content[0];
  return block && block.type === 'text' ? (block.text ?? '') : '';
}

async function generateBlogPost(topic: string) {
  // Step 1: outline
  const outlineResponse = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      { role: 'user', content: `Create a blog post outline about: ${topic}` }
    ]
  });
  const outline = firstText(outlineResponse.content);
  console.log('Step 1: Outline created');

  // Step 2: introduction, based on the outline
  const introResponse = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      { role: 'user', content: `Based on this outline, write an introduction:\n\n${outline}` }
    ]
  });
  const intro = firstText(introResponse.content);
  console.log('Step 2: Introduction written');

  // Step 3: body, based on the same outline
  const bodyResponse = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    messages: [
      { role: 'user', content: `Based on this outline, write the body:\n\n${outline}` }
    ]
  });
  const body = firstText(bodyResponse.content);
  console.log('Step 3: Body written');

  // Final assembly
  return `${intro}\n\n${body}`;
}
```

The reason workflows dominate production is not that teams lack ambition. It is that predefined orchestration is easier to reason about. Predictable systems are easier to test, monitor, certify, and price. That is exactly why guidance around production AI systems keeps steering builders toward workflows first, especially for reliability-critical environments.

The referenced material also repeatedly points out that workflows are the better fit when requirements are stable, boundaries are clear, and reliability matters more than open-ended autonomy. That makes workflows a very strong fit for document processing, onboarding, report generation, fixed moderation pipelines, approval chains, and regulated systems.

What an Agent Really Is

An agent changes one important thing. Instead of hardcoding the order of operations, you give the model a goal, a set of tools, and enough context to decide what should happen next.

That is where the flexibility comes from. The model can inspect the task, choose a tool, look at the result, decide whether another tool is needed, and continue until it reaches a stopping point. That pattern is what makes an agent feel more like a smart assistant than a pipeline. The external guides describe this clearly as dynamic decision-making, autonomous tool selection, reasoning, and self-directed task execution.

A simple research assistant is a good example for beginners.
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const tools = [
  {
    name: 'search_web',
    description: 'Search the web for information about a topic',
    input_schema: {
      type: 'object' as const, // keep the literal type the SDK expects
      properties: { query: { type: 'string' } },
      required: ['query']
    }
  },
  {
    name: 'save_notes',
    description: 'Save research notes to a file',
    input_schema: {
      type: 'object' as const,
      properties: { notes: { type: 'string' } },
      required: ['notes']
    }
  }
];

// Stub tool implementations for the example.
async function searchWeb(query: string): Promise<string> {
  return `Results for ${query}`;
}

async function saveNotes(notes: string): Promise<void> {
  console.log(`Saved notes: ${notes.slice(0, 80)}...`);
}

async function researchAgent(topic: string) {
  const messages: any[] = [
    { role: 'user', content: `Research ${topic} and save comprehensive notes.` }
  ];

  let done = false;
  while (!done) {
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 4096,
      tools,
      messages
    });

    const toolUse: any = response.content.find(
      (block: any) => block.type === 'tool_use'
    );

    if (response.stop_reason !== 'tool_use' || !toolUse) {
      // The model answered without requesting a tool, so the run is over.
      done = true;
    } else if (toolUse.name === 'search_web') {
      // Run the search and feed the result back so the model can decide the next step.
      const results = await searchWeb(toolUse.input.query);
      messages.push({ role: 'assistant', content: response.content });
      messages.push({
        role: 'user',
        content: [
          { type: 'tool_result', tool_use_id: toolUse.id, content: results }
        ]
      });
    } else {
      // Saving notes ends the run; an unrecognized tool also stops the loop instead of spinning.
      if (toolUse.name === 'save_notes') {
        await saveNotes(toolUse.input.notes);
      }
      done = true;
    }
  }
}
```

What matters here is not the SDK syntax. What matters is that you did not hardcode “search first, summarize second, save last.” The agent decides that. It may search once. It may search five times. It may decide it has enough information early.

That is precisely why agents are useful for research, support, exploratory planning, and other tasks where you cannot fully predict the required path ahead of time.
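Because the agent, not the code, decides how many steps to take, a common safeguard is a hard cap on iterations. The sketch below shows that idea in isolation. It is written in Python with no SDK so that it runs without an API key, and every name in it (`run_agent`, `scripted_decide`, the `tools` dictionary) is hypothetical: `scripted_decide` is a scripted stand-in for the model call, not a real model.

```python
def run_agent(decide, tools, task, max_steps=5):
    """Let `decide` pick the next tool until it returns None or the step budget runs out."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = decide(history)  # stand-in for the model call
        if action is None:
            return history, "finished"
        name, arg = action
        history.append((name, tools[name](arg)))
    return history, "step budget exhausted"


def scripted_decide(history):
    """Scripted stand-in for a model: search twice, save notes once, then stop."""
    searches = sum(1 for name, _ in history if name == "search_web")
    if searches < 2:
        return ("search_web", f"query {searches + 1}")
    if not any(name == "save_notes" for name, _ in history):
        return ("save_notes", "summary of findings")
    return None


tools = {
    "search_web": lambda query: f"results for {query}",
    "save_notes": lambda notes: "saved",
}

history, status = run_agent(scripted_decide, tools, "Research vector databases")
print(status, "after", len(history) - 1, "tool calls")

_, capped_status = run_agent(scripted_decide, tools, "Research vector databases", max_steps=2)
print(capped_status)
```

The loop itself stays open-ended, but the caller gets a guaranteed upper bound on tool calls and an explicit status saying whether the agent stopped on its own or was cut off.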
The trade-off is that you lose some of the certainty that workflows give you. The number of tool calls can vary. The runtime can vary. The cost can vary. If something behaves strangely, you often need stronger logs and better observability to understand why.

Seeing the Difference Side by Side

A side-by-side review-analysis example shows the difference without abstract theory. Suppose the task is to analyze a customer review and generate a response. The workflow version might look like this.

```typescript
// callLLM is an assumed helper that sends one prompt and returns the model's text.
async function analyzeReviewWorkflow(review: string) {
  const sentiment = await callLLM(
    `Analyze sentiment of this review as positive, negative, or neutral: ${review}`
  );
  const topics = await callLLM(
    `Extract the main topics from this review: ${review}`
  );
  const response = await callLLM(
    `Generate a customer support response for a ${sentiment} review about ${topics}`
  );
  return { sentiment, topics, response };
}
```

This is clean and efficient. It makes the same three calls every time. The cost is predictable. The behavior is stable. It is also rigid. A weird review gets handled through the same path as a normal one.

Now compare that with an agent version.

```typescript
// runAgent is an assumed helper that runs a tool-use loop until the model stops.
async function analyzeReviewAgent(review: string) {
  return await runAgent({
    task: `Analyze this review and generate a support response: ${review}`,
    tools: [
      'check_sentiment',
      'extract_topics',
      'search_knowledge_base',
      'generate_response'
    ]
  });
}
```

Now the system can decide whether a highly emotional complaint requires a knowledge base lookup before responding, while a simple positive review may only require sentiment classification and a thank-you response. That flexibility is exactly what makes agents attractive. It is also what makes them less predictable.

This is one of the most important beginner lessons in the whole topic. A workflow handles every case with the same planned path.
An agent adapts its path to the case. When Workflows Are the Better Choice This is where most of the production reality sits. If you know the exact steps, a workflow is almost always the first thing you should build. If predictability matters, a workflow is usually safer. If cost matters, workflows are easier to manage because you know roughly how many model calls happen per run. For debugging, workflows are easier because every state transition is explicit. That is also why modern workflow-oriented systems emphasize type safety, checkpointing, durable execution, human-approval steps, and clear routing. Those capabilities are not flashy, but they are exactly what real teams need when a system runs in production for weeks or months. A customer onboarding pipeline is a simple example. TypeScript async function onboardCustomer(email: string) { await sendWelcomeEmail(email); await createAccount(email); await setupDefaultPreferences(email); await sendTutorial(email); } A document processing pipeline is another. TypeScript async function processDocument(pdfPath: string) { const text = await extractText(pdfPath); const summary = await summarize(text); const keywords = await extractKeywords(text); await saveToDatabase({ text, summary, keywords }); await notifyUser(); } A content moderation flow is another good fit. TypeScript async function moderatePost(post: string) { const isSpam = await checkSpam(post); const isToxic = await checkToxicity(post); return isSpam || isToxic ? 'reject' : 'approve'; } None of these tasks benefits much from letting the model invent the control flow on the fly. They benefit from clean orchestration. When Agents Are the Better Choice Agents make more sense when the task is open-ended, when the path cannot be fully predefined, or when adaptability matters more than deterministic execution. Customer support is a classic example because every issue arrives in a different way. 
Research is another example because you do not know in advance which leads will be useful. Trip planning is another because different users, constraints, budgets, dates, and preferences change the best route through the task. A travel helper captures this nicely.

TypeScript
async function travelAgent(request: string) {
  return await runAgent({
    task: `Help the user with this travel request: ${request}`,
    tools: [
      'search_flights',
      'search_hotels',
      'get_weather',
      'suggest_itinerary',
      'ask_followup_question'
    ]
  });
}

The system may begin by asking a clarifying question. It may check the weather before hotels. It may avoid hotel search entirely if the user says they are staying with friends. This is exactly the sort of context-dependent behavior that agents are designed for. The guides also specifically call out use cases like deep research, agentic RAG, customer support, virtual assistants, and coding assistants as agent-friendly territory.

What Multi-Agent Systems Add

Multi-agent systems take the idea one step further. Instead of having one agent handle everything, you split the work among multiple specialists. This matters when specialization actually improves the result. One agent might research. Another might write. Another might review or validate. The Inkeep article makes an important distinction: true multi-agent systems are not just a sequential workflow with different names for each step. The key idea is autonomous coordination between specialized agents, often through direct communication or delegated responsibilities. A simple content team example makes this concrete.

TypeScript
async function researchAgent(topic: string) {
  return callLLM(`Research ${topic}.
Return key facts, trends, and context.`); } async function writerAgent(research: string, topic: string) { return callLLM(`Using this research, write an article about ${topic}:\n${research}`); } async function editorAgent(article: string) { return callLLM(`Edit this article for clarity, accuracy, and flow:\n${article}`); } async function contentCreationTeam(topic: string) { const research = await researchAgent(topic); const draft = await writerAgent(research, topic); const final = await editorAgent(draft); return final; } This is still a simple coordinator-led version, but it shows the value of specialization. A more advanced system might allow the editor to request a revision from the writer, or the writer to request more supporting evidence from the researcher. That is where multi-agent systems start to feel like collaborative problem-solving rather than a chain of prompts. The caution here is important. Multi-agent systems are not “the next level” you should jump to just because they sound advanced. They introduce more moving parts, more coordination overhead, more debugging complexity, and higher cost. They are useful when the problem actually needs multiple kinds of expertise, not when you are just trying to make a simple app look more impressive. The Practical Decision Model A good beginner question is not “which one is the smartest?” It is “how much uncertainty does this task have, and who should own the decision-making?” If the task is well-defined and stable, start with a workflow. If the task is open-ended and the system needs to choose how to proceed, consider an agent. If the task genuinely benefits from multiple specialists with separate responsibilities, consider multiple agents. That decision model lines up closely with the source material as well. Use workflows when requirements are clear, control is important, cost matters, and debugging stays simple. 
Use agents when tasks are exploratory, human-like reasoning is valuable, and adaptability matters more than fixed control flow. Use multi-agent systems when a single reasoning unit is no longer sufficient to capture the problem's diversity. The Beginner Mistakes That Cost Time and Money The first mistake is using agents for simple tasks that should be handled by normal code or a fixed workflow. If you want to add two numbers, do not build an agent. If you want to categorize simple support tickets with a stable schema, start with a workflow. Not every AI problem needs autonomy. TypeScript function addNumbers(a: number, b: number) { return a + b; } The second mistake is forcing a workflow onto a task that clearly needs adaptation. Creative writing, research, and support escalation often branch in ways that are hard to encode cleanly in advance. If you keep adding if-statements and exception paths to rescue a rigid workflow, that is often a sign the task wants agent behavior. The third mistake is building multi-agent systems too early. Three agents for a simple email writer is usually just an expensive ceremony. You should earn that complexity by hitting a real need first. These mistakes sound obvious when written down, but they are very common because the AI space rewards novelty in demos more than maintainability in products. The Cost Conversation Matters More Than People Admit A workflow-based newsletter creator might always make three model calls, one for the intro, one for the main copy, and one for the closing section. That means the cost per run is fairly easy to estimate. TypeScript async function createNewsletter(topics: string[]) { const intro = await generateIntro(topics); const articles = await generateArticles(topics); const outro = await generateOutro(); return { intro, articles, outro }; } An agent-based newsletter creator might decide it needs extra research, then rewrite one section twice, then call another tool to validate tone. 
Sometimes that flexibility is useful, but it also means cost and latency can move around more than you expect. TypeScript async function newsletterAgent(topics: string[]) { return runAgent({ task: `Create a newsletter about these topics: ${topics.join(', ')}`, tools: ['research_topic', 'draft_section', 'revise_section', 'validate_tone'] }); } That does not automatically make agents bad. It just means the operational model is different. The broader production guidance on workflows versus agents keeps coming back to exactly this point: deterministic systems are easier to budget for, observe, and control. The Hybrid Model Is Usually the Best Answer This is probably the most useful real-world takeaway in the entire topic. You do not have to choose one pattern forever. Many successful systems use workflows to structure the outer system and agents only where flexibility is genuinely needed. The Prompt Engineering Guide explicitly recommends hybrid approaches, such as using workflows for structure and agents for open-ended subtasks. That pattern looks like this. TypeScript async function smartCustomerSupport(message: string) { const category = await categorize(message); if (category === 'simple_faq') { return faqWorkflow(message); } if (category === 'complex_issue') { return supportAgent(message); } return escalateToHuman(message); } This is a very practical architecture. The workflow gives you control, routing, and predictability. The agent only appears where variability is too high for rigid orchestration. That means you keep the system understandable while still benefiting from adaptive behavior. If you are building beginner-to-intermediate AI products, this is one of the best mental models to adopt early. A Cleaner Way to Think About Real Projects A document processor usually wants a workflow because the same stages repeat every time. A support assistant may want an agent because issues differ, and tool selection depends on context. 
A software delivery assistant might eventually become a multi-agent system if planning, implementation, testing, and review are separate responsibilities that benefit from specialization. Here is a simplified example of that last case.

TypeScript
async function developFeature(requirement: string) {
  const specs = await productManagerAgent(requirement);
  const code = await developerAgent(specs);
  const testResults = await qaAgent(code);
  if (!testResults.passed) {
    // Pass the failing code along with the issues so the developer agent has something to revise.
    return developerAgent(`Fix these issues:\n${testResults.issues}\n\nCode:\n${code}`);
  }
  return code;
}

This kind of setup can make sense, but only if the complexity is real. It should come from the nature of the work, not from the desire to use more agents.

Conclusion

If you are just starting, build a workflow first. That advice is not anti-agent. It is pro-clarity. Workflows teach you how to decompose tasks, define boundaries, measure outcomes, and understand where AI actually adds value. Once you understand the stable parts of your system, it becomes much easier to identify the unstable parts that may benefit from an agent. Once you understand where one agent becomes overloaded, it becomes much easier to justify multiple specialized agents. That progression is healthier than starting with maximum autonomy and then trying to reverse-engineer stability later.

So my practical rule is simple. If the task can be described as a sequence of reliable steps, use a workflow. If the system needs to decide the steps as it goes, use an agent. If the problem truly needs multiple specialized minds working together, then and only then reach for a multi-agent design. The best AI systems are not the ones with the most autonomy. They are the ones that stay understandable when something goes wrong.
Why Long Chats Need Session-Level Guardrails (CRA)

Who this is for: Anyone building chat features, support bots, internal Q&A, coaching tools, RAG assistants.

The Usual Setup (and What It Misses)

A typical flow:

1. User sends a message.
2. You run moderation, rules, or a small model on that message (sometimes the reply too).
3. If it passes, the big model answers.

That is per message. It does not really “remember” the story of the chat.

In a long chat:

- Message 5 looks normal.
- Message 12 still passes your keyword list.
- By message 20, something is wrong only if you compare it to how the chat started.

So you can pass every single check and still end up with a bad session. That gap is what we call CRA: risk that adds up across turns, not in one obvious line.

Figure 1: Each turn can look “green” while the overall thread is not.

CRA in Plain English

CRA = Conversational Risk Accumulation

Idea: Each turn might look okay on its own, but together they break the purpose of the chat or what your company is okay with.

What to build: Keep a little session memory (not the full transcript in logs; think IDs, hashes, and scores). After each assistant reply, update a few numbers that describe “how this session feels right now.” Those numbers are hints for dashboards, alerts, and gentle UI, not a courtroom verdict.

Three Simple Scores + One Total (Example)

We use a small, fixed set of scores and one combined score. Version tag in code: cra_telemetry_v1.

Figure 2: Three inputs, one combined CRA score.

| Score | Plain meaning | How you might compute it (conceptually) |
| S1 | Topic drift | Compare the user’s recent text to how the chat started (or a stated goal). If they wander far from that, S1 goes up. |
| S2 | Sensitive-looking replies | The assistant’s answer looks like it contains patterns you care about (fake email shapes, “API key” wording, etc.). This means “flag for review,” not “we proved a leak.” |
| S3 | Refusal tone shifting | Track refusal-style phrases in the assistant’s answers over time. If refusals seem to soften late in the thread, S3 captures that shape. |
| CRA | Overall session risk | A weighted sum of S1, S2, and S3, plus a small extra bump if the user or assistant text looks like prompt injection playbooks. Example weights we used: 35% S1, 45% S2, 20% S3. |

Rule of thumb: If you cannot explain a score in one short sentence to a product manager, do not use it to auto-block users.

Hard Guardrails = Simple, Fast, “No”

Hard guardrails are rules, not vibes. They should be cheap and run before you waste tokens.

Examples:

- Max request size: reject giant payloads (HTTP 413).
- Rate limits: cap requests per IP so one client cannot drain your budget (429).
- Known-bad phrases: block obvious “ignore all previous instructions” junk (400).
- “Don’t paste secrets”: block prompts that look like “here is my SSN” (400) with a clear error.
- Lock down outputs: if your product only allows certain actions, check model output and tool calls against an allowlist before anything runs.

These are not CRA. They are basics. CRA sits beside them.

Figure 3: Hard = block or validate. Soft = warn, log, nudge.

Soft Guardrails = CRA-Friendly, “Heads Up”

Soft means: warn, log, maybe show a banner, not silent blocking.

After a response, the API can add fields such as:

- cra_soft_notices: short text for humans (“high drift”, “sensitive-looking wording”, …).
- cra_signals: numbers for debugging: S1, S2, S3, CRA, turn count.

Why start soft: Rules and heuristics misfire. A user might ask for fake email examples for a demo; S2 might spike on purpose. That is why the score is a signal, not proof.

Bonus: Cache Duplicate Questions (Save Money)

If someone double-clicks Send or retries the same text, do not call the model twice.

Cache key idea:

Python
normalize(question) + mode + endpoint

Cache the JSON answer for a few minutes.
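The cache key idea above can be sketched as a small in-memory cache with a time-to-live. This is a minimal illustration, not the article's implementation: `AnswerCache`, `cacheKey`, `answerWithCache`, the `|` separator, and the five-minute TTL are all assumptions, and `callModel` is a stub standing in for the real model request. The current time is passed in explicitly so the behavior is easy to test; a real handler would pass `Date.now()`.

```typescript
// Hypothetical names throughout; the TTL and key format are assumptions.
interface CachedAnswer {
  body: string;      // serialized JSON answer
  expiresAt: number; // epoch milliseconds
}

// Collapse case and whitespace so trivially different retries share one key.
function normalize(question: string): string {
  return question.trim().toLowerCase().replace(/\s+/g, ' ');
}

function cacheKey(question: string, mode: string, endpoint: string): string {
  return [normalize(question), mode, endpoint].join('|');
}

class AnswerCache {
  private entries: Record<string, CachedAnswer> = {};

  constructor(private ttlMs: number) {}

  get(key: string, now: number): string | null {
    const hit = this.entries[key];
    if (!hit) return null;
    if (hit.expiresAt <= now) {
      delete this.entries[key]; // expired: drop it and treat as a miss
      return null;
    }
    return hit.body;
  }

  set(key: string, body: string, now: number): void {
    this.entries[key] = { body: body, expiresAt: now + this.ttlMs };
  }
}

// Wraps a model call; `callModel` is a placeholder for the real request.
function answerWithCache(
  cache: AnswerCache,
  question: string,
  mode: string,
  endpoint: string,
  callModel: (q: string) => string,
  now: number
): { answer: string; cached: boolean } {
  const key = cacheKey(question, mode, endpoint);
  const hit = cache.get(key, now);
  if (hit !== null) return { answer: hit, cached: true };
  const fresh = callModel(question);
  cache.set(key, fresh, now);
  return { answer: fresh, cached: false };
}

let modelCalls = 0;
const stubModel = (q: string) => { modelCalls++; return 'answer to: ' + q; };
const answerCache = new AnswerCache(5 * 60 * 1000); // five-minute TTL
const first = answerWithCache(answerCache, 'What is CRA?', 'chat', '/ask', stubModel, 0);
const retry = answerWithCache(answerCache, '  what is  CRA? ', 'chat', '/ask', stubModel, 1000);
const later = answerWithCache(answerCache, 'What is CRA?', 'chat', '/ask', stubModel, 10 * 60 * 1000);
console.log(first.cached, retry.cached, later.cached, modelCalls);
```

The `cached` flag in the return value is what a UI would read to show a "from cache" label.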
Mark responses with something like cached: true so the UI can say “from cache.”

Browser Tip: Don’t Mix Up “New Chat” and Old Intent

If S1 uses “first message of this session” as the anchor, browser storage can fool you: a new tab can look like a new thread while an old “first message” is still stored.

Fixes:

- Store the anchor per session_id, not one global value.
- Expire or rotate the browser session after idle time so deploys and stale tabs do not reuse the wrong anchor.

Telemetry vs. Guardrails (Two Different Jobs)

| | Telemetry | Guardrail |
| Job | Measure and learn | Block or change behavior |
| When it hurts you | Too many logs, privacy | False positives, angry users |
| CRA | Good fit | Use soft first; hard only after review |

In logs, avoid raw secrets. Prefer hashes, lengths, and labels (channel, product area).

Three Lines for Your Security Reviewer

- CRA is about conversation behavior over time, not a replacement for database security or tool-permission design.
- Labels for “bad session” are rare in the real world; use CRA to prioritize review, not as automatic guilt.
- If weights are public, people might game them; keep basic hard rules and spot checks anyway.

Rollout Order (Keep It Boring)

1. Ship hard limits (size, rate, obvious injection, output checks).
2. Add session logging with safe IDs.
3. Show soft notices only inside internal tools first.
4. Tune thresholds on real traffic.
5. Only then add hard session actions (pause tools, re-auth, etc.).

Takeaway

One-message checks are not enough for long chats. CRA gives you a simple story and a small set of session scores. Hard rules stop obvious abuse; soft CRA helps you see drift before it becomes an incident. Start with telemetry. Add blocking only when you understand the false positives.

About the author: Sanjay Mishra is author of two books, The SQL Universe and Oracle Database Performance Tuning: A Checklist Approach. His research spans RAG architectures, NL2SQL, LLM safety, and enterprise AI governance, with work published in IEEE Access, Springer LNNS, and SSRN. He speaks regularly at universities and industry events on applied AI and data engineering.

Tags / topics: #LLM #Security #Guardrails #Observability #OpenAI #Architecture #Chatbots
This is the first article in a 6-part series on building practical, responsible AI audit workflows with RAI Audit Kit, an open-source Python package suite.

The series will move from foundational AI systems to more advanced and production-oriented audit workflows:

1. Launching RAI Audit Kit: why evidence-grade responsible AI audits matter
2. Auditing ML systems: fairness, drift, data quality, and robustness
3. Auditing deep learning systems: image models, medical imaging, robustness, and explainability
4. Auditing LLM and RAG systems: prompt injection, faithfulness, citations, and retrieval security
5. Auditing AI agents: tool use, memory, permissions, and trace safety
6. Adding audit gates to CI/CD: turning audit results into engineering controls

This first article introduces the project, the problem it is designed to solve, and how the package suite is structured.

Why Responsible AI Audits Need Better Tooling

AI systems are becoming more complex. A few years ago, many teams mainly worried about model accuracy. Today, the picture is much broader. Modern AI systems may include tabular machine learning models, deep learning pipelines, LLM applications, RAG systems, and AI agents that call tools or use memory.

That means AI evaluation can no longer stop at: “Is the model accurate?”

A better question is: “Can we show evidence that this AI system was evaluated for fairness, robustness, drift, data quality, safety, security, and traceability?”

In many teams, this evidence is scattered across notebooks, scripts, screenshots, spreadsheets, and manual review documents. That makes audits hard to reproduce and harder to compare across versions. Responsible AI needs to become part of normal engineering workflows. That is why I built the RAI Audit Kit.

What Is the RAI Audit Kit?

RAI Audit Kit is an open-source Python package suite for responsible, secure, and trustworthy AI audits.
The goal is to help developers and AI teams run repeatable audits, generate structured findings, preserve evidence, and export useful reports.

It is designed to support different types of AI systems, including:

- Classical machine learning
- Deep learning
- LLM applications
- RAG systems
- Agentic AI workflows

The package can help generate outputs such as findings, evidence manifests, model cards, audit reports, and CI/CD-friendly results.

Install:

PowerShell
pip install rai-audit-kit

Full install:

PowerShell
pip install "rai-audit-kit[all]"

Package Architecture

RAI Audit Kit is organized as a suite of smaller packages:

| Package | Purpose |
| rai-audit-core | Reports, findings, evidence, model cards, audit history, and CI gates |
| rai-audit-ml | Fairness, drift, data quality, and robustness checks for tabular ML |
| rai-audit-dl | Deep learning, image, medical imaging, robustness, and explainability audits |
| rai-audit-llm | LLM and RAG audits for prompt injection, toxicity, faithfulness, citations, and retrieval security |
| rai-audit-agents | Agent audits for tools, memory, permissions, prompt injection, and trace behavior |
| rai-audit-kit | Meta-package for unified installation and CLI usage |

The structure is modular because responsible AI is not a single problem. A tabular ML system has different risks from a deep learning model. A RAG application has different risks from an autonomous agent. The suite is designed to keep those workflows connected while still allowing each package to focus on its own risk area.

Quick Start

A basic CLI workflow looks like this:

PowerShell
rai-audit init --project responsible-ai-demo
rai-audit run --config audit.yaml

For tabular ML, the Python API can look like this:

Python
from rai_audit.ml import ClassificationAudit

report = ClassificationAudit(
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_df,
).run()

report.to_html("audit_report.html")

The goal is to move from one-off evaluation scripts to repeatable audit runs that produce reviewable artifacts.

What Can It Audit?
RAI Audit Kit is designed around the idea that different AI systems need different audit lenses.

- For machine learning systems, the focus is on fairness, drift, data quality, and robustness. A model may perform well overall but still fail for certain subgroups or become unreliable after deployment.
- For deep learning systems, especially image and medical imaging models, the focus shifts toward robustness, explainability, patient leakage, site-level differences, and class-level performance.
- For LLM and RAG systems, the audit scope expands to prompt injection, unsafe output, toxicity, faithfulness, citation quality, retrieval quality, and retrieval security.
- For AI agents, the focus becomes tool use, memory, permissions, trace completeness, and prompt injection through external sources such as tools, webpages, retrieval systems, or email content.

This article will not go deep into each area. Each one will be covered separately in the rest of the series.

Why Evidence Matters

Responsible AI audits should not disappear inside notebooks. A useful audit should answer:

- What checks were run?
- What data or predictions were evaluated?
- What findings were generated?
- What evidence supports each finding?
- Which artifacts were exported?
- Can the audit be repeated later?
- Can this be integrated into CI/CD?

This evidence-first mindset is one of the main ideas behind the RAI Audit Kit. Reports can be exported in formats such as HTML, Markdown, and JSON. This makes the results useful for developers, reviewers, governance teams, and automation workflows.

A simple audit flow may look like this:

Plain Text
Run evaluation
  ↓
Run responsible AI audit
  ↓
Generate findings
  ↓
Preserve evidence
  ↓
Export reports
  ↓
Review or gate deployment

This does not replace human judgment. It gives reviewers better evidence to work with.

Not a Compliance Shortcut

It is important to be clear about the scope. RAI Audit Kit is a technical audit and reporting toolkit.
It can help generate structured evidence and standards-oriented summaries, but it does not automatically certify that a system is compliant with any law, regulation, or internal policy. The goal is to support better review, not replace legal review, domain expertise, risk management, or organizational accountability. Responsible AI tools should help teams ask better questions and preserve better evidence. They should not create false confidence.

Why This Project Matters

Responsible AI needs practical engineering tools. Teams should be able to audit models, preserve evidence, compare results, and include risk checks in their development workflow. RAI Audit Kit is an early step in that direction. It brings together audits for ML, deep learning, LLMs, RAG systems, and AI agents under one Python suite.

The core idea is simple: Responsible AI should be repeatable, evidence-backed, and built into the way we engineer AI systems.

What’s Next in This Series

In the next article, I will focus on auditing machine learning systems for fairness, drift, data quality, and robustness using the RAI Audit Kit. We will look at why accuracy alone is not enough, how subgroup performance can hide model risk, and how audit outputs can make ML review more structured and repeatable.

Project Links

- GitHub: https://github.com/SaiTeja-Erukude/rai-audit
- Install: pip install rai-audit-kit

If you work on responsible AI, AI safety, LLM security, RAG systems, agentic AI, or MLOps, I would love feedback, ideas, and contributions.
AI-generated code is now the norm in frontend development. A programmer can request a React component, form, table, modal, or even a full-page layout and have something usable in seconds. That speed is real. Research on GitHub Copilot found that developers using the tool completed a coding task 55.8% faster, which helps explain why software teams have become more inclined to use AI coding assistants.

But speed at generation time is not the same as speed to production. Frontend code has to cope with real user interactions, devices, browsers, accessibility needs, API failures, product changes, and security requirements. AI can create the code fast, yet the hidden cost appears only after the initial draft, when code review, bug fixing, performance tuning, accessibility fixes, design-system alignment, and maintenance still have to be done.

The First Draft Is Not the Final Cost

Teams often make the mistake of judging AI-generated frontend code only by how quickly the first version appeared. A component that takes 30 seconds to generate can still take hours to make production-ready.

Frontend code is especially vulnerable because "working" is often judged visually. If the page renders and the button clicks, the code looks finished. But frontend quality is much more than rendering. Does the component handle loading, empty, and error states? Does it work with keyboard navigation? Does it respect the design system? Does it prevent unnecessary re-renders? AI usually gives you only the happy path. Production requires the unhappy paths as well.

Review Effort Moves, It Does Not Disappear

AI does not remove engineering judgment. Instead of spending their time writing code, developers now spend it evaluating code written for them.
Developers have to review code they did not fully write, work out which assumptions it makes and validate them, and decide whether the generated output fits the existing architecture.

According to Stack Overflow's 2025 developer survey, 84% of developers said they use or plan to use AI tools, yet 46% said they do not trust the accuracy of AI output. That gap matters. If developers do not trust the output, they need additional time to validate it. The risk is that teams mark AI-generated code as "complete" without accounting for the time and effort needed to make sure it is safe, readable, and maintainable.

Accessibility Is Easy to Miss

Accessibility is one of the easiest areas for AI-generated frontend code to get almost right. A modal may look correct when it opens yet lack focus trapping. A dropdown may render correctly yet fail keyboard navigation. A custom button may be a div with an onClick handler rather than a semantic button element.

These problems are not just visual. WCAG 2.2 is a W3C Recommendation and provides a stable standard for making web content accessible. If AI-generated frontend code ignores semantic HTML, ARIA rules, or keyboard behavior, the team inherits a significant amount of accessibility debt, and that debt is hard to see in a quick demo.

To address this, include accessibility in the prompt, the review checklist, and the testing process. Ask for output that uses semantic HTML, supports keyboard navigation, and has screen-reader-friendly labels. Review the output manually regardless.

Performance Debt Can Be Generated Too

AI can generate performance problems in a blink.
It may add unnecessary libraries, create large components, overuse client-side state, skip memoization where it matters, or put heavy calculations directly in the render path.

For frontend teams, this is important because performance is part of the user experience. A page that feels fine in development can be unresponsive on a low-end mobile device or a slow network. AI-generated code is generally optimized for clarity in isolation, not for bundle size, hydration costs, rendering behavior, or Core Web Vitals. Generated code should therefore be held to the same performance standards as human-written code: bundle analysis, lazy loading where applicable, stable component boundaries, and measurements in real environments.

Design-System Drift Gets Worse

Many frontend organizations rely on design systems for product consistency. AI-generated code can quietly work against this. The model may write raw CSS rather than using the tokens the team has defined. It may create a custom modal when the company already has one. It may use spacing, colors, typography, or interaction patterns that look fine in isolation but do not match the product.

The result is design-system drift. Every custom component is another surface area for bugs, accessibility issues, and future migrations. Alignment with the design system should be treated as a prerequisite rather than an afterthought.

Security and Data Handling Still Matter

Most frontend code handles information that should be protected from unauthorized access (for example, tokens, user IDs, analytics event data, and API response data). AI-generated code can accidentally normalize unsafe patterns such as logging every value of an object, exposing internal error messages, saving sensitive data in local storage, or sending more data to third-party services than is required.
OWASP's Top 10 for Large Language Model Applications lists insecure output handling as one of the major risks and recommends validating and controlling LLM output before accepting it. The same is true of generated code. Just because AI-generated code compiles does not mean it is secure. Treat generated code as you would any other untrusted contribution until it has been reviewed for security.

Maintainability Is the Real Test

The real cost of frontend code appears after the first feature request. Can another engineer understand the component easily? Can it be extended without a rewrite? Does it follow existing patterns? Are the edge cases tested?

AI-generated code can be verbose, generic, or oddly structured. It may solve the immediate problem without fitting the broader codebase. Generated code should not bypass linting, testing, accessibility checks, or architectural review, and it should not add new libraries without approval.

Use AI as a Drafting Tool

AI should be viewed as a drafting tool rather than an autonomous engineer. It can generate boilerplate, suggest implementation ideas, draft test cases, explain unfamiliar code, and produce a base version of repetitive UI. The engineering team still owns the final design.

A workable workflow starts with a well-defined prompt that specifies the framework, the rules of the organization's design system, the accessibility requirements, the states the component must handle, and the performance constraints. The team then reviews the AI-generated work against its own standards, refactors it into existing patterns, adds tests, measures performance, and finally assesses accessibility.
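One way to make the loading, empty, and error states explicit in a prompt or a review checklist is to model them as a discriminated union, so the compiler flags any state a component forgets to handle. The sketch below is illustrative only: `ViewState`, `toViewState`, and `renderText` are hypothetical names, and the render function returns plain strings to stay framework-agnostic.

```typescript
// Hypothetical model of the states a data-driven component must handle.
type ViewState<T> =
  | { status: 'loading' }
  | { status: 'error'; message: string }
  | { status: 'empty' }
  | { status: 'ready'; items: T[] };

// Maps a raw fetch outcome onto exactly one state.
function toViewState<T>(
  loading: boolean,
  error: string | null,
  items: T[] | null
): ViewState<T> {
  if (loading) return { status: 'loading' };
  if (error !== null) return { status: 'error', message: error };
  if (items === null || items.length === 0) return { status: 'empty' };
  return { status: 'ready', items: items };
}

// Stand-in for a component's render function.
function renderText(state: ViewState<string>): string {
  switch (state.status) {
    case 'loading':
      return 'Loading...';
    case 'error':
      return 'Something went wrong: ' + state.message;
    case 'empty':
      return 'No results yet.';
    case 'ready':
      return state.items.length + ' item(s): ' + state.items.join(', ');
    default: {
      // Adding a new status without a matching case makes this assignment fail to compile.
      const unreachable: never = state;
      return unreachable;
    }
  }
}

console.log(renderText(toViewState<string>(true, null, null)));
console.log(renderText(toViewState<string>(false, 'timeout', null)));
console.log(renderText(toViewState<string>(false, null, [])));
console.log(renderText(toViewState<string>(false, null, ['a', 'b'])));
```

A reviewer can then check generated components against the union: if a component only handles the `ready` case, the gap shows up as a missing branch rather than as a production bug.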
Conclusion

AI-generated frontend code is not free. It is borrowed: you get some time up front, and you pay it back later with interest in the form of closer review, testing, accessibility and performance work, security reviews, and ongoing maintenance. Used carefully and with the right expectations, AI can make frontend teams meaningfully faster than they are today. Used carelessly, it introduces a new kind of technical debt: code that looks finished before it is actually ready.
If you've worked on a data platform for more than a few years, you've almost certainly built the same pipeline twice. First, the way the team wrote pipelines in 2019: a notebook here, a Python script there, an Airflow DAG to glue it all together, and a long document explaining the order things had to run in. Then the rewrite, two years later, when somebody quit, and nobody could remember why a particular task had a sleep(180) in it.

Lakeflow is Databricks' answer to that pattern, and the shift it's pushing for is bigger than the marketing makes it sound. It isn't a new orchestrator. It's a move from imperative pipelines, where you write the steps, to declarative pipelines, where you write the destination and let the engine figure out the steps. What follows is the practical version of that shift: what's actually different, where the gains are real, and how to migrate without ending up with a half-converted lakehouse.

1. The Imperative ETL Trap: Why Traditional Pipelines Are Hitting a Wall

Imperative ETL is a fancy name for the way most pipelines are still written: a sequence of steps, hand-ordered, run on a schedule. It works fine until it doesn't, and the failure modes are remarkably consistent across teams I've worked with:

- The DAG outgrows its author. The person who wrote the original 30-task Airflow DAG moves teams. The next engineer is afraid to delete anything because they can't tell which tasks are still needed.
- Backfills are surgical operations. Re-running yesterday means manually figuring out which downstream tables are stale, in what order. Half the team's tribal knowledge lives in Slack threads about backfills.
- Quality checks are bolted on. Data quality lives in a separate framework, often a separate codebase, often run by a separate team. By the time a check fails, the bad data is already in the warehouse.
- Lineage is a slide in a deck. Whatever lineage exists was drawn by hand for a quarterly review and was out of date the day after.
None of these are bugs in the imperative model. They're features of it. When you write the steps, you own the steps, including all the cross-task assumptions the engine doesn't know about.

2. What "Declarative" Actually Means in Lakeflow

Declarative is one of those words that gets used loosely. In Lakeflow Pipelines, it has a specific, narrow meaning: you describe each table's logical definition (its source query, its expected schema, its quality rules), and the engine determines execution. It picks the order. It decides which tables are streaming and which are batch. It scales the cluster. It figures out incremental processing. It produces lineage automatically because lineage is now a derived property of the dependency graph it built for you.

What it isn't:

- It isn't "low-code." You're still writing SQL or PySpark. The thing that's gone is the orchestration boilerplate around it.
- It isn't a magic upgrade for any pipeline. Pipelines that genuinely need procedural logic (multi-step API calls with branching, complex pre/post-processing) still belong in Lakeflow Jobs (the orchestrator) or even external code, called from the pipeline.
- It isn't free. There's a learning curve in stopping yourself from writing the steps you used to write. The first month, most teams over-specify.

The mental shift: stop describing how the data should flow. Describe what each table is. Lakeflow figures out the flow.

3. The Lakeflow Architecture: Connect, Pipelines, Jobs

Lakeflow is three components that share one governance layer (Unity Catalog). They map roughly onto the three traditional layers of a pipeline (ingestion, transformation, orchestration) but with the imperative wiring removed.

Figure 1. Lakeflow's three components on top of Unity Catalog. Pipelines is the declarative core; Connect feeds it, Jobs schedules it.

A few practical points about this picture.
Lakeflow Connect is where managed connectors live (Salesforce, Workday, Postgres CDC, and a steadily growing list); it's the part you reach for instead of writing yet another ingestion script. Lakeflow Pipelines is where the declarative paradigm actually lives; every other component is conventional. And Lakeflow Jobs is the part that looks most like Airflow: task graphs, retries, alerts. The trick is that the things inside a Pipelines task aren't tasks themselves. They're table definitions, and the engine builds the internal DAG from their dependencies.

4. Translating an Imperative Pipeline to a Declarative One

The clearest way to feel the difference is to look at the same logic written both ways. Imagine a small bronze→silver→gold pipeline for transactions: ingest raw files, deduplicate, then aggregate to daily totals.

4a. The imperative version (notebook + Airflow style)

Python

    # bronze.py
    df = spark.read.json("s3://landing/txns/")
    df.write.format("delta").mode("append").saveAsTable("bronze.txns")

    # silver.py -- runs after bronze finishes
    raw = spark.table("bronze.txns")
    clean = (raw.dropDuplicates(["txn_id"])
                .filter("amount IS NOT NULL"))
    clean.write.format("delta").mode("overwrite").saveAsTable("silver.txns")

    # gold.py -- runs after silver finishes
    agg = (spark.table("silver.txns")
                .groupBy("ingest_date", "account_id")
                .sum("amount")
                .withColumnRenamed("sum(amount)", "daily_total"))
    agg.write.format("delta").mode("overwrite").saveAsTable("gold.daily_totals")

    # airflow_dag.py -- the part that actually controls execution
    # (task definitions omitted; only the hand-written ordering is shown)
    bronze_task >> silver_task >> gold_task

4b. The same logic, declared in a Lakeflow Pipeline

Python

    import dlt
    from pyspark.sql.functions import sum as _sum

    @dlt.table(
        name="bronze_txns",
        comment="Raw transactions landed from S3.",
    )
    def bronze_txns():
        # Auto Loader: incrementally picks up new JSON files from the landing path
        return (spark.readStream
                     .format("cloudFiles")
                     .option("cloudFiles.format", "json")
                     .load("s3://landing/txns/"))

    @dlt.table(name="silver_txns", comment="Deduplicated, validated transactions.")
    @dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL")
    # Despite its name, this expectation checks that txn_id is present;
    # uniqueness itself is enforced by dropDuplicates below.
    @dlt.expect("unique_txn", "txn_id IS NOT NULL")
    def silver_txns():
        return (dlt.read_stream("bronze_txns")
                   .dropDuplicates(["txn_id"]))

    @dlt.table(name="gold_daily_totals")
    def gold_daily_totals():
        return (dlt.read("silver_txns")
                   .groupBy("ingest_date", "account_id")
                   .agg(_sum("amount").alias("daily_total")))

Two things vanished in the rewrite. There is no DAG file, because the dependencies are inferred from dlt.read / dlt.read_stream calls. There is no separate data quality framework; quality lives next to the table definition, where it belongs. The engine decides what's streaming and what's batch from the calls themselves: bronze is a stream, silver is a stream of the bronze stream, gold is a batch over silver. None of that ordering is in the code I wrote.

5. Quality, Lineage, and Operational Visibility for Free

The expectations decorators above (@dlt.expect, @dlt.expect_or_drop, and the stricter @dlt.expect_or_fail) are not just convenience syntax; they become first-class objects in the pipeline.
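To make the difference between the three decorators concrete, here is a minimal plain-Python simulation of their row-level behavior. It is a sketch, not the dlt API: `apply_expectation`, `ExpectationResult`, `ExpectationFailed`, and the `on_violation` values are hypothetical names chosen for illustration, and the real engine evaluates SQL expressions over Spark data rather than Python predicates over dicts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Row = dict


@dataclass
class ExpectationResult:
    """Per-expectation counters, analogous to what a pipeline run reports."""
    name: str
    passed: int = 0
    failed: int = 0


class ExpectationFailed(Exception):
    """Stands in for an update aborted by a 'fail' expectation (hypothetical)."""


def apply_expectation(
    rows: Iterable[Row],
    name: str,
    predicate: Callable[[Row], bool],
    on_violation: str = "warn",
) -> Tuple[List[Row], ExpectationResult]:
    """Simulate one expectation.

    on_violation:
      "warn" -> keep failing rows, only count them   (like @dlt.expect)
      "drop" -> remove failing rows from the output  (like @dlt.expect_or_drop)
      "fail" -> abort on the first failing row       (like @dlt.expect_or_fail)
    """
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown on_violation: {on_violation!r}")

    result = ExpectationResult(name)
    kept: List[Row] = []
    for row in rows:
        if predicate(row):
            result.passed += 1
            kept.append(row)
            continue
        result.failed += 1
        if on_violation == "fail":
            raise ExpectationFailed(f"{name} violated by {row!r}")
        if on_violation == "warn":
            kept.append(row)  # row is kept; the violation is only recorded
        # "drop": the failing row is simply not appended
    return kept, result


# Illustrative data: the second row violates "amount IS NOT NULL".
rows = [{"txn_id": 1, "amount": 10.0}, {"txn_id": 2, "amount": None}]
not_null_amount = lambda r: r["amount"] is not None

kept_warn, stats_warn = apply_expectation(rows, "non_null_amount", not_null_amount, "warn")
kept_drop, stats_drop = apply_expectation(rows, "non_null_amount", not_null_amount, "drop")
print(len(kept_warn), len(kept_drop), stats_drop)
```

In both modes the counters are the same (one passed, one failed); what changes is whether the failing row reaches the target table. That is why the choice of decorator is a data-contract decision rather than a logging preference.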
Every run produces a per-expectation pass/fail count, queryable directly:

SQL

    -- How many silver rows failed each expectation, per run, last 7 days.
    -- The flat column names below are a simplification: in the actual event log,
    -- expectation metrics are nested inside the JSON `details` column of
    -- flow_progress events and have to be unpacked before they can be selected.
    SELECT pipeline_run_id,
           flow_name,
           expectation_name,
           passed_records,
           failed_records,
           dropped_records
    FROM event_log("<pipeline-id>")
    WHERE event_type = 'flow_progress'
      AND timestamp >= current_timestamp() - INTERVAL 7 DAYS
    ORDER BY timestamp DESC;

Lineage shows up automatically in Unity Catalog, both the table-level edges (gold_daily_totals depends on silver_txns) and column-level edges (gold's daily_total derives from silver's amount). Operationally, this is the change that has the largest day-to-day impact: when somebody asks "what does this column mean and where did it come from," you stop having to guess.

What this replaces: Great Expectations runs scheduled separately, OpenLineage stitched together by hand, and a homegrown observability dashboard reading task logs. All three of those projects either go away or shrink dramatically.

6. Migration Strategy: How Teams Actually Move Off Imperative Pipelines

I've not seen a successful big-bang migration. The pattern that works is layered:

Phase 1: New pipelines only

Make Lakeflow Pipelines the default for any new pipeline. This sounds obvious; the discipline is in saying no when somebody wants to add "just one more" Airflow DAG to the imperative side because it's faster this week.

Phase 2: Convert the painful ones

Pick the existing pipelines that hurt the most: the ones with the longest backfill stories, the most ad-hoc quality checks, the worst lineage gaps. Those are the ones where the declarative model pays for the rewrite cost fastest. Don't start with the easy ones; their owners won't thank you for the disruption.

Phase 3: Retire the orchestration boilerplate

Once a critical mass of pipelines has moved over, you can shrink (or in many cases delete) Airflow setups, custom dependency-tracking tools, and the side projects that grew up around imperative ETL.
This is the phase where the cost savings actually show up in headcount and infrastructure bills.

| Migration step | Effort | Watch out for |
| --- | --- | --- |
| New pipelines on Lakeflow | Low | Team momentum: easy to revert to old patterns. |
| Convert the top 3 painful pipelines | Medium | Different streaming/batch semantics in expressed dependencies. |
| Move expectations off external DQ tools | Medium | Existing alerting wired to the old framework. |
| Retire imperative orchestrator | High | External callers (BI tools, ML jobs) that triggered DAGs directly. |

7. Where Declarative Still Hurts: Honest Limitations

I'd be lying if I said this was free. The places where the declarative model still bites:

- Procedural logic doesn't fit. If your "pipeline" is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table.
- Cross-pipeline orchestration is its own thing. Lakeflow Pipelines builds the DAG inside a pipeline. If you need pipeline A to wait for pipeline B, you still need Lakeflow Jobs above them.
- Debugging shifts from steps to definitions. When something is wrong, you're not stepping through a script; you're reading the event log and figuring out which expectation or upstream table caused it. The tooling is good; the muscle memory is different.
- Cost can surprise you. Auto-scaling on a misbehaving streaming source has the same risk it always has. Set max workers thoughtfully on day one; don't leave it to defaults.

Conclusion

The shift to declarative pipelines isn't really about syntax. It's about who owns the boring parts. In an imperative pipeline, the team owns the order, the retries, the lineage, the quality checks, and the cluster scaling, and pays in headcount when any of those break. In a declarative pipeline, those become properties of the engine, and the team owns the part that's actually interesting: the table definitions and the business logic.
Lakeflow is the cleanest implementation of that idea I've used in production, and the teams I've watched migrate haven't asked to go back.
Between December 22, 2025 and January 15, 2026, an attacker spent 24 consecutive days inside Navia Benefit Solutions' systems. They quietly and methodically pulled Social Security numbers, dates of birth, health plan enrollment details, and COBRA records belonging to 2,697,540 Americans. They include teachers, state workers, and school administrators: people who signed up for employer benefits through HR software and had no idea which third-party company held their data. Navia didn't catch the intrusion while it was under way; it discovered it eight days after the attacker had already stopped. The company published a breach notice on March 13, 2026. Individual notification letters went out on March 18, eighty-six days after the intrusion began.

The technical cause was not sophisticated. A BOLA vulnerability in Navia's API allowed an authenticated user to manipulate request identifiers and retrieve records belonging to other participants. Change a number in the API parameter, return a different person's record. The attack required no zero-day exploit. No social engineering. No supply chain compromise. Just an API that checked whether you were logged in and never asked whether the record you were requesting was yours. That's the breach that cost 2.7 million Americans their healthcare data and personal identifiers in early 2026. And it's not an outlier.

I've spent the last eighteen months studying API breaches in depth: formal postmortems, SEC disclosure filings, state attorney general notification records, security research writeups, and direct conversations with incident responders who cleaned up the aftermath. The sample spans healthcare, fintech, retail, SaaS platforms, government infrastructure, and consumer applications. More than fifty incidents, analyzed at structural depth. The technologies differ. The industries differ. The victim organizations range from county governments to billion-dollar enterprises. The mistakes are, with remarkable consistency, the same five.
This is not a vulnerability catalog. It is a pattern analysis. And the pattern points to something the industry has been reluctant to say plainly: most API breaches are not caused by sophisticated attackers. They are caused by undisciplined defenders repeating failures the field already knows how to prevent.

The Infrastructure That Cannot Afford to Fail Quietly

Before the patterns, the scale of the problem requires a precise frame, not as context-setting but because the numbers explain why discipline failures at this layer are so consequential. API incidents now account for over 30% of all data breaches, up from less than 20% two years ago. API breaches expose an average of more than 2.5 million records per incident, significantly higher than traditional breaches. 38% of organizations discovered API breaches only after external reporting, not internal detection.

That last figure is the one that should stop readers cold. More than a third of organizations learn about API breaches from someone other than their own security team. From a reporter. From a researcher submitting a bug bounty report. From a law enforcement notification. From a dark web listing of their customers' data, already sold. The Navia incident was consistent with the 38%: the company discovered the intrusion eight days after the attacker had already stopped accessing systems. By the time Navia detected anything, the data was gone, and the window for limiting exposure had closed.

APIs have become the operational substrate of modern software. A mobile banking application's backend is a collection of APIs. A SaaS platform's data sharing is API-mediated. An AI agent answering customer queries calls APIs that call other services that query databases through yet more APIs. The attack surface isn't just large; for most organizations, it's partially unmapped. Endpoints built by contractors and never formally decommissioned.
APIs generated by AI coding tools without the security review human-written code receives. Internal service APIs that were never intended to face external traffic and ended up there anyway. 56% of enterprises admit they lack full visibility into their API data flows. The thing they can't see is the thing that's being exploited.

Pattern One: Authentication and Authorization Are Not the Same Concept, Yet the Industry Keeps Treating Them as If They Are

The Navia breach has a precise technical name: Broken Object Level Authorization. It has been the number-one entry on the OWASP API Security Top 10 since 2019. It accounted for a Parler breach that exposed 70 terabytes of user data. It drove the USPS vulnerability that sat unpatched for over a year after a researcher reported it, and was only fixed after journalist Brian Krebs published the story. It accounts for over 40% of API vulnerabilities today. Seven years. Number one. Still behind more than 40% of API vulnerabilities.

BOLA persists for structural reasons, not out of ignorance. Engineering teams understand the distinction intellectually. The failure is in the architectural gap between understanding it and enforcing it consistently across every endpoint, every integration, and every API built under deadline pressure by developers who know they should implement the ownership check and don't always do it.

Authentication verifies: Who is making this request?

Authorization verifies: Does this specific identity have permission to access this specific object?

These are different questions. Authentication is typically enforced at a framework or middleware layer: configured once, centrally, applied everywhere. Object-level authorization is implemented per-endpoint, by the individual engineer who wrote that endpoint, with whatever understanding of the ownership model they had on the day they wrote the code.
The structural asymmetry produces an architectural guarantee: authentication will be applied consistently because it's centralized; authorization will be applied inconsistently because it isn't. The attack is elementary:

    WHAT THE API DOES:
      GET /api/v1/benefits/participant/883441 → 200 OK
      { ssn: "XXX-XX-4291", dob: "1979-03-14", plan: "FSA" }
      (your record: you're authenticated, you can see this)

    WHAT BOLA ALLOWS:
      GET /api/v1/benefits/participant/883442 → 200 OK
      { ssn: "XXX-XX-7738", dob: "1984-11-02", plan: "COBRA" }
      (someone else's record: you're authenticated, but this isn't yours)
      GET /api/v1/benefits/participant/883443 → 200 OK   ← and again
      GET /api/v1/benefits/participant/883444 → 200 OK   ← and again
      ... × 2,697,540

    WHAT SHOULD HAPPEN:
      GET /api/v1/benefits/participant/883442 → 403 Forbidden
      (request fails ownership check: token owner ≠ record owner)

The fix is a single check, applied at the data access layer before the record is returned: does the authenticated identity own or hold explicit permission for the requested object? That check is architecturally simple. It takes minutes to write for a given endpoint. Applied to every endpoint, consistently, across a codebase that spans dozens of services and years of development history, it requires organizational discipline that companies apparently find harder to sustain than it sounds.

Authorization checks for individual resources are usually too fine-grained to offload to centralized platforms like API gateways or IAM products. The responsibility sits with API developers to implement the proper checks at the API endpoint.

That sentence explains why BOLA is still happening in 2026. There is no platform that catches it automatically. No gateway configuration that prevents it. No WAF rule that blocks it.
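As a rough illustration of the ownership check described above, the sketch below places it inside a data access class so that every caller goes through it, then exercises it with the kind of adversarial request an attacker would send. Everything here is hypothetical and chosen for illustration: `Identity`, `ParticipantStore`, `handle_get_participant`, the `owner_id` field, and the `can_read_all` permission flag are assumptions, not Navia's system or any particular framework's API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Identity:
    """The authenticated caller (hypothetical shape)."""
    participant_id: int
    can_read_all: bool = False  # stand-in for an explicit, audited permission


class Forbidden(Exception):
    """Caller is authenticated but not authorized for the requested object."""


class ParticipantStore:
    """Data access layer: the only code path that returns participant records."""

    def __init__(self, records: Dict[int, dict]) -> None:
        self._records = dict(records)

    def get(self, caller: Identity, participant_id: int) -> dict:
        record = self._records.get(participant_id)
        # Object-level authorization: the caller must own the record or hold
        # explicit permission. Missing records get the same response, so the
        # endpoint does not reveal which IDs exist.
        owns = record is not None and record["owner_id"] == caller.participant_id
        if record is None or not (owns or caller.can_read_all):
            raise Forbidden(f"participant {participant_id} not accessible")
        return record


def handle_get_participant(
    store: ParticipantStore, caller: Identity, participant_id: int
) -> Tuple[int, dict]:
    """Thin controller: translates the data-layer decision into a status code."""
    try:
        return 200, store.get(caller, participant_id)
    except Forbidden:
        return 403, {"error": "forbidden"}


# Adversarial check: an authenticated user asks for someone else's record.
store = ParticipantStore({
    883441: {"owner_id": 883441, "plan": "FSA"},
    883442: {"owner_id": 883442, "plan": "COBRA"},
})
caller = Identity(participant_id=883441)

own_status, _ = handle_get_participant(store, caller, 883441)
other_status, _ = handle_get_participant(store, caller, 883442)
assert own_status == 200
assert other_status == 403  # this assertion fails if the ownership check is missing
```

Because the check lives in `ParticipantStore.get` rather than in the controller, a second endpoint or background job that reads the same records inherits it automatically. The two assertions at the end are the smallest possible version of a test written to fail when authorization is absent.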
The check has to be written by engineers who know what correct authorization looks like for this specific system, tested by security engineers who know how to probe for its absence, and validated adversarially in CI/CD rather than assumed to exist because someone believes they wrote it.

BOLA sits at the top of the OWASP API Security Top 10. It's been the most common API vulnerability for years. Every API security guide warns about it.

The organizations still producing these breaches aren't unaware of BOLA. They're applying the authorization check inconsistently, without tests, and without the adversarial test suite that would catch a missing check before an attacker does.

Pattern Two: Trust Relationships Accumulate Silently While Security Visibility Stays Static

The 700Credit breach, disclosed in early 2026 and subject to consolidated federal litigation by February of that year, traced to a compromise through a third-party integration partner. An exposed API enabled the extraction of consumer data (Social Security numbers, credit information) belonging to approximately 5.6 million individuals. The API existed because a third-party integration required it. The third party was compromised. The access chain from the compromised partner to the sensitive consumer records was shorter than anyone had documented.

Third-party APIs exposed millions of records at 700Credit, while weak airline API authentication fueled mass access at Qantas.

Third-party integrations now represent the initial access vector in more than a quarter of API breaches. The mechanism isn't exotic: every integration creates a trust relationship, and trust relationships accumulate faster than the security reviews that should accompany them. Consider what happens to an organization's integration landscape over two years of normal product development. A partner API is connected for a feature that shipped and drove modest adoption. The API integration remains active; the feature is no longer actively developed.
A contractor builds an internal service integration for a project that was completed and handed off. The service account credential used by that integration was never revoked. A third-party data enrichment vendor is added to the user onboarding flow with read access to customer records. Six months later, the enrichment vendor updates its API client library, and an engineer upgrades the dependency without reviewing the new permission scope. None of these represents malicious action or negligent individual decisions. They represent the natural accumulation of a complex integration landscape under continuous development, without the organizational process to maintain security visibility at pace with that development. Machine identities — credentials that authenticate services, workloads, and devices — outnumber human identities by more than 45 to 1, according to CyberArk. The proliferation of static keys, long-lived tokens, and embedded credentials has led to uncontrolled secrets sprawl across codebases, repositories, and collaboration tools. Machine identities don't appear in quarterly access reviews. They don't get deprovisioned when a project ends or when the engineer who created them changes roles. They don't trigger MFA prompts. When a machine identity is compromised — whether through a leaked credential or a supply chain attack on the service using it — the blast radius is often substantially larger than any individual's human identity would have been, because the service account may have been provisioned with elevated permissions for a project requirement that no longer exists. 
The structural fix requires treating machine identity governance with the same rigor as human identity governance: defined business purpose at provisioning, periodic review against defined staleness criteria, automated detection of credentials operating outside their documented scope, and revocation procedures that can be executed without requiring the engineer who originally created the credential to be in the loop. Most organizations are three to five years behind on this. The incident record reflects it.

Pattern Three: Secrets Leak Into Every Surface, and Almost Nobody Rotates Them

28.65 million new hardcoded secrets were added to public GitHub commits in 2025 alone, a 34% increase year over year and the largest single-year jump GitGuardian has recorded. That number deserves a full stop.

Secret leak rates in AI-assisted code were, on average across the year, roughly double the GitHub-wide baseline. AI service credential leaks increased 81% year over year, to 1,275,105. Claude Code-assisted commits leaked secrets at approximately 3.2%, twice the baseline.

The acceleration has a specific mechanism. AI coding tools have lowered the barrier to building API integrations, which is mostly good. They've simultaneously created a new class of developer (experienced in product and logic, less experienced in security conventions) who builds quickly and may not know that the API key they copied from the project documentation should go into a secrets manager rather than the .env file committed alongside the rest of the project.

Across 6,943 systems, GitGuardian identified 294,842 secret occurrences corresponding to 33,185 unique secrets. On average, each live secret appeared in eight different locations on the same machine, spread across .env files, shell history, IDE configs, cached tokens, and build artifacts. 59% of compromised machines were CI/CD runners, not personal laptops.
The CI/CD figure is where the pattern becomes structurally dangerous rather than merely careless. A secret on a developer's laptop is an individual exposure. A secret on a CI/CD runner is accessible to every process that executes in that environment, including processes introduced through supply chain attacks. The LiteLLM supply chain attack demonstrated this pattern concretely: compromised packages harvested SSH keys, cloud credentials, and API tokens from developer machines where AI development tooling had concentrated credentials.

MCP configuration files are a new and largely unmonitored leak surface. In 2025, 24,008 unique secrets were exposed in MCP-related configs on public GitHub, with 8.8% confirmed valid at the time of detection.

The remediation gap transforms bad leak rates into chronic exposure. Nearly 70% of credentials confirmed as valid in 2022 were still valid in January 2025. When retested in January 2026, the validity rate was still above 64%. Three years of known exposure. More than six in ten credentials still live. The detection is working; the remediation isn't. Some organizations deploy secret scanning without building the organizational process to act on findings: rotating credentials on a defined timeline, identifying every system using a given credential before revoking it, and treating found secrets as an urgent remediation item rather than an informational alert. Those organizations are doing the technical equivalent of installing smoke detectors and then watching the building burn.

Pattern Four: Monitoring Was Built to Watch the Infrastructure, Not the Behavior

In 2025, the global median attacker dwell time after initial compromise was 14 days, up from 11 days in 2024, according to Mandiant's M-Trends 2026 report. The interval between initial compromise and lateral movement fell to 29 minutes, a 65% acceleration from the previous year. In at least one case, data exfiltration began within four minutes of entry.

Fourteen days median dwell time.
Four minutes to exfiltration in the fastest case. The attacker's operational tempo in 2025 was faster than any previous year on record; the detection tempo moved in the wrong direction. The Navia breach ran for 24 days without triggering any internal detection. That's not exceptional — it's slightly above median. 34% of incidents had an unknown or undetermined initial vector, indicating significant gaps in logging and detection capabilities. The unknown-vector incidents are, by definition, the ones where the monitoring infrastructure failed to capture the access path entirely. The reason BOLA exploitation goes undetected for weeks is that it produces none of the signals that infrastructure monitoring was built to catch. The requests are correctly formed. The authentication succeeds. The responses return 200. The rate may be elevated, but elevated API request rates are also the signature of legitimate mobile applications, legitimate batch processing, and legitimate partner integrations under load. The only distinguishing characteristic — that the object IDs being queried belong to other users — requires business logic context that standard monitoring infrastructure doesn't have. You cannot investigate data you never collected. The more consequential version of that principle is: you cannot detect anomalies against a baseline you never defined. Application-layer attacks — exploits targeting web applications, APIs, and software supply chains — often fly under the radar because traditional security tools were not designed to see them, especially at runtime. API behavioral monitoring requires two things that most organizations have not built. First, a behavioral baseline per endpoint: what does legitimate usage look like for this specific API, this specific authentication context, this specific integration? What's the expected distribution of object IDs accessed per session? 
What rate of data retrieval is consistent with the documented business purpose of each authenticated identity? Second, anomaly definitions calibrated to those baselines: what specific patterns constitute evidence of enumeration or exfiltration rather than legitimate high-volume operation?

Baselines cannot be automatically inferred from traffic data without business logic context. They require human authorship: people who understand what the API is supposed to do, defining what legitimate usage looks like in operational terms. That work is unglamorous. It doesn't ship a feature. It doesn't close a compliance checkbox. It is the difference between detecting a breach in hour four and detecting it after the attacker has been gone for eight days.

Pattern Five: Security Is Defined as a Project With an End Date

Three major French retailers (Boulanger, Cultura, and Truffaut) experienced a coordinated API attack through their shared e-commerce backend in 2024. The breach stemmed from poorly configured API security rules.

One misconfiguration. Three companies compromised. Millions of customer records stolen. Shared infrastructure meant one vulnerability cascaded across all platforms.

The shared infrastructure attack surface is an example of what happens when security review occurs at deployment and isn't revisited as the integration architecture evolves. Each retailer's security posture changed when the shared backend was modified, when new partners connected, and when access control configurations were updated for a new feature. The review that approved the original configuration didn't cover those subsequent changes.

This is the fundamental failure of treating security as a project: projects have end dates. Security exposure doesn't. A penetration test produces a snapshot of a system as it existed during the two-week engagement window.
That snapshot is accurate when it's produced and becomes less accurate with each subsequent code deployment, configuration change, and new integration. Organizations that treat the pen test result as ongoing assurance (that consider security "done" until the next compliance cycle) are operating on a security posture that no longer accurately describes their actual attack surface.

Attackers don't operate on project timelines. They use automated scanning tools that find newly deployed endpoints, and the vulnerabilities in them, within minutes of deployment. The enterprise security review cycle typically runs quarterly or annually. The gap between "API deployed" and "API found by automated scanner" is measured in minutes. The gap between "API deployed" and "API reviewed by security team" is measured in months.

68% of organizations experienced an API security breach resulting in costs exceeding $1 million. The organizations accumulating that exposure are largely not the ones that skipped security entirely. They're the ones that did security once (at the right moment, with the right tools, producing the right findings) and then moved on.

The API Security Lifecycle: What Continuous Practice Actually Looks Like

The pattern analysis above points to a consistent structural need: security disciplines that operate continuously across the full API lifecycle, not at discrete compliance milestones.
The following framework, the API Security Lifecycle, organizes those disciplines into a model where security is a property the system continuously maintains, not a state the organization periodically verifies:

| Stage | What happens here | Breach pattern closed |
| --- | --- | --- |
| Design | Define the object ownership model before the first line of code is written. | Pattern 1: BOLA. Prevents broken object-level authorization by design, not just testing. |
| Design | Document machine identity scope at provisioning. | Pattern 2: Trust boundaries. Defines access limits before integrations go live. |
| Threat modeling | Map the BOLA surface by reviewing every endpoint that returns objects and assessing ownership enforcement. | Pattern 1: BOLA. Forces teams to identify authorization gaps before shipping. |
| Threat modeling | Audit trust boundaries by documenting every integration and its scope. | Pattern 2: Trust boundaries. Makes third-party attack surfaces visible before they become blind spots. |
| Development | Enforce BOLA checks at the data layer, not just the controller. | Pattern 1: BOLA. Makes ownership checks harder to bypass. |
| Development | Use secrets from a vault starting with the first commit, with enforcement during code review. | Pattern 3: Hardcoded secrets. Keeps credentials out of the repository. |
| Testing | Run an adversarial BOLA test suite for each endpoint in CI/CD on every push. | Pattern 1: BOLA. Validates every endpoint before it ships. |
| Testing | Add secret scanning to CI with a defined remediation SLA. | Pattern 3: Leaked secrets. Ensures leaks are rotated, not just detected. |
| Monitoring | Build behavioral baselines per endpoint with input from people who understand the API. | Pattern 4: Weak detection. Makes Navia-type enumeration detectable in hours, not weeks. |
| Monitoring | Tie anomaly definitions to ownership context, not just rate thresholds. | Pattern 4: Weak detection. Triggers alerts on enumeration behavior, not only traffic spikes. |
| Continuous validation | Automate API inventory so every live endpoint is known, documented, and reviewed. | Pattern 5: Unknown endpoints. Finds new endpoints before attackers do. |
| Continuous validation | Review trust relationships every 90 days with defined revocation criteria. | Pattern 2: Stale trust. Removes unnecessary integrations before they become attack paths. |
| Continuous validation | Enforce credential rotation automatically with documented rotation SLAs. | Pattern 3: Stale secrets. Reduces the risk of old or exposed credentials remaining valid. |

The framework's structure is intentional: every stage maps to a specific failure pattern, and every failure pattern is addressed at the stage where prevention is cheapest. BOLA is cheapest to address at design and development; catastrophically expensive to address after 2.7 million Social Security numbers have been exfiltrated. Secret exposure is cheapest to address at development, with vault-first discipline and code review enforcement; expensive to address after a compromised CI/CD runner has propagated credentials across build infrastructure.

At Design

The object ownership model gets written before the first endpoint is coded. Not as an afterthought, but as a specification that the authorization implementation must satisfy. The authorization model names every object type in the system, defines the ownership structure, and specifies the access control rules governing cross-user access. That specification becomes the adversarial test suite's source of truth.

At Threat Modeling

The BOLA surface gets mapped: every endpoint that returns an object, every parameter that could be manipulated, every authorization assumption that isn't yet validated. This doesn't need to be a multi-week engagement. For a new API, a focused 90-minute session with the engineering team produces a complete BOLA surface map and surfaces the authorization assumptions that need explicit testing.

At Development

The ownership check lives at the data access layer, not at the controller layer, where a bypass path might exist.
A controller-layer check can be bypassed if there's a second code path to the same data. A data layer check cannot. This architectural discipline requires a conversation during design, not during code review. At Testing The adversarial BOLA suite runs in CI/CD on every push. Not once a quarter during a security review — on every push. The suite consists of tests written to fail if authorization is absent: authenticated requests for objects the test user doesn't own, verifying that the response is 403 rather than 200. These tests are not generated by scanners. They are written by engineers who know the ownership model, because ownership model knowledge is not accessible to automated scanning tools. At Monitoring Behavioral baselines per endpoint are authored, not inferred. For the Navia breach scenario, a baseline that defined expected participant record access as "1-3 records per authenticated session, with alert threshold at 15 distinct participant IDs in a 60-minute window" would have triggered an anomaly detection response within the first hour of the 24-day access window. The attacker would not have had weeks of silent operation; they would have triggered a human investigation while the breach was still recoverable. At Continuous Validation Security review becomes a property that the system maintains continuously, not a milestone that occurs at fixed intervals. API inventory automation catches new endpoints before they go through a full quarter unreviewed. Trust relationship reviews on a defined cadence — 90 days is a reasonable default — ensure that stale integrations and credentials don't survive long enough to be exploited. Credential rotation with automated enforcement ensures that the 2022 leaked secrets that are still valid in 2026 don't remain valid in 2027. What the Next Three Years of API Security Look Like The five patterns described above operate against the current API attack surface. 
The emerging surface stresses those patterns further and creates new failure modes that the field is only beginning to grapple with. AI-generated APIs are the newest expansion of the BOLA surface. AI coding tools that scaffold endpoint logic do so quickly and efficiently, and at double the baseline secret leak rate. Whether those endpoints enforce object-level authorization correctly is a function of the prompts used to generate them, the review those prompts received, and the adversarial test coverage applied afterward. Organizations that have embedded security requirements into their AI coding tool configurations — ownership check as a required component of every endpoint scaffold, secrets-in-vault as a non-negotiable default — are addressing this. Organizations that are using AI coding tools as productivity accelerators without corresponding security configuration adjustments are building the BOLA surface of 2027. Agent-to-agent APIs are creating authorization chains that most API security practices weren't designed to evaluate. When an AI agent makes a tool call that calls an API that calls another service, the authorization context propagates through multiple hops. Whether each hop enforces the ownership model correctly, and whether the aggregate chain produces authorized outcomes even when individual hops appear compliant, requires analysis at the orchestration boundary that current API security tooling doesn't perform. This is not a solved problem. The breach categories it will produce are already structurally predictable. Machine identity sprawl will continue to grow faster than machine identity governance. Since 2021, secrets have been growing roughly 1.6 times faster than the active developer population. Every AI agent deployment creates non-human identities with scoped permissions. Those identities accumulate. 
The credential management failure that produced the current breach record will produce a larger breach record when the number of machine identities per organization doubles again. Real-time risk assessment — dynamically adjusting API access based on behavioral context, identity posture, and request risk signals — represents where the field needs to move. Continuous authorization rather than static permission grants. Access decisions that incorporate session history, anomaly signals, and behavioral baseline deviation. This is architecturally ambitious and requires the behavioral monitoring foundation that Pattern Four identifies as currently absent from most deployments. The prerequisite for all of these advanced capabilities is getting the five fundamentals right first. Zero-trust architectures built on top of authorization logic that doesn't enforce ownership checks are security theater. Advanced anomaly detection built on top of monitoring that has no behavioral baselines is expensive noise generation. The advanced work only creates value if the foundational discipline exists. The Pattern Is the Point The Navia breach didn't require a sophisticated attacker. It required an enumerable resource identifier and the absence of an ownership check. The same technique that worked against Parler in 2021, against USPS before that, against Spoutible, against Optus. The technique hasn't changed because the foundational failure it exploits hasn't been corrected at the organizational level. The five 2025 API security incidents are not the result of exotic exploits, but of fundamental security omissions. From forgotten legacy endpoints and broken authorization to excessive data exposure, they prove that the greatest threats lie in what is unmanaged, untested, and untracked. The industry has a framing problem. Every major breach gets treated as a novel incident requiring a novel analysis. 
The technical specifics differ; the structural failures underneath them are the same five patterns, in different combinations, producing different consequences. Treating each incident as sui generis means the field never builds the pattern recognition that would let organizations address the root cause rather than the surface symptom.

Security maturity begins when organizations stop analyzing each breach individually and start recognizing the structural failures that keep producing them. The five patterns here are not predictions about where the next breach will come from. They are descriptions of the conditions present in most production API environments right now, conditions that produce predictable consequences when an attacker decides to look.

The Navia breach affected 2.7 million people. It was discovered eight days after it ended. The notification went out eighty-six days after it began. The vulnerability that enabled it has been the industry's number-one documented API risk for seven years.

The next one is already running. In an organization with excellent infrastructure monitoring, clean logs, and a security team that reviewed the codebase at launch. In a system where nobody wrote the adversarial authorization test that would have caught it. The data will be there in the logs. The pattern will be familiar. The prevention was always available.

References

- Navia Benefit Solutions breach disclosure (Maine AG filing, March 2026)
- 700Credit breach federal litigation records (February 2026)
- GitGuardian State of Secrets Sprawl 2025 and 2026
- Mandiant M-Trends 2026
- OWASP API Security Top 10 (2023 and 2025 editions)
- Equixly 2025 API Incident Analysis
- APIsecurity.io Top 5 API Vulnerabilities 2025
- CyberArk Machine Identity Management Report 2025
- SQ Magazine API Security Breach Statistics 2026
- Corelight Attacker Dwell Time Analysis (2026)
- SecurityWeek Navia breach reporting (March 2026)
In this article, we will look at how vector search works in Azure AI Search and how to use it as the retrieval layer in a Retrieval-Augmented Generation (RAG) system. The article is meant for software engineers. We will not stop at theory. We will build a small, working example that you can run on your own machine and follow along step by step. By the end, you will have a small document search service that takes a user question, finds the most relevant text using vector similarity, and prepares the context that you can pass to a language model.

Please note that Azure AI Search was earlier called Azure Cognitive Search. The service was renamed, but many older articles and code samples still use the old name. The concepts are the same. Let us begin.

What Is a RAG System, in Short

A RAG system has two main parts. The first part is retrieval. When a user asks a question, we search a knowledge base and pull out the most relevant pieces of text. The second part is generation. We pass these pieces of text, along with the question, to a language model so that the model can answer using real, grounded information.

The quality of a RAG system depends heavily on the retrieval part. If retrieval returns the wrong text, the language model will produce a wrong or vague answer. This is the reason vector search matters. It allows us to retrieve text based on meaning, not only on keyword matching.

Why We Need Vector Search

Traditional keyword search matches exact words. If the user searches for "car" and the document says "automobile", a keyword search may miss it. Vector search solves this problem.

In vector search, we first convert each piece of text into a list of numbers called an embedding. Texts with similar meaning produce embeddings that are close to each other in vector space. When a user asks a question, we convert the question into an embedding as well, and then we find the stored embeddings that are nearest to it. This is called nearest neighbor search.
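To make the idea concrete, here is a minimal sketch of nearest neighbor search in plain Python. It is not how Azure AI Search is implemented. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the function names are hypothetical. The search scores every stored vector against the query using cosine similarity and keeps the best matches.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, stored, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

query_vector = [0.88, 0.12, 0.02]  # stands in for the embedded user question
top = nearest_neighbors(query_vector, toy_embeddings, k=2)
print(top)
```

In this toy data, "car" and "automobile" point in almost the same direction, so both rank above "banana" even though the two words share no characters.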
Azure AI Search supports this by allowing you to define a vector field in your index. You store the embedding in this field, and the service builds a structure that can search through many vectors quickly.

How Azure AI Search Performs Vector Search

Azure AI Search supports two algorithms for vector search. The first is HNSW (Hierarchical Navigable Small World), which is an approximate nearest neighbor (ANN) algorithm. It is fast and is the recommended choice for most production workloads. The second is exhaustive KNN, which compares the query against every stored vector. It is exact but slower, and it is mainly useful for small data sets or for measuring the accuracy of the approximate method.

The following table compares the two algorithms.

| Algorithm | Type | Speed on large data | Recall | Best suited for |
|---|---|---|---|---|
| HNSW | Approximate (ANN) | Fast | Very high and tunable | Most production workloads and large indexes |
| Exhaustive KNN | Exact | Slow on large data | Exact (100 percent) | Small data sets or measuring ground-truth accuracy |

In Azure AI Search, the algorithm is not attached directly to a field. Instead, you define a vector search configuration that contains a list of algorithms and a list of profiles. A profile gives a name to a chosen algorithm. Each vector field then refers to a profile by its name. This extra layer makes it easy to reuse the same algorithm settings across several fields.

The vector field itself uses the type Collection(Edm.Single), which is a collection of single-precision floating-point numbers. You must set the number of dimensions on this field, and this number must match the output size of your embedding model.

The Architecture of Our RAG System

Before writing code, let us look at the full picture. There are two paths. The ingestion path runs once (or whenever your data changes) and fills the index. The query path runs every time a user asks a question.

Please note one important detail. The same embedding model must be used in both paths.
If you embed your documents with one model and your queries with another, the vectors will not be comparable, and the search results will be meaningless. The Data Model When we store a document for RAG, we usually do not store the full document as one record. We split it into smaller chunks because a smaller chunk gives more focused retrieval and fits better inside the language model prompt. Each chunk becomes one document in the Azure AI Search index, and each document holds the chunk text, its embedding, and some metadata for tracing the result back to its source. The following entity relationship diagram shows how a source document relates to chunks, and how each chunk is stored as one search document in the index. In our small example, our documents are already short, so we will treat each document as a single chunk. In a real system, you would add a chunking step, but the structure of the index will remain the same. Hands-on Tutorial Now we will build the system. I am assuming you have an Azure subscription, Python (version 3.9 or above), and basic familiarity with running Python scripts. We will create the embeddings on our own machine using a small open model, so that you do not need to set up an Azure OpenAI deployment to follow along. I will explain at the end how to switch to Azure OpenAI embeddings if you prefer. Step 1: Create an Azure AI Search Service In the Azure portal, create a resource of type "Azure AI Search". The free tier is enough for this tutorial. After the service is created, open it and note down two values from the portal. The first is the service endpoint, which looks like https://<your-service>.search.windows.net. The second is an admin key, which you will find under the "Keys" section. We will use these in our code. Please keep the admin key secret. In a real project, you would store it in an environment variable or in Azure Key Vault, not in the source code. 
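As a minimal sketch of the environment-variable approach, the helper below reads both values and fails early if either one is missing. The variable names AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_ADMIN_KEY and the function name are assumptions made for this example; the SDK does not require them.

```python
import os


def load_search_settings(env=os.environ):
    """Read the Azure AI Search endpoint and admin key from environment variables.

    The variable names are assumptions for this sketch, not names the SDK requires.
    """
    endpoint = env.get("AZURE_SEARCH_ENDPOINT")
    admin_key = env.get("AZURE_SEARCH_ADMIN_KEY")
    missing = [
        name
        for name, value in (
            ("AZURE_SEARCH_ENDPOINT", endpoint),
            ("AZURE_SEARCH_ADMIN_KEY", admin_key),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return endpoint, admin_key


# Example with an explicit mapping so the sketch runs without real credentials.
# In a real script you would call load_search_settings() with no arguments.
endpoint, admin_key = load_search_settings({
    "AZURE_SEARCH_ENDPOINT": "https://example.search.windows.net",
    "AZURE_SEARCH_ADMIN_KEY": "placeholder-key",
})
```

The code in the following steps uses literal placeholder strings for readability; you can swap them for a call like this one.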
Step 2: Install the Python Libraries

```shell
pip install azure-search-documents sentence-transformers
```

Step 3: Create the Vector Index

Now we create an index that has a vector field. We define the field, the vector search configuration (the algorithm and the profile), and then we create the index.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
)

endpoint = "https://<your-service>.search.windows.net"
admin_key = "<your-admin-key>"
index_name = "rag-docs"

index_client = SearchIndexClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(admin_key),
)

# 1. Define the fields. The embedding field is the vector field.
fields = [
    SimpleField(name="doc_id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SimpleField(name="source", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=384,  # must match the embedding model
        vector_search_profile_name="my-vector-profile",
    ),
]

# 2. Define the vector search configuration: one algorithm and one profile.
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw",
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric="cosine",
            ),
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-profile",
            algorithm_configuration_name="my-hnsw",
        )
    ],
)

# 3. Create (or update) the index.
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
index_client.create_or_update_index(index)
print("Index created:", index_name)
```

Let me explain the choices here.
The vector_search_dimensions is 384 because that is the output size of the embedding model we will use. The vector field points to a profile named my-vector-profile, and that profile points to the HNSW algorithm named my-hnsw. The metric is cosine, which is a common choice for text embeddings.

The next table explains the HNSW parameters and their default values in Azure AI Search.

| Parameter | Where it is set | What it controls | Default and range |
|---|---|---|---|
| m | Algorithm configuration | Number of bi-directional links each node keeps in the graph | Default 4, range 4 to 10 |
| efConstruction | Algorithm configuration | Number of candidate neighbors examined while building the graph | Default 400, range 100 to 1000 |
| efSearch | Algorithm configuration | Number of candidate neighbors examined during a query | Default 500, range 100 to 1000 |
| metric | Algorithm configuration | The distance metric | cosine; other values are dotProduct, euclidean, and hamming |

The default values are a reasonable starting point. You can tune them later based on your recall and latency requirements.

Step 4: Embed and Upload the Documents

Now we load the embedding model, create a small knowledge base, generate embeddings, and upload the documents to the index.
```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from sentence_transformers import SentenceTransformer

# endpoint, admin_key, and index_name are the values defined in Step 3.
search_client = SearchClient(
    endpoint=endpoint,
    index_name=index_name,
    credential=AzureKeyCredential(admin_key),
)

# Load the embedding model (produces 384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Our small knowledge base
documents = [
    {"doc_id": "1", "source": "billing-faq",
     "text": "You can update your payment method from the account settings page under Billing."},
    {"doc_id": "2", "source": "billing-faq",
     "text": "Refunds are processed within five to seven business days to the original payment method."},
    {"doc_id": "3", "source": "shipping-faq",
     "text": "Standard delivery takes three to five working days within the country."},
    {"doc_id": "4", "source": "account-faq",
     "text": "To reset your password, click the forgot password link on the login screen."},
    {"doc_id": "5", "source": "shipping-faq",
     "text": "International orders may take up to fourteen working days depending on customs."},
]

# Create embeddings for the text of each document
texts = [d["text"] for d in documents]
vectors = model.encode(texts)

# Attach the embedding to each document and upload
to_upload = []
for doc, vector in zip(documents, vectors):
    to_upload.append({
        "doc_id": doc["doc_id"],
        "source": doc["source"],
        "text": doc["text"],
        "embedding": vector.tolist(),
    })

result = search_client.upload_documents(documents=to_upload)
print("Uploaded", len(result), "documents.")
```

When you run this script, it will upload five small documents into the index. In a real project, you would read documents from files or a database, split them into chunks, and upload thousands or millions of documents using the same upload_documents method, usually in batches.

Step 5: Search Using a Query Vector

Now we write the retrieval function. We embed the user question with the same model, then we run a vector query.
The VectorizedQuery object tells Azure AI Search which vector to search with, how many neighbors to return, and which field to search against.

```python
from azure.search.documents.models import VectorizedQuery


def search(question, k=3):
    # Embed the question with the same model
    query_vector = model.encode([question])[0]

    vector_query = VectorizedQuery(
        vector=query_vector.tolist(),
        k_nearest_neighbors=k,
        fields="embedding",
    )

    results = search_client.search(
        search_text=None,  # pure vector search
        vector_queries=[vector_query],
        select=["doc_id", "source", "text"],
    )

    output = []
    for r in results:
        output.append({
            "score": r["@search.score"],
            "source": r["source"],
            "text": r["text"],
        })
    return output


for item in search("how do I get my money back", k=3):
    print(round(item["score"], 4), "|", item["source"], "|", item["text"])
```

Notice that the question uses the words "get my money back", but none of the documents contain these exact words. The most relevant document talks about refunds. Because vector search compares meaning and not keywords, the refund document should appear at the top of the results. This is the behavior we want in a RAG system.

Step 6: Build the RAG Prompt

The retrieval part is now complete. The final step is to take the retrieved text and build a prompt for the language model. We do not call any specific model here, because you may use Azure OpenAI, an Anthropic model, an OpenAI model, or any other. We only prepare the input.

```python
def build_prompt(question, k=3):
    hits = search(question, k=k)
    context = "\n\n".join(f"- {hit['text']}" for hit in hits)
    prompt = (
        "You are a support assistant. Use only the context below to answer "
        "the question. If the answer is not in the context, say that you do "
        "not have enough information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return prompt


print(build_prompt("how do I get my money back"))
```

The output is a prompt that contains the question and the most relevant pieces of text.
You would now send this prompt to your language model, and the model would generate a grounded answer. This is the complete retrieval-augmented generation flow, with Azure AI Search acting as the vector store and retriever.

Using Azure OpenAI Embeddings Instead

In this tutorial, we generated embeddings on our own machine. In many Azure projects, teams use Azure OpenAI embedding models instead, such as text-embedding-3-small. The change is small. You call the Azure OpenAI client to create the embedding, and you set the index dimensions to match the model (for example, 1536 for text-embedding-3-small at its default output size). The rest of the index and query code stays the same.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-azure-openai-key>",
    api_version="2024-10-21",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)


def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",  # your deployment name
    )
    return response.data[0].embedding
```

Azure AI Search also supports a feature called integrated vectorization. With this feature, the service itself calls an Azure OpenAI model to convert text into vectors at indexing time and at query time, so you do not have to generate the embeddings in your own code. This is convenient for larger pipelines, but the basic flow shown above is enough to understand how vector search works.

A Few Practical Notes

When you move from this small example to a real system, please keep the following points in mind. Choose your embedding model carefully, because it decides the dimension of your vectors and the quality of your retrieval. Add a chunking step so that long documents are split into focused passages. If you need both keyword matching and meaning-based matching, use hybrid search by passing a value to search_text together with the vector query; Azure AI Search will combine the keyword (BM25) results and the vector results.
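Azure AI Search merges the keyword and vector result lists using Reciprocal Rank Fusion (RRF). The sketch below shows the idea in plain Python; it is not the service's code. The function name and the two sample rankings are hypothetical, and the constant 60 is the value commonly used for RRF, treated here as an assumption.

```python
def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (c + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists for the same question, best match first.
keyword_ranking = ["2", "1", "5"]  # e.g. from BM25
vector_ranking = ["2", "3", "1"]   # e.g. from the vector query

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents "2" and "1" appear in both lists, so they outrank "3" and "5", which each appear in only one.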
For even better ordering, you can enable the semantic ranker, which re-ranks the top results using a language model. Finally, monitor your index size, because vector indexes consume memory; if storage becomes a concern, look at vector compression and binary vectors.

Conclusion

We have seen what vector search is, why a RAG system needs it, and how Azure AI Search provides it through vector fields, the HNSW and exhaustive KNN algorithms, and the profile-based configuration. We then built a small but complete example: we created a search service, defined a vector index, embedded and uploaded a few documents, searched them by meaning, and assembled a RAG prompt. We also saw how to switch to Azure OpenAI embeddings. You can now extend this example with your own documents, a chunking step, hybrid search, and a language model of your choice to build a full RAG application.

References

- Quickstart: Vector search in Azure AI Search, Microsoft Learn: https://learn.microsoft.com/en-us/azure/search/search-get-started-vector
- Create a vector index, Microsoft Learn: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-create-index
- Vector search overview, Microsoft Learn: https://learn.microsoft.com/en-us/azure/search/vector-search-overview
- Hybrid search overview, Microsoft Learn: https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview
- Index binary vectors (memory optimization), Microsoft Learn: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-index-binary-data
- azure-search-documents Python SDK, PyPI: https://pypi.org/project/azure-search-documents/
- Python vector search sample, Azure SDK for Python (GitHub): https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/samples/sample_vector_search.py
- Azure AI Search vector samples, GitHub: https://github.com/Azure/azure-search-vector-samples
This article is part 4 of a 4-part series on 'Engineering Closed-Loop Graph-RAG Systems.'

The simplest way to evaluate a RAG system is to ask whether the generated answer is correct. That is not enough. A Graph-RAG system may return a correct answer for the wrong reasons: it may retrieve incorrect evidence and still guess the right thing. It may generate an excellent-sounding recommendation based on the wrong criteria. It may perform well in a small demo and then fail under high latency, out-of-date data, or the back-and-forth demands of a closed loop.

Answer-quality evaluation has its place for simple questions, but large-scale Graph-RAG workflow systems need multiple layers of evaluation. This article reviews a multi-layered evaluation methodology for both Graph-RAG systems and closed-loop LLM systems.

Layer 1: Retrieval Quality

Before evaluating the generated answer, determine what information the system retrieved. For flat (non-graph) RAG, retrieval evaluation typically examines which documents or chunks make up the top-k list. For graph-based approaches such as Graph-RAG, you also need to examine which nodes, edges, and paths were retrieved.

Some useful metrics are:

- Precision@k
- Recall@k
- MRR@k
- Node recall
- Edge recall
- Path correctness
- Evidence coverage

For example, if the user asked about missed escalation behaviors, you would expect the system to retrieve more than a generic troubleshooting document. You would expect to see a path along these lines:

```markdown
Interaction Record → Performance Gap → Escalation Policy → Training Resource → Assessment Item
```

When this path is missing from the evidence, the generated answer is likely to be weak, no matter how good it appears.
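The ranking metrics and the path check above can be sketched in a few lines of plain Python. This is a minimal illustration and not a full evaluation library. The function names and the sample data are hypothetical, and the path check assumes that a retrieved path is represented as an ordered list of node names.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 1.0
    hits = sum(1 for item in set(retrieved[:k]) if item in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(retrieved, relevant, k):
    """1 / rank of the first relevant item in the top k, or 0.0 if none.

    MRR@k is the mean of this value over all evaluation queries.
    """
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def contains_path(expected_path, retrieved_paths):
    """True if the expected node sequence appears contiguously in any retrieved path."""
    n = len(expected_path)
    return any(
        path[i:i + n] == expected_path
        for path in retrieved_paths
        for i in range(len(path) - n + 1)
    )


# Hypothetical retrieval output for the escalation example.
retrieved_nodes = ["Escalation Policy", "Password Reset Guide",
                   "Performance Gap", "Training Resource"]
relevant_nodes = {"Performance Gap", "Escalation Policy",
                  "Training Resource", "Assessment Item"}
expected_path = ["Interaction Record", "Performance Gap", "Escalation Policy",
                 "Training Resource", "Assessment Item"]
retrieved_paths = [["Interaction Record", "Performance Gap", "Escalation Policy"]]

print(precision_at_k(retrieved_nodes, relevant_nodes, k=3))
print(recall_at_k(retrieved_nodes, relevant_nodes, k=3))
print(reciprocal_rank_at_k(retrieved_nodes, relevant_nodes, k=3))
print(contains_path(expected_path, retrieved_paths))
```

In this sample, the node-level numbers look reasonable, yet the path check fails because the retrieved path stops at the escalation policy. That is the kind of gap that answer-only evaluation would miss.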
Layer 2: Relationship-Based Reasoning

Evaluate whether the graph's relationships actually improved the system's reasoning. Ask questions like:

- Did the system identify the correct entity?
- Did it traverse the right relationship?
- Did it avoid irrelevant neighboring nodes?
- Did it distinguish prerequisite, correlation, ownership, and policy relationships?
- Did it explain the evidence path clearly?

One of the most common failures is retrieving related nodes that do not contribute to the solution. Two nodes being adjacent does not mean both should influence the answer. For example,

```markdown
Account Lockout → Password Reset Guide
```

would be suitable for some forms of basic troubleshooting, whereas

```markdown
Repeated Account Lockout → Severity Signal → Escalation Policy
```

may be better suited for evaluating performance. Both paths are relevant, but one of them matches the question the user actually asked more closely.

Layer 3: Answer Generation Quality

After reviewing retrieval, review the quality of the generated response. For general responses, consider:

- Factual correctness
- Completeness
- Clarity
- Grounding in retrieved evidence
- Absence of unsupported claims
- Appropriate uncertainty

In addition to the items above, also consider:

- Fit to the detected problem
- Specificity
- Actionability
- Personalization
- Measurable next step
- Tone and usefulness

These two sets of criteria are distinct. An answer can be correct without being actionable. Conversely, a recommendation can be actionable without being properly grounded.

A useful recommendation format is:

```markdown
Finding: What issue was detected?
Evidence: What observation supports it?
Recommendation: What should happen next?
Measurement: How will improvement be verified?
```
That structure makes evaluation easier because each part can be checked separately.

Layer 4: Rule Compliance

Rule compliance can be critical to ensuring users receive appropriate recommendations. A recommendation can be factually correct and still violate an organization's policies, role boundaries, or other constraints. Some organizations also want rule compliance evaluated separately from answer quality. Here are examples of checks:

```markdown
Does the answer cite supporting evidence?
Does the recommendation match the user’s role?
Does it avoid unsupported claims?
Does it include a measurable next step?
Does it avoid resources the user already completed?
Does it require human approval?
```

Here is a simple evaluation record:

```json
{
  "response_id": "resp_2044",
  "answer_correct": true,
  "evidence_supported": true,
  "role_appropriate": true,
  "measurable_next_step": false,
  "overall_rule_compliance": false
}
```

Although this response is correct, it fails compliance because it lacks a measurable next step. This distinction is essential when judging whether a system is ready for commercial use.

Layer 5: Expert and User Value

Automated metrics are valuable, but they cannot fully replace expert judgment. Domain experts can identify problems in a system's recommendations that metrics miss. These problems typically fall into one of the following categories:

- The recommendation is technically correct but unrealistic.
- The evidence is weak.
- The system missed an important contextual clue.
- The response is too generic.
- The next step is measurable but not meaningful.

Use a simple scoring system. The following is a basic template:

```markdown
1 = Not useful or unsafe
2 = Partially relevant but weak
3 = Acceptable with edits
4 = Useful and mostly ready
5 = Strong, specific, and ready to use
```

If possible, collect reviewer comments rather than scores alone.
Comments show where to make improvements.

Layer 6: Latency and Dependability

Graph-RAG systems can become extremely slow if retrieval is not managed carefully. Measure latency in each of the following phases:

    Entity extraction latency
    Graph traversal latency
    Vector search latency
    Reranking latency
    Prompt construction latency
    LLM generation latency
    Rule validation latency
    Total response latency

Do not base decisions on averages alone. Track P50, P95, and P99 values as well. Low latency in a small-scale test does not guarantee low latency later: it can rise significantly as the graph grows, as retrieval becomes more complex, or as the validation rules become more complex.

Additionally, measure dependability through:

- Retrieval timeout rate
- Empty retrieval rate
- Entity linking failure rate
- Rule validation failure rate
- LLM retry rate
- Human escalation rate

An architecture may look sound on paper, but these rates show whether the design is actually operable.

Layer 7: Closed-Loop System Health

Closed-loop systems need their own evaluation. If your system uses feedback to learn, determine whether that learning is both safe and beneficial. Evaluate:

- Feedback volume by type
- Feedback classification accuracy
- Percentage routed to human review
- Approved vs. rejected graph updates
- Prompt or rule changes after feedback
- Rollback frequency
- Performance before and after updates
- Drift by domain or user segment

User ratings can be unreliable, so a feedback loop should be judged on more than whether ratings rise. Ratings may go up because responses have become more pleasing while the system's accuracy goes down. For high-stakes or structured workflows, expert-approved improvement matters more than raw engagement.
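The before-and-after comparison described above can be sketched as a small gate on feedback-driven updates. This is an illustration under stated assumptions, not a prescribed method: the `Snapshot` fields, the `tolerance` default, and the sample numbers are all hypothetical, and the gate deliberately ignores user ratings when deciding whether an update is safe.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics captured on a fixed evaluation set before or after an update."""
    user_rating: float   # mean user rating, e.g. on a 1-5 scale
    expert_score: float  # mean expert rubric score, e.g. on a 1-5 scale
    node_recall: float   # retrieval recall, between 0.0 and 1.0


def update_is_safe(before: Snapshot, after: Snapshot, tolerance: float = 0.02) -> bool:
    """Accept an update only if expert score and recall did not regress beyond tolerance."""
    return (
        after.expert_score >= before.expert_score - tolerance
        and after.node_recall >= before.node_recall - tolerance
    )


def ratings_mask_regression(before: Snapshot, after: Snapshot) -> bool:
    """True when user ratings rose while expert-judged quality fell."""
    return (
        after.user_rating > before.user_rating
        and after.expert_score < before.expert_score
    )


# Illustrative numbers only.
before = Snapshot(user_rating=3.9, expert_score=4.1, node_recall=0.82)
after = Snapshot(user_rating=4.3, expert_score=3.6, node_recall=0.75)

if not update_is_safe(before, after):
    print("Reject or roll back the update")
if ratings_mask_regression(before, after):
    print("User ratings improved while expert score dropped")
```

Keeping the two checks separate makes it possible to log the second condition as a warning signal even when an update passes the first.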
A Practical Evaluation Table

Here is a simple table structure teams can use:

| Evaluation Layer | Example Metric | Failure Example |
| --- | --- | --- |
| Retrieval Quality | MRR@10, node recall | Right answer, wrong evidence |
| Graph Reasoning | Path correctness | Wrong relationship used |
| Generation Quality | Expert score, groundedness | Unsupported claim |
| Rule Compliance | Rule pass rate | Missing measurable next step |
| Usefulness | Expert rating | Correct but too generic |
| Latency | P95 total response time | Graph traversal too slow |
| Feedback Loop Health | Approved update rate | Noisy feedback changing graph |

This table helps teams avoid over-indexing on one metric.

Example Evaluation Harness

Here is a lightweight evaluation structure:

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class EvalCase:
        """One test query with the graph elements and answer points it should produce."""
        query: str
        expected_nodes: List[str]
        expected_edges: List[str]
        expected_answer_points: List[str]
        required_rules: List[str]


    @dataclass
    class EvalResult:
        """Per-layer scores for a single case, kept separate on purpose."""
        node_recall: float
        edge_recall: float
        answer_score: float
        rule_compliance: float
        latency_ms: int


    def recall(expected: List[str], actual: List[str]) -> float:
        """Fraction of expected items present in actual; 1.0 when nothing is expected."""
        if not expected:
            return 1.0
        return len(set(expected) & set(actual)) / len(set(expected))


    def evaluate_case(case: EvalCase, system_output: dict) -> EvalResult:
        # Retrieval: did the system fetch the expected nodes and edges?
        node_recall = recall(case.expected_nodes, system_output["retrieved_nodes"])
        edge_recall = recall(case.expected_edges, system_output["retrieved_edges"])
        # Generation: exact-match coverage of the expected answer points.
        answer_score = recall(
            case.expected_answer_points,
            system_output["answer_points"]
        )
        # Rules: share of required rules that the response passed.
        rule_compliance = recall(
            case.required_rules,
            system_output["passed_rules"]
        )
        return EvalResult(
            node_recall=node_recall,
            edge_recall=edge_recall,
            answer_score=answer_score,
            rule_compliance=rule_compliance,
            latency_ms=system_output["latency_ms"]
        )

This is not a full evaluation framework, but it shows the principle: evaluate retrieval, graph reasoning, generation, rules, and latency separately.

Do Not Hide Limitations

One habit that improves trust is being explicit about limitations.
If the evaluation uses synthetic data, say so. If the system has not been tested in live production, say so. If expert review was limited to a small sample, say so. If the graph schema was manually designed, say so. This does not weaken the article or the system. It makes the work more credible. For example:

    This evaluation used a synthetic, expert-annotated dataset.
    The results are useful for comparing architecture variants,
    but they should not be interpreted as proof of production performance.

A statement like this helps readers understand the scope of your work.

Final Thoughts

Graph-RAG systems should be evaluated differently from a simple chatbot. Accuracy matters, but whether a system can reach a correct answer is only part of the equation.

- Can the system find the right nodes?
- Can it traverse the relationships correctly?
- Does it cite evidence properly?
- Does it follow the required rules?
- Will experts consider the recommended solution useful?
- Is the latency acceptable for the requirements?
- Will feedback change the system safely?

These are the questions that distinguish a promising demo from a production-ready system that supports real workflows.