TL;DR: Why A Former Micromanager Will Make AI Adoption Work Twenty years of Agile coaching failed to fix the micromanager who meddles with every draft, every meeting, every decision. This article shows where their distrust stops damaging teams and starts producing the verification work AI adoption actually needs. Welcome the Verification Architect! What Is a Verification Architect? A Verification Architect is the person responsible for deciding which AI tasks belong in Assist mode, which belong in Automate mode, and which belong in Avoid mode of the A3 framework; defining what review means in each mode; and running the verification loop that converts each AI failure into a sharper prompt, eval, or acceptance criterion. The role is not a compliance auditor: compliance asks whether rules were followed, while verification asks whether the system produces the claimed outcome under the conditions in which it operates. In smaller organizations, the work is often a responsibility carried by a Product Manager, Scrum Master, QA lead, or technical lead, rather than by someone holding the title. Learn more about why a micromanager might be an excellent fit for this role below. The Micromanager You know the type of manager: The micromanagers ask to see the draft before the team talks to the customer. They rewrite the acceptance criteria after refinement. They join the Slack thread “just to clarify” and leave with the decision back in their hands. They are not malicious. They genuinely believe the work needs their eyes before it ships. For 20-plus years, Agile coaches have tried to convince these people to trust the team, the people they hired themselves. The psychological safety workshops did not work. The servant-leadership reading lists did not work. Much of the coaching industry learned to work around this population and focus on the trainable middle. The micromanagers stayed. Now the same manager is being asked to delegate work to AI. They will not delegate without asking. 
But this time, their skepticism deserves a hearing. The Micromanagement Disposition Is Not the Defect There is a reason the AI industry uses the phrase human in the loop. Probabilistic systems running autonomously should not be trusted by default with consequential decisions in their current form. They hallucinate citations. They produce confident wrong code. They will follow an under-specified instruction into a wall and report success. The instinct to verify before accepting consequential output is not a defect in this domain. It is reliability engineering. This context exposes the problem with the standard Agile framing. Telling a chronic skeptic that they need to trust more works against the evidence. The skeptic micromanager looking at agentic AI sees what the engineers building it see: a powerful tool with known failure modes that has to be wrapped in observability, harnesses, evals, and verification before it produces reliable value. The skeptic’s posture toward AI is closer to reliability engineering than to the optimism that much AI adoption theater demands. Where the same instinct fails is with human colleagues, not because humans are reliably better than generative AI systems. Humans fail differently. The reason inspection often damages human work but can improve AI work is that inspection changes the system being inspected. People learn, adapt, withdraw, hide information, and protect themselves in response to how they are treated. Surveillance degrades the very capability the manager claims to protect. With AI, verification does not demotivate the model. The model produces what it produces, and the verification loop sharpens over time, as we feed back findings to improve prompts, skills, evals, constraints, and operating rules. From that perspective, the problem was never the micromanager’s distrust. The problem was where it was pointed: at humans. 
Two Patterns Wearing the Same Costume

Two very different micromanager motives can produce the same behavior. The distinction matters because they respond to different interventions, and one of them is genuinely useful in an AI context while the other is not:

- The first pattern shows up as authority maintenance: The distrust is about keeping the decision in the manager’s hands, not about improving the output. Ask this manager what would count as evidence that a teammate’s work is trustworthy, and the answer is often operational nonsense: “I need to see it first.” The verification, when it happens, is performative. What gets inspected is compliance, not risk. AI tooling does not help this person because they do not actually want better evidence. They want to be the one who decides.
- The second pattern shows up as accumulated experience: The distrust is grounded in specific past failures. This manager can describe in detail what they have seen go wrong, what was promised and not delivered, and which verification step was skipped before the failure. With human teammates, this manifests as micromanagement because verifying human judgment is socially costly. You cannot run a unit test on a colleague’s reasoning. So they over-supervise, the team feels controlled, and the relationship degrades. With AI, verification is structured and cheap. The same disposition that damages a team produces useful work when pointed at a probabilistic system that actually benefits from repeated checks.

A small diagnostic helps distinguish them:

| Question | Authority maintenance | Accumulated experience |
| --- | --- | --- |
| What would make this output trustworthy? | “I need to see it first.” | “It has to pass these three checks.” |
| What failure are you trying to prevent? | Vague loss of control. | A specific failure mode they can name. |
| When would you stop reviewing every step? | Never. | When the system demonstrates reliability under defined conditions. |
| What do you inspect? | The person’s compliance. | The work product’s risk. |
| What changes after your review? | The decision returns to me. | The system gets a sharper check, rule, prompt, or acceptance criterion. |

The difference is not whether the person distrusts. The difference is whether their distrust leaves behind better evidence, better criteria, and a sharper system, or merely a returned decision right.

This is not permission to allow the micromanager to “direct” humans. Human work still needs verification, but the verification must be designed as a social contract: clear intent, explicit constraints, agreed-upon review points, and decision rights that do not silently migrate upward whenever the manager feels anxious. The same person who becomes useful in AI verification may still be destructive in a team context if they cannot make that shift. The disposition is not the license. The redirected target, however, provides a new perspective for the micromanager.

A3 Is the Sorting Mechanism

The A3 Framework (Assist, Automate, Avoid) is one way to test which pattern you are looking at. Authority maintenance can fill in the A3 boxes. It cannot use A3 honestly. The answers stay vague, reversible, and dependent on the micromanager’s comfort rather than on named risks. The accumulated-experience pattern can categorize a task in seconds, because the suspicion is grounded in specific past failures that map to specific risk profiles.

In Assist, where AI drafts and a human decides, the contribution is defining what a genuine review looks like. Most teams using AI in Assist mode are rubber-stamping. The experienced skeptic refuses to. They will read the draft and tell you which two of the five suggestions contradict a constraint the model could not have known about.

In Automate, where AI executes under explicit rules and audit cadences, the same person designs the audit. They will write the acceptance criteria with teeth, the failure modes worth alerting on, the rollback conditions, and the sample size for the weekly check.
The team may look slower for two weeks because the work is finally visible. Six months later, that visibility is what prevents the incident everyone else would have called “unexpected.”

In Avoid, where AI should not be used at all, the skeptic is the person qualified to make the call. Most organizations lack this authority. Optimistic adopters struggle to say no. Blanket skeptics say no too cheaply. The experienced skeptic can distinguish a stakeholder relationship in which one wrong AI-drafted phrase costs six months of trust from a low-stakes draft in which Assist is fine. In this case, the value is not the categorization but the decision authority. Many AI adoption initiatives lack a qualified person with the authority to say “we should not use this here,” and they produce predictable failure modes as a result.

Summary: AI Task Types and the Verification Mode Each Requires

- Bounded drafts that a human reviews: A3 mode: Assist. What the Verification Architect does: Defines the specific criteria the draft must pass before acceptance.
- Repeated execution under explicit rules: A3 mode: Automate. What the Verification Architect does: Designs audit cadences, rollback conditions, and drift detection.
- High-trust or irreversible work: A3 mode: Avoid. What the Verification Architect does: Protects the boundary against convenience-driven AI adoption.

Name the New Role for the Micromanager: The Verification Architect

The piece this article has been circling is that AI creates a role the Agile movement never learned to name. Call it the Verification Architect. A Verification Architect does not ask: “Can AI do this?” They ask: “What would have to be true for AI to do this safely, repeatedly, and measurably in our context?” Their unit of work is not the prompt.
It is the loop, the day-to-day work that compounds over months:

- Turn vague AI use cases into Assist, Automate, or Avoid decisions before anyone opens a prompt window.
- Define what review means in Assist mode, not as a vibe check, but as specific criteria the draft has to pass.
- Design audit cadences in Automate, including sample sizes, drift detection, and rollback conditions.
- Protect Avoid zones from convenience-driven erosion, which is the failure mode of every governance regime that lacks an enforcer.
- Convert each failure into a sharper prompt, a new eval, a tightened acceptance criterion, or an updated Definition of Done.
- Track drift over time, because models, data, and use cases all move.

In smaller organizations, this may not be a job title. It may be a responsibility carried by a Product Manager or Owner, a Scrum Master/Agile Coach, a QA lead, a product operations person, or a technical lead. The title matters less than the loop.

The Verification Architect is not a compliance role. Compliance asks whether the rules were followed. Verification asks whether the system produces the claimed outcome under the conditions in which it operates, with the named failure modes. The first is bureaucracy. The second is engineering judgment.

The role is not new in the strict sense. Reliability engineers, design verification architects, and rigorous product operations leaders have been performing this work on traditional software for years. What is new is the application to AI-enabled work systems in non-technical organizational settings, where agentic workflows with non-deterministic outputs and rapid deployment cycles make verification load-bearing rather than nice-to-have. The organizations that ship AI without this capability produce demos. The organizations that build it produce systems that compound.
The Work Inside the Dip The AI Spending Trap argued that organizations are often stuck in the J-curve dip because they buy tools and skip the intangible-capital investment that drives the eventual rise. The argument has a missing piece. The intangibles do not invest themselves. They need process redesign, retraining, restructuring, data plumbing and governance, and change management. Every category gets paid for by specific humans doing specific work. The part of the dip organizations most consistently underprice is verification work, eval design, output review, prompt or skill refinement, acceptance-criteria sharpening, and failure-mode cataloging. This is the place where the Verification Architect earns their salary. Done well, the loop becomes a compounding system. Each verification cycle encodes a little more organizational judgment about what good looks like in this specific context; the evals get sharper, and the acceptance criteria get more specific. The agent’s effective competence in this organization increases over time, not because the underlying model improves, but because the surrounding system encodes accumulated knowledge of where it fails. The trusting person ships v1 and moves on. The Verification Architect ships v1, watches it, catches the failures, refines the prompts, tightens the evals, updates the Definition of Done, and runs the loop again. Without this person, the deployment stays at v1 and degrades as conditions shift. With them, the system gets better while the headcount stays flat. That is the curve “The AI Spending Trap” described, and this is who pulls it upward. The work is currently underpriced. Eval design does not ship on Monday. Output review does not produce a launch announcement. Refining prompts in month four produces nothing that the quarterly board deck can show. That is exactly why the disposition is a competitive advantage for organizations that recognize it before the rest of the market does. 
A Warning About the Label The label “Verification Architect” will be hollowed out, as every useful role title in this industry eventually is. (Remember: Agile Coach, Product Owner, and Scrum Master?) Ask what the person last sent back for revision and why. Ask what they last protected from AI involvement and what would have to change for that decision to flip. Ask what their longest-running audit loop has caught. The genuine Verification Architect answers with names, dates, and specific failures. The fake one answers with frameworks and vocabulary. Conclusion: Move the Work, Not the Person If you have spent your career being told your skepticism was a problem, consider that the people telling you were trying to fit you as a micromanager into a role that does not need you. The agentic AI stack needs people who refuse to trust output they did not verify. It needs people who design the evals, who run the audit loop, who notice the failure that everyone else celebrated as a launch. The work is currently underpriced. That is the opportunity. The micromanager disposition was never the problem; shoehorning it into an unfitting role was. Pick a teammate you struggled to delegate to in the last six months. Pick an AI task that frustrated you in the same window. Compare the instructions you gave each. If the pattern is the same, you have found the problem. One system is being damaged by your inspection. The other may finally be receiving the discipline it needs. Does your distrust produce evidence, or does it merely preserve authority? My suggestion: Move the work, not the person. Key Questions This Article on Micromanagers Answers What Is a Verification Architect in AI Adoption? A Verification Architect is the person who decides which AI tasks belong in Assist, Automate, or Avoid mode, defines what review means in each mode, and runs the verification loop that converts each AI failure into a sharper prompt, eval, or acceptance criterion. 
Their unit of work is not the prompt; it is the loop. In smaller organizations, the responsibility may be carried by a Product Manager, Scrum Master, QA lead, or technical lead rather than someone holding the title. Why Do Micromanagers Struggle to Delegate to AI? Most do not, because their underlying distrust of probabilistic systems is engineering common sense, not a character defect. The reason inspection damages human teams but improves AI systems is that inspection changes the system being inspected: people adapt and withdraw under surveillance, models do not. The skeptic’s posture toward AI is closer to reliability engineering than to the optimism that much AI adoption theater demands. How Can I Tell If My Distrust Is Useful Verification or Authority Maintenance? Apply a five-question diagnostic. Useful verification can name a specific failure mode it prevents, define operational criteria for when to stop reviewing, assess the work product’s risk rather than the person’s compliance, and leave behind a sharper rule, prompt, or acceptance criterion after each review. Authority maintenance cannot answer those questions in operational terms; its only output is returning the decision to the reviewer. Who Does the Verification Work that Makes AI Adoption Compound over Time? The Verification Architect. The work includes eval design, output review, prompt and skill refinement, acceptance criteria sharpening, and failure-mode cataloging. Each cycle encodes more organizational judgment about what “good” looks like in a specific context, so the system’s effectiveness improves over time even when the underlying model does not. Without this person, deployments stay at v1 and degrade as conditions shift.
This is not "just another article about Springdoc," I promise. This is a ready-to-use recipe I was struggling to find one day, and had to build it from scratch. Have you ever needed to generate OpenAPI documentation directly from your code and, more importantly, do it in a way that fits cleanly into a CI pipeline? Swagger UI is commonly used in Spring Boot applications to visualize and test APIs from the browser. It can also expose the generated OpenAPI definition through a configurable endpoint, and that endpoint is exactly what we will use in this article. Why OpenAPI Documentation Matters Frontend Client Generation One of the most practical uses of OpenAPI documentation is automatic client generation. Tools such as OpenAPI Generator or Swagger Codegen can take an OpenAPI definition and produce TypeScript, JavaScript, or Java clients with very little manual effort. Mocking a Service Before It Is Ready In early development stages, a team may want to spin up a mock server before the real endpoints are fully implemented. Tools such as Mockoon or WireMock can use an OpenAPI specification to simulate the service. This is especially useful for frontend teams that need to move forward while backend work is still in progress. Verifying Contracts Between Services When multiple services depend on one another, compatibility becomes critical. OpenAPI documentation can be used together with tools such as Spring Cloud Contract to verify that both providers and consumers still conform to the agreed contract. The Manual Approach to Generating OpenAPI Documentation Let us start with a simple Spring Boot project. 
Add the following dependencies to pom.xml:

XML
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
</dependency>
<dependency>
    <groupId>org.springdoc</groupId>
    <artifactId>springdoc-openapi-starter-webmvc-ui</artifactId>
    <version>2.6.0</version>
</dependency>

Then add Springdoc configuration to application.yml:

YAML
springdoc:
  api-docs:
    path: /api-docs
    enabled: true
  swagger-ui:
    url: /api-docs
    enabled: true

Now create a simple REST controller:

Java
import io.swagger.v3.oas.annotations.tags.Tag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseStatus;
import org.springframework.web.bind.annotation.RestController;

import java.util.Set;

@RestController
@Tag(name = "default", description = "General API")
@RequestMapping("/api/v1/default")
public class WebRestController {

    private static final Logger log = LoggerFactory.getLogger(WebRestController.class);

    // Plain-text GET endpoint
    @GetMapping(produces = MediaType.TEXT_PLAIN_VALUE)
    @ResponseStatus(HttpStatus.OK)
    public String get() {
        log.info("GET method called");
        return "Hello!";
    }

    // Accepts plain text and returns it wrapped in a JSON array
    @PostMapping(
        consumes = MediaType.TEXT_PLAIN_VALUE,
        produces = MediaType.APPLICATION_JSON_VALUE
    )
    @ResponseStatus(HttpStatus.OK)
    public Set<String> post(@RequestBody String body) {
        log.info("POST method called");
        return Set.of(body);
    }
}

Finally, add a security configuration that allows access to both the REST API and to Swagger UI:

Java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.security.config.annotation.method.configuration.EnableMethodSecurity;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configurers.CsrfConfigurer;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
@EnableMethodSecurity
public class WebSecurityConfig {

    // Regular filter chain: docs, Swagger UI, and the demo API are public; everything else requires authentication
    @Profile("!openapi")
    @Bean
    public SecurityFilterChain filterChain(HttpSecurity httpSecurity) throws Exception {
        return httpSecurity.authorizeHttpRequests(
                request -> request
                    .requestMatchers("/api-docs", "/api-docs/**").permitAll()
                    .requestMatchers("/swagger-ui/*").permitAll()
                    .requestMatchers("/api/v1/default").permitAll()
                    .requestMatchers("/**").authenticated()
            )
            .csrf(CsrfConfigurer::disable)
            .build();
    }

    // Relaxed filter chain, active only with the openapi Spring profile (used during spec generation)
    @Profile("openapi")
    @Bean
    public SecurityFilterChain filterChainOpenApi(HttpSecurity httpSecurity) throws Exception {
        return httpSecurity.authorizeHttpRequests(
                request -> request.anyRequest().permitAll()
            )
            .csrf(CsrfConfigurer::disable)
            .build();
    }
}

Notice the separate openapi profile. We will use it later during automated generation.

At this point, you can run the application and open Swagger UI at http://localhost:8080/swagger-ui/index.html. From there, the generated OpenAPI document is available at http://localhost:8080/api-docs. You can save that response manually and use it as your specification file.

This works, but it is repetitive and not very practical for build automation. So let us move to the more useful approach: generating the spec during the Maven build.

Automatic Generation

To generate an OpenAPI file automatically, it helps to understand what actually happens during the build. The springdoc-openapi-maven-plugin does not generate the specification out of thin air. It calls the application endpoint that exposes the OpenAPI definition. In other words, your Spring Boot application must be running while the plugin executes. That is why the spring-boot-maven-plugin and springdoc-openapi-maven-plugin are typically used together.

Because the application has to be started during the build, the security configuration must also allow the documentation endpoint to be accessed in that scenario. This is exactly why the separate openapi Spring profile is useful.
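Conceptually, the generation step boils down to an HTTP GET against the running application, followed by writing the response body to disk. The sketch below illustrates that idea in plain Java. It is not the plugin's actual implementation, and OpenApiFetcher and its fetch method are hypothetical names introduced only for this illustration. It assumes Java 11 or later and an application already listening on localhost:8080 with /api-docs.yaml reachable without authentication.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Springdoc: it only mimics the idea behind the plugin's generate goal.
final class OpenApiFetcher {

    // Downloads the OpenAPI document from a running application and writes it to outputFile.
    static Path fetch(URI apiDocsUrl, Path outputFile) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(apiDocsUrl).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Anything other than 200 means the document could not be fetched (for example, blocked by security rules).
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " from " + apiDocsUrl);
        }

        // Make sure the target directory exists before writing the file.
        Path parent = outputFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.writeString(outputFile, response.body());
    }

    public static void main(String[] args) throws Exception {
        // Assumed URL and output location, mirroring the values used in this article.
        Path written = fetch(URI.create("http://localhost:8080/api-docs.yaml"), Path.of("target", "openapi.yml"));
        System.out.println("Wrote " + written);
    }
}
```

If the security configuration still protects the endpoint, a request like this would typically come back with a 401 or 403 instead of the document, which is the situation the relaxed openapi Spring profile is meant to avoid.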
Add a Dedicated Maven Profile

Add the following Maven profile to pom.xml:

XML
<profile>
    <id>openapi</id>
    <properties>
        <maven.test.skip>true</maven.test.skip>
    </properties>
    <build>
        <plugins>
            <!-- When the Maven profile is openapi, run Spring with the openapi profile -->
            <plugin>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <groupId>org.springframework.boot</groupId>
                <configuration>
                    <jvmArguments>
                        -Dspring.application.admin.enabled=true
                        -Dspring.profiles.active=openapi
                    </jvmArguments>
                </configuration>
                <executions>
                    <execution>
                        <id>pre-integration-test</id>
                        <goals>
                            <goal>start</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>post-integration-test</id>
                        <goals>
                            <goal>stop</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Generate the OpenAPI file during the build -->
            <plugin>
                <artifactId>springdoc-openapi-maven-plugin</artifactId>
                <groupId>org.springdoc</groupId>
                <version>1.4</version>
                <configuration>
                    <skip>false</skip>
                    <apiDocsUrl>http://localhost:8080/api-docs.yaml</apiDocsUrl>
                    <outputDir>${project.build.directory}</outputDir>
                    <outputFileName>openapi.yml</outputFileName>
                </configuration>
                <executions>
                    <execution>
                        <id>integration-test</id>
                        <goals>
                            <goal>generate</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</profile>

The important parts here are:

- We create an openapi Maven profile and an openapi Spring profile, but they are not the same thing (and they do not need those exact names, nor do they need to share one name).
- When the openapi Maven profile is active, the Spring app runs with the openapi Spring profile (see jvmArguments).
- -Dspring.profiles.active=openapi enables the relaxed security profile created specifically for documentation generation.
- apiDocsUrl points to the endpoint that returns the OpenAPI document.
- outputDir and outputFileName control where the generated file is written.

These are the exact parts I struggled to find in one place, hence the "recipe" article.
Run the Generation Step

Once the profile is in place, generating the spec is easy:

Shell
./mvnw verify -Popenapi

After the build completes, the generated OpenAPI spec should be here:

./target/openapi.yml

Using It in a CI Pipeline

This setup is CI-friendly because the same command can run locally and in your pipeline:

Shell
./mvnw verify -Popenapi

From there you can archive target/openapi.yml as a build artifact, publish it to an artifact repository, or pass it to frontend code generators, mock servers, and contract verification jobs.

Conclusion

Generating OpenAPI documentation manually from Swagger UI is fine for quick inspection, but it does not scale well when you need repeatability. By wiring Spring Boot and Springdoc into a dedicated Maven profile, you can generate the specification automatically during the build in your CI. That gives you a reliable OpenAPI artifact that can support client generation, service mocking, and contract verification without adding a separate manual step to the development workflow.

Bonus: Represent Set as an Array

In some cases, you may want a Set to be represented as a regular array in the generated OpenAPI specification instead of an array with uniqueItems: true. This can be useful when downstream tools expect a plain array schema (this is the exact request I once got from the frontend team).
You can customize Springdoc behavior with a small configuration class:

Java
import org.springdoc.core.utils.SpringDocUtils;
import org.springframework.context.annotation.Configuration;
import io.swagger.v3.oas.models.media.Schema;

import java.util.Collections;
import java.util.Set;

// Registered as a Spring bean so that the constructor runs at application startup
@Configuration
public class SwaggerConfig {

    // Make springdoc generate an Array schema for Set.class
    // and remove uniqueItems: true
    public SwaggerConfig() {
        var schema = new Schema<Set<?>>();
        schema.type("array").example(Collections.emptyList());
        SpringDocUtils.getConfig().replaceWithSchema(Set.class, schema);
    }
}

With this adjustment in place, the generated schemas for Set will be emitted as an array, which can simplify integration with some client generators and consumers.
Modern software systems rarely fail due to poor coding skills. Most failures occur when teams lose sight of the business problem they are addressing. As systems evolve, requirements shift, teams expand, and new integrations are added, codebases often become collections of technical decisions that lack business context. Classes become generic managers and services, methods devolve into procedural scripts, and communication between developers and domain experts diminishes. Tactical Domain-Driven Design (DDD) addresses this issue by emphasizing software that directly reflects business language in code, rather than focusing solely on infrastructure or frameworks. The term “semantic” comes from the Greek semantikos, meaning “significant” or “meaningful,” which is central to Tactical DDD. The objective is not just to reorganize classes, but to ensure code communicates intent clearly to both engineers and business experts. In modern Java systems, where complexity increases due to distributed architectures, integrations, and ongoing business changes, this clarity is essential for long-term maintainability. Tactical DDD provides practical patterns, such as entities, value objects, aggregates, repositories, factories, and domain services, to preserve codebase meaning and manage complexity. This article will examine these patterns step by step using Java and a soccer championship scenario to show how semantic code improves system understanding, evolution, and maintenance. Entity Before applying Tactical DDD patterns, it is important to recognize that they should not be the starting point of the design process. A common mistake in software projects is to begin with entities, repositories, and aggregates without first understanding the business. Tactical patterns serve as implementation tools, not discovery tools. Strategic DDD should begin with defining domain boundaries, the ubiquitous language, and the business context. 
Only after clarifying the problem space should you translate that understanding into code using tactical patterns.

An Entity is a core Tactical DDD pattern. The term originates from the Latin ens, meaning “being” or “existing thing.” In software design, it refers to an object that maintains its identity throughout its lifecycle. An entity is defined not by its current attributes, but by the business’s recognition of it as the same conceptual object over time.

Entities are useful when the domain must track the lifecycle of something important to the business. In a soccer championship, a player is a clear example of an entity. A player may change teams, positions, salary, or statistics during a career, but the system continues to recognize the player as the same individual within the domain. Therefore, identity is more important than changes to attributes. The following Java class illustrates this concept:

Java
import java.util.UUID;

public class Player {

    // The identifier never changes, so it is final
    private final UUID id;
    private String name;
    private Position position;

    public Player(UUID id, String name, Position position) {
        this.id = id;
        this.name = name;
        this.position = position;
    }
}

The Player class demonstrates the Entity pattern by using a unique identifier in the id field. This identifier enables the application to distinguish one player from another, regardless of changes to other attributes. While name and position may change, the identity remains constant. This characteristic defines the object as an entity rather than a simple data structure or value object.

Value Object

Entities are defined by identity, but not all domain concepts require lifecycle tracking or unique identification. Many business concepts describe characteristics, measurements, classifications, or immutable meanings. The Value Object pattern addresses these cases. Here, “value” means the object is defined solely by its attributes, not by identity.
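To make the contrast between identity and attribute equality concrete, here is a minimal sketch, assuming Java 17 or later. The names PlayerEntity, ShirtNumber, and IdentityDemo are hypothetical and exist only for this illustration; the article's own Player class does not define equals or hashCode, so the identity-based equality shown here is one possible way to express the idea, not the article's implementation.

```java
import java.util.Objects;
import java.util.UUID;

// Value object: a record compares by its components, so two instances with the same value are equal.
// ShirtNumber is a hypothetical concept used only in this sketch.
record ShirtNumber(int value) {
    ShirtNumber {
        if (value < 1 || value > 99) {
            throw new IllegalArgumentException("Shirt number must be between 1 and 99");
        }
    }
}

// Entity: equality depends on the identifier only, never on mutable attributes.
final class PlayerEntity {
    private final UUID id;
    private final String name;
    private ShirtNumber shirtNumber;

    PlayerEntity(UUID id, String name, ShirtNumber shirtNumber) {
        this.id = Objects.requireNonNull(id, "id");
        this.name = Objects.requireNonNull(name, "name");
        this.shirtNumber = Objects.requireNonNull(shirtNumber, "shirtNumber");
    }

    // Attributes may change over time without affecting identity.
    void changeShirtNumber(ShirtNumber newNumber) {
        this.shirtNumber = Objects.requireNonNull(newNumber, "newNumber");
    }

    ShirtNumber shirtNumber() {
        return shirtNumber;
    }

    String name() {
        return name;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof PlayerEntity that && id.equals(that.id);
    }

    @Override
    public int hashCode() {
        return id.hashCode();
    }
}

final class IdentityDemo {
    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        PlayerEntity before = new PlayerEntity(id, "Ana", new ShirtNumber(9));
        PlayerEntity after = new PlayerEntity(id, "Ana", new ShirtNumber(10));

        // Same identifier, different attributes: still the same entity.
        System.out.println(before.equals(after)); // true

        // Value objects are interchangeable when their values match.
        System.out.println(new ShirtNumber(9).equals(new ShirtNumber(9)));  // true
        System.out.println(new ShirtNumber(9).equals(new ShirtNumber(10))); // false
    }
}
```

The record also validates its own range, which hints at how a value object can carry a small domain rule instead of leaving it to every caller.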
In Tactical DDD, value objects reduce the need for primitives and clarify the domain language within the codebase. A Value Object is an immutable object that represents a descriptive aspect of the domain. Unlike entities, two value objects with identical values are considered the same. Value objects are often used for concepts such as money, addresses, coordinates, statuses, measurements, or classifications. Their primary purpose is to improve semantic clarity and encapsulate domain rules for specific concepts. In a soccer championship scenario, player position is a good example of a value object because the application does not need to track its lifecycle. The domain is concerned only with the meaning of the value. The following Java enum illustrates this concept: Java public enum Position { GOALKEEPER, DEFENDER, MIDFIELDER, FORWARD } The Position enum applies the Value Object pattern by representing a business classification rather than using primitive strings throughout the codebase. By using explicit types instead of raw text, such as "forward" or "goalkeeper", the application improves readability, reduces invalid states, and reinforces a shared language between developers and domain experts. Factory As domain models evolve, object creation often requires more than calling a constructor. Business rules, validations, default values, and initialization steps may spread into controllers, services, or application layers, leading to duplication and fragmented domain knowledge. The Factory pattern centralizes object creation, ensuring new domain objects are valid and meaningful. A Factory is a Tactical DDD pattern that encapsulates the creation logic of entities or aggregates. Its purpose is not just to “hide the constructor,” but to express domain intent during object creation. Originating from manufacturing, factories ensure objects are assembled correctly before use. In software, this approach maintains consistency and enforces business rules during instantiation. 
In a soccer championship scenario, creating a player requires more than allocating memory. A new forward player must have the correct position and a unique identity. Centralizing this logic in a factory prevents duplication across the system. Java import java.util.UUID; public class PlayerFactory { public Player createForward(String name) { return new Player( UUID.randomUUID(), name, Position.FORWARD ); } } PlayerFactory implements the Factory pattern by encapsulating the details of Player creation. The application layer does not need to manage identifier generation or position assignment for a forward player. The method name communicates business intent, allowing the code to express a meaningful domain operation rather than low-level construction details. Aggregate Root As systems scale, maintaining consistency between related entities becomes more challenging. Without clear boundaries, business rules can spread across services, repositories, and transactions. Tactical DDD addresses this by defining explicit consistency boundaries within the domain model. Aggregate and Aggregate Root patterns are essential to this approach. An Aggregate is a group of related entities and value objects managed as a single consistency boundary. The Aggregate Root serves as the main entry point, coordinating and protecting the aggregate’s internal state. In practice, the aggregate root ensures controlled modifications and maintains business rule consistency during state changes. In a soccer championship scenario, a team acts as an aggregate root by managing the lifecycle and consistency of its players. 
The application should not modify the player collection directly; all changes should occur through the team’s defined behaviors: Java import java.util.ArrayList; import java.util.Collections; import java.util.List; public class Team { private final TeamId id; private final String name; private final List<Player> players; public Team(TeamId id, String name, List<Player> players) { this.id = id; this.name = name; this.players = new ArrayList<>(players); } public List<Player> getPlayers() { return Collections.unmodifiableList(players); } public void remove(Player player) { players.remove(player); } public void add(Player player) { players.add(player); } } The Team class implements the Aggregate Root pattern by controlling access to its player collection. Rather than allowing direct modification, it exposes the list as a read-only view and routes every change through add and remove; the constructor also copies the incoming list so callers cannot alter the roster through their own reference. This approach gives business rules a single place to evolve and keeps the aggregate consistent across the application. The aggregate root safeguards the domain boundary and manages the lifecycle of its players. Repository One of the biggest challenges in enterprise applications is avoiding tight coupling between business logic and persistence concerns. Over time, SQL queries, database operations, caching logic, and infrastructure details can start leaking into the domain layer, making the code harder to maintain and evolve. Tactical DDD addresses this problem with the Repository pattern, which provides a collection-like abstraction for managing aggregates. The term “repository” originates from the Latin repositorium, meaning “a place where things are stored.” In Domain-Driven Design, a repository is not merely a DAO or a utility class for executing queries. Its primary goal is to provide access to aggregates while hiding infrastructure complexity from the domain model. A repository allows the application to work with domain concepts instead of persistence mechanisms, preserving the separation between business logic and technical implementation. 
In the soccer championship scenario, the application needs a mechanism to persist and retrieve teams without exposing database details to the business flow. Since Team acts as the aggregate root, the repository is responsible for managing it as a consistency boundary: Java public interface TeamRepository { Team save(Team team); } The TeamRepository applies the Repository pattern by abstracting persistence operations behind a domain-oriented contract. The application layer does not need to know whether the data is stored in PostgreSQL, MongoDB, Redis, or another technology. More importantly, the repository communicates the business intention directly through the aggregate itself. Instead of manipulating tables or records, the code works with meaningful domain concepts such as Team, preserving the semantic clarity of the model and reducing coupling between the domain and infrastructure layers. Domain Service Not every business operation fits within an entity or aggregate root. As the domain evolves, some rules involve multiple entities or coordination logic that do not belong to a single object. Assigning these responsibilities to entities can lead to bloated models and reduced cohesion. Tactical DDD addresses this with the Domain Service pattern. A Domain Service contains business logic that does not belong to a specific entity or value object, but remains part of the domain model. Its role is to execute meaningful business operations involving multiple domain objects, not to handle technical orchestration or infrastructure. In DDD, a service encapsulates domain behavior across aggregates while maintaining the model’s clarity. In the soccer championship scenario, transferring a player between teams is a business operation involving multiple aggregates. The responsibility does not belong exclusively to the player or to a single team. 
Instead, the operation represents a domain action coordinating both source and destination teams: Java public class TransferService { public void transfer( Player player, Team source, Team destination) { source.remove(player); destination.add(player); } } TransferService implements the Domain Service pattern by encapsulating the business logic for transferring a player between teams. This service expresses the domain concept directly, rather than spreading logic across controllers or application layers. The method communicates business intent clearly using the domain’s ubiquitous language. Instead of exposing low-level details, the code now reflects a meaningful operation recognized by both developers and business experts: transferring a player during the championship lifecycle. Domain Event In complex systems, important business actions rarely affect only a single part of the application. A change in one domain often triggers reactions in other contexts, such as notifications, analytics, integrations, auditing, or external workflows. Directly coupling these concerns creates rigid architectures in which every new requirement increases dependencies across the system. Tactical DDD addresses this challenge with the Domain Event pattern. A Domain Event represents an important event that has already occurred within the business domain. The emphasis on the past tense is intentional because events describe facts, not commands or intentions. The term “event” originates from the Latin eventus, meaning “outcome” or “occurrence.” In Domain-Driven Design, domain events allow systems to communicate meaningful business changes while reducing coupling between components and bounded contexts. Instead of directly invoking every dependent operation, the domain publishes events that other parts of the system may react to independently. In the soccer championship scenario, hiring a new player is an important business occurrence that other parts of the system may care about. 
The championship may want to notify fans, update statistics, trigger merchandising actions, or synchronize with external systems. Instead of embedding all these responsibilities directly into the transfer logic, the application can represent the occurrence explicitly through a domain event: Java public record NewSoccerHired( Team team, Player player) { } The event can then be published once the business operation finishes successfully: Java eventPublisher.publish( new NewSoccerHired(destination, player) ); The NewSoccerHired record applies the Domain Event pattern by representing a meaningful business fact inside the domain model. Instead of tightly coupling multiple responsibilities, the system now exposes a semantic business occurrence that other parts of the architecture can react to independently. This approach improves extensibility, reduces direct dependencies, and preserves the ubiquitous language across the application lifecycle. Application Service As systems evolve, business operations often require coordination across domain components, persistence, and integration points. Without a clear orchestration layer, this logic may spread across controllers, APIs, and infrastructure classes, resulting in tightly coupled, hard-to-maintain applications. Tactical DDD addresses this with the Application Service pattern. An Application Service orchestrates use cases and coordinates domain operations. Unlike domain services, which encapsulate business rules, an application service manages the execution flow of business actions. In DDD, it serves as the coordination layer, connecting repositories, domain operations, and external interactions, while keeping the domain model focused on business behavior. In a soccer championship scenario, transferring a player between teams requires more than one business rule. This operation coordinates transfer logic, persistence, and event publication. 
The following class centralizes this orchestration in a single use case: Java public class TransferPlayerUseCase { private final TeamRepository teamRepository; private final TransferService transferService; private final EventPublisher eventPublisher; public TransferPlayerUseCase( TeamRepository teamRepository, TransferService transferService, EventPublisher eventPublisher) { this.teamRepository = teamRepository; this.transferService = transferService; this.eventPublisher = eventPublisher; } public void execute( Player player, Team source, Team destination) { transferService.transfer(player, source, destination); teamRepository.save(source); teamRepository.save(destination); eventPublisher.publish( new NewSoccerHired(destination, player) ); } } The TransferPlayerUseCase demonstrates the Application Service pattern by orchestrating the entire player transfer process. Rather than placing orchestration logic in controllers or entities, this class coordinates domain operations, persistence, and event publication within a single workflow. The method represents a meaningful business action within the domain: transferring a player between teams during the championship. Conclusion Tactical Domain-Driven Design does not aim to add unnecessary complexity or apply patterns indiscriminately. Its purpose is to help engineers build software that clearly communicates business meaning through code. By introducing concepts such as entities, value objects, factories, aggregates, repositories, domain services, domain events, and application services, developers create systems that are easier to understand, maintain, and adapt as business needs evolve. Tactical DDD also bridges the gap between technical and business perspectives, making code a semantic representation of the domain. This article introduces core Tactical DDD patterns using Java and a soccer championship scenario. 
The aim is not to cover every aspect of Domain-Driven Design, but to show how these patterns help build expressive and maintainable systems. As projects become more complex, preserving business meaning within the codebase is increasingly important, especially in modern distributed architectures where technical complexity can obscure domain language.
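As a closing illustration, the patterns above can be exercised end to end without a database or message broker. The following is a minimal sketch, not the article's implementation: InMemoryTeamRepository, RecordingEventPublisher, and TransferDemo are hypothetical names, the shapes of Player and TeamId are assumed because they are not shown in full here, and the EventPublisher contract (a single publish(Object) method) is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

enum Position { GOALKEEPER, DEFENDER, MIDFIELDER, FORWARD }

// Assumed shapes: Player and TeamId are modeled as simple records for this sketch.
record Player(UUID id, String name, Position position) { }
record TeamId(UUID value) { }

class Team {
    private final TeamId id;
    private final String name;
    private final List<Player> players;

    Team(TeamId id, String name, List<Player> players) {
        this.id = id;
        this.name = name;
        this.players = new ArrayList<>(players); // copy so callers cannot mutate the roster
    }

    TeamId getId() { return id; }
    String getName() { return name; }
    List<Player> getPlayers() { return Collections.unmodifiableList(players); }
    void add(Player player) { players.add(player); }
    void remove(Player player) { players.remove(player); }
}

record NewSoccerHired(Team team, Player player) { }

interface TeamRepository { Team save(Team team); }

// Assumed contract: the article calls publish(...) but does not define the interface.
interface EventPublisher { void publish(Object event); }

// Hypothetical repository backed by a map instead of a real database.
class InMemoryTeamRepository implements TeamRepository {
    private final Map<TeamId, Team> store = new HashMap<>();

    @Override
    public Team save(Team team) {
        store.put(team.getId(), team);
        return team;
    }

    int size() { return store.size(); }
}

// Hypothetical publisher that only records events so they can be inspected.
class RecordingEventPublisher implements EventPublisher {
    private final List<Object> events = new ArrayList<>();

    @Override
    public void publish(Object event) { events.add(event); }

    List<Object> events() { return Collections.unmodifiableList(events); }
}

class TransferService {
    void transfer(Player player, Team source, Team destination) {
        source.remove(player);
        destination.add(player);
    }
}

class TransferDemo {
    // Same sequence as the application service: domain operation, persistence, event.
    static void transfer(Player player, Team source, Team destination,
                         TeamRepository repository, EventPublisher publisher) {
        new TransferService().transfer(player, source, destination);
        repository.save(source);
        repository.save(destination);
        publisher.publish(new NewSoccerHired(destination, player));
    }

    public static void main(String[] args) {
        Player player = new Player(UUID.randomUUID(), "Example Forward", Position.FORWARD);
        Team source = new Team(new TeamId(UUID.randomUUID()), "Source FC", List.of(player));
        Team destination = new Team(new TeamId(UUID.randomUUID()), "Destination FC", List.of());
        RecordingEventPublisher publisher = new RecordingEventPublisher();
        transfer(player, source, destination, new InMemoryTeamRepository(), publisher);
        System.out.println("Destination roster size: " + destination.getPlayers().size());
        System.out.println("Events published: " + publisher.events().size());
    }
}
```

Because the domain classes depend only on the TeamRepository and EventPublisher interfaces, swapping these in-memory stand-ins for real infrastructure should not require touching Team or TransferService.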
TL;DR: Token Economics in the Era of Scarcity Your Claude Pro subscription hits limits faster than it did in January, as Anthropic quietly re-priced the ceiling, and every AI provider is rationing compute. If you keep working with Claude the way you did six months ago, you are in for a rude awakening. This article gives you four principles that explain how Token Economics actually works, so you can stop accepting the black box and start using your budget deliberately. Token Economics Principle 1: Every Turn Re-Consumes Everything Before It Claude does not remember your conversation the way a human colleague does. Every time you send a message, Claude reads the entire conversation again from the top: your first question, Claude’s first answer, your second question, and so on. Message 30 pays to re-read messages 1 through 29 before it even starts working on your new question. A Concordia University research team measured this directly in a multi-agent coding system running on GPT-5 Reasoning, finding that input tokens made up 53.9% of total token consumption across 30 software development tasks. More than half of the budget went to re-consuming context rather than generating new output. The exact ratio will vary across Claude products and use cases, but the mechanism remains the same. This effect is why “start a new chat when the topic changes” is the single most repeated piece of advice in every article on this subject. The advice is not about organization, but about economics. Token Economics Principle 2: The Context Window Is a Shared Container with Inputs You Cannot See You think of your prompt as what you type. Claude sees something much larger. Every file Claude reads during a session stays in context for the rest of that session. 
Every tool output, every connector response, every Search result, every artifact you generated three turns ago, the system prompt you never wrote, the CLAUDE.md or Project instructions you uploaded once and forgot about, the Memory feature’s silent additions, and the entire message history. All of it shares the same finite window. Most of it is invisible to you in the interface. Jenny Ouyang, who writes about Claude Code after receiving a $1,600 API bill in two months, ranks tool call outputs as the single largest drain on token budgets. She puts them above the conversation length. A 10,000-line log file Claude reads early in a session stays in context for every subsequent message. On Claude.ai, the equivalent is a large PDF you uploaded to a chat. Anthropic’s own token-counting documentation shows a 51-page PDF (a Tesla quarterly SEC filing used as an example) counted at roughly 119,000 tokens, or about 2,300 tokens per page. A standard JPEG image runs around 1,550 tokens for a typical photograph. Upload the same 15-page PDF into four different chats because you forgot you already did, and you have paid for it four times. Paweł Huryn, who built an open-source dashboard that reads Claude Code transcripts locally, writes that /usage does not break down tokens by model, project, or session. You hit a limit and have no direct way to see what caused it. Huryn’s dashboard showed a single-day spike of 700 million cached tokens on his account, which turned out to be an Anthropic bug, not his usage. Without the dashboard, he would not have noticed. What Claude writes also counts. Verbose responses, extended thinking output, generated artifacts, and the results of Research sessions all consume the budget on the way out, and then become part of the conversation history that gets re-read on the next turn. 
Output tokens are billed at exactly five times the rate of input tokens on Anthropic’s current API across Opus, Sonnet, and Haiku; on a subscription, that cost hides inside the usage meter, but the mechanism is the same: You pay for a 2,000-word response you did not actually need several times over, once when it is written, and once on every subsequent turn that re-reads it. That is the condition your audience is working in. The container is shared, most of its contents are invisible, and the tools to inspect it exist only for API users who are willing to build them. Pro and Max subscribers fly blind. Token Economics Principle 3: Stable Context Is Cheap; Changing Context Is Expensive Anthropic’s caching system gives a large discount to contexts that stay identical across requests. Cache reads cost roughly 10 percent of the base input token price. Cache writes cost 25 percent more than base input, paid once and amortized across every subsequent hit. The default cache lifetime is five minutes, extensible to one hour at additional cost. The caching hierarchy processes requests in a fixed order: tools, then system prompt, then message history. A change early in the order invalidates everything after it. Rearrange your system prompt, add a new MCP server, upload a new file to your Project, and the cached prefix breaks. The next request rebuilds the cache from the first changed byte onward, at full cost. This mechanism explains why the same task can cost differently on two different days. You stepped away for 30 minutes for a coffee. The cache expired. Your next message rebuilt the entire context at write cost instead of read cost. Piunikaweb reports that Anthropic’s Thariq Shihipar attributed some of the extreme session-drain cases users reported in late March to “expensive prompt cache misses” when resuming long conversations with large context windows. On Claude.ai specifically, you cannot place cache breakpoints yourself. 
What you can do is behave in ways that make caching work: Keep your persistent context (Project instructions, about-me files, CLAUDE.md) short and stable. Do not reorder files you upload. Do not take long breaks in the middle of heavy work. Finish a session before it drifts off-task. Claude Projects deserve a separate note because most articles get them wrong. On paid plans, Projects use retrieval-augmented generation (RAG), but only when your uploaded knowledge “approaches or exceeds” the context window limit, which sits around 200,000 tokens. Anthropic does not publish the exact trigger point, and it may shift. Below that threshold, every file in the Project loads into context on every single prompt. Above it, Claude retrieves only the relevant chunks, and a visual indicator appears in the interface. The practical consequence: if you sit below the threshold, fewer and shorter Project files are strictly better, because you are paying for all of them on every turn. If you sit above it, you can add more material without linear cost. The advice you see that treats Projects as automatic efficiency magic is wrong for most Pro users, whose Projects contain a few style guides and reference docs and live well under the threshold. The worst place to sit is just below the threshold. A Project near the 200,000-token line pays the full cost of every file on every prompt, without the retrieval efficiency that kicks in when RAG activates. Call this the Valley of Death. If you find yourself there, you have three reasonable moves: Trim the Project aggressively, down to a quarter of the threshold, so the per-prompt cost is contained. Trim is right when most of your work uses a small, stable set of references. Pad the Project with genuinely useful reference material to cross the threshold and trigger RAG mode. Pad is right when you have a genuinely large knowledge base that Claude needs to draw from across sessions. Often the best move: partition. 
Split one bloated Project into several task-specific ones. A marketing Project carrying 180,000 tokens of brand voice, social copy guidelines, and competitor research is really three Projects pretending to be one. Break them apart, and you stay far below the threshold in each, and Claude stops re-reading competitor research every time you draft a tweet. Partition is right when the content in the Project serves different tasks that rarely need each other. What is not defensible is leaving the Project to drift at the threshold line, paying the maximum cost for the minimum efficiency. Token Economics Principle 4: Scarcity Is Structural, Not Cyclical Flat-rate generosity was the marketing of a land-grab phase. It was never the steady state. Tomasz Tunguz, a venture capitalist writing about AI infrastructure, calls what is happening now “the beginning of scarcity in AI.” He names five hallmarks: relationship-based selling (SOTA models gated to privileged customers), AI to the highest bidder, available-but-slow access, inflationary pricing, and forced diversification toward smaller or self-hosted models. Quote: “The age of abundant AI is over, and it will remain so for years.” The PYMNTS coverage of the same period describes it as “AI rationing” and notes that Google, Anthropic, and others are simultaneously publishing explicit daily prompt caps where vague access language used to stand. Anthropic’s April 4 lockout of third-party subscription routing fits the pattern: subscription access is being actively defended as a retail product, and arbitrage through automation tooling is being closed off. Your Pro subscription in April 2026 is not the same product as your Pro subscription in January 2026. The marketing copy is the same, but the economic reality underneath it has shifted. If your work with Claude was built on the January assumption, it is now running on borrowed time. That reality changes the question the user should be asking. 
The old question was “how do I save tokens?” That question treats tokens as a cost to minimize. The more useful question is “what is the return on intelligence per token?” Every token you spend should buy intelligence worth paying for. Fifty thousand tokens to draft a routine email that a template would have produced is economically illiterate, regardless of whether you hit your limit. Five thousand tokens to decode a difficult incentive structure before a hard conversation is a high return. The discipline is not “use fewer tokens.” The discipline is knowing what you are buying. In a scarcity regime, that judgment is what separates a professional user from a consumer. The Visibility Problem Scarcity plus opacity produces anxiety, and that is the condition your audience is working in. Pro and Max subscribers have no per-prompt token breakdown. No real-time usage indicator. No way to know whether a given message hit the cache or missed it. The only signal is the usage meter, which moves in discrete steps and resets on a rolling schedule that varies by time of day. You cannot measure what you cannot see. Classical optimization advice assumes instrumentation: measure first, then optimize the highest-impact areas. That advice is sound for API users with dashboards and for production LLM applications where a team can A/B test prompt variants. It does not apply to a product manager using Cowork on a Pro plan. That user cannot measure. What remains, for them, is informed default behavior. Habits of mind that keep the usage meter below the waterline without requiring instruments. Which brings me back to my favorite claim: This is not optimization for token economics in the engineering sense, but human judgment. The Counter-Argument and Why It Is Partly Right A note before the counter-argument: setting context for an LLM is not planning. It is handing the model the substrate it needs to reason. 
The iteration you do inside that substrate is still iterative, still wicked, still agile. Context engineering scopes the problem; it does not specify the solution. There is a real counter-argument to everything in this article on Token Economics. It goes roughly like this: optimizing for tokens is premature optimization. The real constraint on your work with Claude is the quality of your thinking, not the quantity of your tokens. Compress your prompts, and you will confuse the model, get worse answers, and spend more tokens on retries. Quality-first, cost-second. The counter-argument is partly right. Harsh or unclear prompts lead to worse work. A vague 15-word prompt that forces Claude to ask three clarifying questions costs more in total than a precise 60-word prompt that works on the first try. Premature optimization is real, and optimizing against a metric you cannot see is a recipe for false economy. Do not strip a prompt of context in the name of token hygiene if the context was load-bearing. The counter-argument is wrong about one thing: in a scarcity regime, clear thinking and disciplined token use are the same skill, not competing ones. A well-framed problem consumes fewer tokens because clarity is itself compressive. A developer who uses Claude efficiently is not cutting corners. They are demonstrating the exact engineering judgment senior engineers have always demonstrated: understand the problem before articulating it, decompose cleanly, provide the right context and no more, and evaluate output critically. The token count is a side effect of clear thinking. Or, let me rephrase, the thinking is the point of the exercise. This approach matters for your audience because it reframes the question. Token discipline is not a penny-pinching habit imposed by scarcity. It is an observable signal of professional competence with AI, in the same way that scope discipline is a signal of professional competence in Sprint Planning. 
Judgment as the Professional Response You have seen this pattern before in other domains: flat-rate became metered, generous became gated. The internet went from unlimited to capped. Enterprise software went from site licensing to seat pricing. The pattern always arrived with the same signal: the previous model was subsidizing growth, growth slowed, and the economy had to surface. AI is now there. The response that works is the response that worked in the previous cycles: develop judgment about the resource before you are forced to; learn about Token Economics. Four practices, grouped by principle, deserve to become habits: On Principle 1 (Every Turn Re-Consumes Everything Before It) One topic per chat. Start a new conversation when the subject changes. At the end of a substantive session, ask Claude to write a short notes file covering decisions and next steps. Start the next session by loading that file. You carry forward exactly what matters and leave the rest behind. (Of course, you can write a skill for that job as I did.) On Principle 2 (Hidden Inputs Share the Container) Do not load the context Claude does not need. Select only the Project files relevant to the current task. Turn off Search, connectors, and extended thinking when you do not need them. Convert PDFs and screenshots to plain text before uploading, where possible. If you read your own files through Claude for your own eyes, use a script or the file system directly. Claude does not need to be in the loop for reading. For long outputs, use Skeleton-of-Thought: ask Claude for the outline and key data points first, review it, then expand only the sections you actually need, ideally in a new, clean chat. This way treats the token budget as a surgical tool rather than a firehose, and it costs far less than asking for a 2,000-word report that you then have to read, correct, and discard in part. 
When you do need short output, constrain it explicitly: “top three bullet points, no commentary,” “the table only, no preamble.” Claude defaults to thorough; however, thoroughness has a price, and you pay it twice, once when the response is generated and again every time it is re-read on subsequent turns. On Principle 3 (Stable Context Is Cheap) Keep your Project instructions and persistent context short. If your about-me file or Project instructions have grown into thousands of words over time, trim them aggressively. The cost of carrying that weight is paid on every single prompt. Do not reorder or re-upload files mid-session. Finish heavy work in one sitting rather than across a two-hour gap that kills the cache. On Principle 4 (Scarcity Is Structural) Default to Haiku, escalate on demand. Run logic and structure checks through Haiku first, where speed and quota concerns barely register. Once the approach is sound, move the refined prompt to Sonnet for most daily work, and reserve Opus for the cases where Sonnet has visibly failed, or the reasoning is genuinely hard. Starting every session with Opus is a 2024 luxury that does not survive the 2026 peak-hour regime. Plan in Chat and execute in Cowork or artifacts, because the expensive surfaces should do only the work that needs them. If you run heavy automation on a schedule, move it to off-peak hours. The premium you pay for a Max subscription can easily be repaid by a single week of not hitting limits on Pro. Notice what these suggestions are not: they are not hacks. They are not a checklist to cross off before Monday. They are the professional defaults of someone who has internalized that the machine they are working with has a finite, invisible, and actively shrinking budget, and who has decided to work within that reality instead of against it. Judgment is a human thing; the tool is neutral, but your competence with the tool is not. 
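To put rough numbers on Principles 1 and 3, the sketch below adds up what a conversation re-reads and what the caching ratios quoted earlier do to that total. It is a simplified model under stated assumptions, not Anthropic's billing logic: every turn is assumed to add a fixed number of user and reply tokens (the 200 and 800 in main are hypothetical), costs are expressed in units of the base input token price rather than dollars, the warm-cache case assumes the previous request's full input is a cache hit and only the tokens added since are written, and system prompts, tools, files, and cache expiry are ignored.

```java
class TokenCostSketch {
    // Ratios quoted in the article, relative to the base input token price.
    static final double CACHE_READ_FACTOR = 0.10;   // cache reads: roughly 10 percent of base input
    static final double CACHE_WRITE_FACTOR = 1.25;  // cache writes: 25 percent more than base input
    static final double OUTPUT_FACTOR = 5.0;        // output: five times the input rate

    // Input sent on a given turn: every earlier message and reply, plus the new message.
    static long inputTokensAtTurn(int turn, long userTokens, long replyTokens) {
        return (turn - 1) * (userTokens + replyTokens) + userTokens;
    }

    // Principle 1: total input tokens processed across the whole conversation.
    static long cumulativeInputTokens(int turns, long userTokens, long replyTokens) {
        long total = 0;
        for (int turn = 1; turn <= turns; turn++) {
            total += inputTokensAtTurn(turn, userTokens, replyTokens);
        }
        return total;
    }

    // Principle 3, idealized: the previous request's input is read from cache,
    // and only the tokens added since then are written to it.
    static double warmCacheInputCost(int turns, long userTokens, long replyTokens) {
        double cost = 0.0;
        long previousInput = 0;
        for (int turn = 1; turn <= turns; turn++) {
            long input = inputTokensAtTurn(turn, userTokens, replyTokens);
            cost += CACHE_READ_FACTOR * previousInput
                  + CACHE_WRITE_FACTOR * (input - previousInput);
            previousInput = input;
        }
        return cost;
    }

    // A single turn after the cache has expired: the whole context is written again.
    static double coldTurnInputCost(int turn, long userTokens, long replyTokens) {
        return CACHE_WRITE_FACTOR * inputTokensAtTurn(turn, userTokens, replyTokens);
    }

    // Cost of generating all replies, in base-input-token units.
    static double outputCost(int turns, long replyTokens) {
        return OUTPUT_FACTOR * turns * replyTokens;
    }

    public static void main(String[] args) {
        int turns = 30;          // hypothetical conversation length
        long userTokens = 200;   // hypothetical tokens per user message
        long replyTokens = 800;  // hypothetical tokens per reply
        System.out.println("Input tokens re-read in total: "
                + cumulativeInputTokens(turns, userTokens, replyTokens));
        System.out.println("Input cost with a warm cache:  "
                + warmCacheInputCost(turns, userTokens, replyTokens));
        System.out.println("Input cost of turn 30 after a cache miss: "
                + coldTurnInputCost(turns, userTokens, replyTokens));
        System.out.println("Output cost: " + outputCost(turns, replyTokens));
    }
}
```

The absolute figures depend entirely on the assumed message sizes. The point is the shape: re-read input grows roughly with the square of the number of turns, and a single cache miss late in a long conversation costs about as much as many warm turns combined.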
Token Economics: Conclusion Pick one Token Economics practice from this article and install it this week. My suggestion: the end-of-session notes file. At the end of your next real working session with Cowork or Chat, ask Claude to summarize what you decided, what is unresolved, and what the next step is. Save the output. Start your next session with it. The practice takes ninety seconds per session and breaks the single most expensive habit in Principle 1: carrying an entire sprawling conversation into tomorrow because it is there. Do that for a month. Then come back and tell me whether the meter behaves differently. By the way, this practice is perfectly suited for creating a skill. I did so a month ago.
In this article, I am sharing what I learned while integrating a RAG-based application with LangSmith. It covers how the integration works and the key insights gained from using LangSmith for observability and evaluation. LangChain LangChain is a framework for building applications powered by large language models in a more structured and modular way. It helps developers connect LLMs with prompts, tools, memory, agents, and external data sources to create more capable applications. In simple terms, LangChain makes it easier to design, manage, and scale complex AI workflows. LangSmith What I liked about LangSmith is that it made the internal flow of my application much easier to understand, especially when I wanted to see how each step in the workflow was behaving. It gives clear visibility into prompts, model responses, chains, and agent flows, making it much easier to trace issues, debug failures, and see how an application behaves in real usage. In simple terms, LangSmith acts like an observability layer for AI apps, helping teams monitor performance, improve reliability, and build with more confidence. What also makes LangSmith especially useful is its support for dataset creation and experimentation. Teams can build and manage datasets from real use cases, then use them to test prompts, compare model responses, and evaluate how an application performs across different scenarios. This makes improvement more systematic, because instead of relying only on trial and error, developers can measure changes and make decisions with much more confidence. What I Evaluated As part of the evaluation, I integrated a RAG application with LangSmith to gain better visibility into how the system behaves at each step. I also explored how LangSmith presents tracing information and what kind of details it displays for each run. Getting Started With LangSmith To begin using LangSmith, sign up for an account on the LangSmith portal. 
Once registered, you can access the platform and start setting up your workspace for observability, dataset creation, and experimentation. Create an application in LangSmith to keep traces, datasets, and experiments organized in a more structured and manageable way. Create a project in LangSmith so that all traces related to a particular application or workflow are grouped in one place. I found this especially useful because it kept the tracing data organized and made it easier to focus on one workflow without mixing it with traces from other projects. Generate an API key for the LangSmith application, since it is required in the application code to connect with the platform. The API key can be created from the Settings section, which is available through the link at the bottom of the LangSmith portal. Configure the required LangSmith environment variables in your application by setting LANGSMITH_TRACING_V2, LANGSMITH_API_KEY, and LANGSMITH_PROJECT. This allows the application to enable tracing, authenticate with LangSmith, and associate all observability data with the correct project. Viewing Trace Data After configuring the required environment variables, run the agent workflow so that tracing data is sent to LangSmith. Once the workflow executes, you can open the newly created project and view the trace details captured for each run, including insights such as latency, token usage, and cost. If you want to focus only on LLM interactions, you can change the default view to LLM Calls from the drop-down menu. Click on a trace name to view the detailed execution information captured for that run. This includes feedback, inputs, outputs, and additional attributes, helping you understand how the workflow executed and how the model responded at each step. Attributes and Runtime Metadata The attributes section provides useful metadata about each run, including details such as the provider, model name, temperature, and LangChain library version. 
It also captures environment information like the platform version, runtime, and runtime version, which helps provide deeper technical context for every trace. Threads View If the application is thread-based, you can explore its activity through the Threads section in LangSmith. Each thread can be viewed separately, which makes it easier to follow the flow of individual conversations or interactions. This is especially useful when the application handles multiple sessions, because it isolates the history, behavior, and responses of each thread. It also gives better clarity during debugging and analysis, making it simpler to understand how a specific conversation progressed over time. I found the Threads view particularly helpful when looking at conversation-level behavior. Cost Breakdown Developers can also view the cost breakdown, which shows the number of tokens used for both input and output. This provides a clearer understanding of how much each run consumes and how that usage affects the overall cost. A key advantage of this feature is that it helps identify expensive prompts, lengthy responses, or inefficient workflows that may be increasing usage unnecessarily. With this level of visibility, teams can make better optimization decisions to control cost without compromising application quality. The cost breakdown was one of the most practical features for me because it connected model activity directly with usage. Conclusion Integrating a RAG or agent-based application with LangSmith makes it much easier to observe, understand, and improve how the system behaves in real-world usage. From tracing workflow execution to analyzing details such as token usage, latency, and overall performance, LangSmith provides a structured way to monitor and evaluate application behavior.
Overall, integrating LangSmith with my RAG application gave me a much clearer view of what was happening behind the scenes. More than just tracing requests, it helped me understand performance, token usage, and the behavior of each run in a way that felt practical and immediately useful. Disclaimer The views expressed in this article are my own and do not necessarily reflect the views of my employer.
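As a companion to the environment-variable step in Getting Started With LangSmith, here is a minimal Python sketch that sets the three variables named in this article before any traced code runs. The helper name configure_langsmith, the "true" value for the tracing flag, and the example project name are assumptions for illustration; in a real setup the API key would come from a secret store or shell profile rather than source code.

```python
import os


def configure_langsmith(api_key: str, project: str) -> dict:
    """Set the LangSmith variables named in this article for the current process."""
    settings = {
        "LANGSMITH_TRACING_V2": "true",  # assumed value that switches tracing on
        "LANGSMITH_API_KEY": api_key,    # authenticates the application
        "LANGSMITH_PROJECT": project,    # groups traces under one project
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Hypothetical values; substitute your own key and project name.
    applied = configure_langsmith(api_key="<your-api-key>", project="rag-demo")
    print(sorted(applied))
```

Calling the helper at the very top of the application entry point keeps the configuration in one place, which makes it easier to point a workflow at a different project while experimenting.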
If you have been building anything non-trivial with Genkit, you have probably bumped into the same set of cross-cutting concerns over and over again: retrying transient model errors, falling back to a cheaper model when quota explodes, gating tool execution behind human approval, injecting filesystem access for coding agents, logging every request and response for observability... Until now, you ended up either wrapping ai.generate() calls by hand or writing ad-hoc helpers that ended up duplicated across flows. The new Genkit Middleware changes that. It introduces a first-class, composable middleware layer for the generate() pipeline, with hooks for the model, the tool execution, and the high-level generation loop, plus a small but very useful set of official middlewares published in the brand new @genkit-ai/middleware package. This article is a practical tour of what the new middleware system gives you, the built-in middlewares you can drop in today, and how to write your own with generateMiddleware. The official documentation lives at Genkit Middleware. All examples below assume the JavaScript/TypeScript SDK. A quick reminder: although this article focuses on the JS/TS middleware API, Genkit is a multi-language framework. The official SDKs cover JavaScript/TypeScript (primary, stable), Go, Python (preview) and Dart/Flutter (preview), and there is a community-maintained Java SDK used in production. The middleware concepts described here are JS/TS-specific today, but the underlying generate() pipeline exists across all SDKs and the same patterns are landing on the other runtimes. What Is Middleware in Genkit? 
Conceptually, Genkit middleware behaves like the middleware you already know from Express or Koa, only applied to the LLM lifecycle instead of HTTP requests: A generate() call is intercepted before it reaches the model. Each middleware can inspect or modify the request, decide whether to call next(), and inspect or modify the response on the way back. Multiple middlewares can be chained. They run in the order they are declared and unwind in reverse order, exactly like an onion. What makes Genkit's design interesting is that it does not give you a single chokepoint; it gives you three orthogonal interception phases: model – wraps the call to the underlying model. Perfect for retries, fallbacks, request/response logging, or response transformations. tool – wraps tool execution. Ideal for approvals, sandboxing, audit logs, or input/output validation. generate – wraps the whole high-level generation loop (prompting, tool calling, output parsing). Best for things like injecting tools or system instructions before the loop starts. You opt in per call via a use: array, which keeps things explicit and avoids global side effects: JavaScript const response = await ai.generate({ model: googleAI.model('gemini-flash-latest'), prompt: 'Hello', use: [retry({ maxRetries: 3 }), loggerMiddleware({ verbose: true })], }); Installation The official middlewares ship in their own package, decoupled from the Genkit core: Shell npm install @genkit-ai/middleware # or pnpm add @genkit-ai/middleware You still need genkit itself and a model provider plugin (for example @genkit-ai/google-genai). The Built-In Middleware Catalog Let's go through the five middlewares the Genkit team ships out of the box. filesystem: Give the Model a Sandboxed File System filesystem injects a standard set of file manipulation tools (list_files, read_file, write_file, search_and_replace) into the generation loop, restricted to a root directory of your choice.
JavaScript import { genkit } from 'genkit'; import { googleAI } from '@genkit-ai/google-genai'; import { filesystem } from '@genkit-ai/middleware'; const ai = genkit({ plugins: [googleAI()] }); const response = await ai.generate({ model: googleAI.model('gemini-flash-latest'), prompt: 'Create a hello world Node app in the workspace', use: [ filesystem({ rootDirectory: './workspace', allowWriteAccess: true, }), ], }); Useful options: rootDirectory (required) – sandbox root, all paths are confined to it. allowWriteAccess – defaults to false. Read-only by default is a sane choice for safety. toolNamePrefix – namespace the injected tools to avoid collisions with your own. This is essentially the building block for a "coding agent" pattern, without you having to write tool definitions or path validation logic. skills: Auto-Load Markdown Skills as System Context skills scans a directory for SKILL.md files (plus their YAML frontmatter), injects relevant ones into the system prompt, and exposes a use_skill tool the model can call when it needs more specific guidance. JavaScript import { skills } from '@genkit-ai/middleware'; const response = await ai.generate({ prompt: 'How do I run tests in this repo?', use: [skills({ skillPaths: ['./skills'] })], }); Think of it as a lightweight, file-based knowledge layer: every skill is a self-contained Markdown file with metadata, and the middleware decides when to surface them. It is a really clean alternative to ad-hoc system prompt soup. toolApproval: Human-in-the-loop for Tool Calls toolApproval enforces an allowlist of tools the model is allowed to execute autonomously. Anything outside the list raises a ToolInterruptError, so you can pause execution, ask the user, and resume.
JavaScript import { genkit, restartTool } from 'genkit'; import { toolApproval } from '@genkit-ai/middleware'; const response = await ai.generate({ prompt: 'write a file', tools: [writeFileTool], use: [toolApproval({ approved: [] })], // empty list -> always interrupt }); if (response.finishReason === 'interrupted') { const interrupt = response.interrupts[0]; // ... ask the user, then mark the tool call as approved const approvedPart = restartTool(interrupt, { toolApproved: true }); const resumedResponse = await ai.generate({ messages: response.messages, resume: { restart: [approvedPart] }, use: [toolApproval({ approved: [] })], }); } This is exactly the pattern you want for any agent that touches the real world (filesystem writes, payments, sending emails). No more home-grown approval flags scattered across the codebase. retry: Exponential Backoff With Jitter for Transient Errors The retry middleware retries failed model calls on transient status codes (UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED, INTERNAL) using exponential backoff with jitter. JavaScript import { retry } from '@genkit-ai/middleware'; const response = await ai.generate({ model: googleAI.model('gemini-pro-latest'), prompt: 'Heavy reasoning task...', use: [ retry({ maxRetries: 3, initialDelayMs: 1000, backoffFactor: 2, }), ], }); Knobs you actually care about: maxRetries (default 3); statuses, which status codes to retry on; initialDelayMs / maxDelayMs / backoffFactor; noJitter, if you really want deterministic delays. This is one of those things every team writes once, badly. Having it in the framework is a very welcome change. fallback: Gracefully Degrade to a Different Model fallback switches to an alternate model when the primary one fails on configurable status codes.
The classic use case is "try Pro first, fall back to Flash when quota is exhausted": JavaScript import { fallback } from '@genkit-ai/middleware'; const response = await ai.generate({ model: googleAI.model('gemini-pro-latest'), prompt: 'Try the pro model first...', use: [ fallback({ models: [googleAI.model('gemini-flash-latest')], statuses: ['RESOURCE_EXHAUSTED'], }), ], }); You can chain multiple fallback models, and isolateConfig lets you decide whether the fallback inherits the original request configuration or starts clean (handy when the fallback model does not support the same options as the primary). Building Your Own Middleware With generateMiddleware The same primitive that powers all the built-ins is exposed for you. The generateMiddleware helper gives you typed config schemas (via Zod) and access to the ai instance. Here is the canonical "logger" example, straight from the docs but lightly annotated: JavaScript import { generateMiddleware, z } from 'genkit'; export const loggerMiddleware = generateMiddleware( { name: 'loggerMiddleware', description: 'Logs requests and responses', configSchema: z.object({ verbose: z.boolean().optional(), }), }, ({ config, ai }) => { return { // Phase 1: intercept the model call model: async (req, ctx, next) => { if (config?.verbose) { console.log('Request:', JSON.stringify(req)); } const resp = await next(req, ctx); if (config?.verbose) { console.log('Response:', JSON.stringify(resp)); } return resp; }, // You could also add `tool: ...` and `generate: ...` hooks here. 
}; } ); Using it is identical to the official ones: JavaScript const response = await ai.generate({ model: googleAI.model('gemini-flash-latest'), prompt: 'Hello', use: [loggerMiddleware({ verbose: true })], }); A few patterns I have found very useful: PII redaction – implement a model hook that scrubs the request prompt and the response text against a regex/dictionary, returning the cleaned version. Cost accounting – wrap the model hook to read usage tokens from the response, and emit them to your metrics backend tagged by user/feature. Per-tenant quotas – use the generate hook to check a counter (Redis, Firestore...) before calling next(); throw your own custom error if the tenant is over quota. Caching – keyed on a hash of the model + request, return a cached response if hit, otherwise call next() and persist the result. For more inspiration, the source of the official middlewares is open in the Genkit GitHub repository, and reading them is genuinely educational. Composition: Stacking Middlewares Middlewares compose in array order. A reasonable production stack might look like this: JavaScript const response = await ai.generate({ model: googleAI.model('gemini-pro-latest'), prompt: userPrompt, tools: myTools, use: [ loggerMiddleware({ verbose: false }), // outermost: see everything retry({ maxRetries: 3 }), // recover from transient failures fallback({ // degrade if Pro is overloaded models: [googleAI.model('gemini-flash-latest')], statuses: ['RESOURCE_EXHAUSTED'], }), toolApproval({ approved: ['searchDocs'] }), // gate dangerous tools ], }); The order matters: outer middlewares see the result of the inner ones. Put logging on the outside if you want it to record the final state after retries and fallbacks; put it on the inside if you want to see every individual model attempt. The Importance of Middleware for Production Agents Genkit Middleware is one of those features that does not look flashy in a changelog but quietly fixes a lot of real-world friction.
It pushes Genkit closer to a "batteries-included" framework for production agents: Cross-cutting concerns are no longer copied and pasted across flows. Safety-critical behavior (approvals, sandboxes, fallbacks) is declarative. The model / tool / generate split gives you precise control without forcing you to monkey-patch. The middleware contract is small enough that the community can ship plugins that interoperate. If you maintain any non-trivial Genkit application, the upgrade is a no-brainer. Drop in retry and fallback first; incidents caused by transient model errors and exhausted quota are the ones most likely to drop off quickly. Then start writing your own middlewares for the things that are unique to your domain. Conclusion Middleware turns Genkit's generate() from "a function you call" into "a pipeline you compose". The official @genkit-ai/middleware package covers the most common production needs (filesystem access, skills, tool approval, retries, fallbacks), and generateMiddleware makes writing your own a 20-line affair instead of a refactor. For the next steps, take a look at: Genkit Middleware documentation; Genkit middleware source on GitHub; Genkit flows (middleware composes especially well with typed flows); Tool calling and Interrupts (the foundation that toolApproval builds on). Happy hacking, and may your fallback models always be cheaper than your primary one.
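The onion ordering described under What Is Middleware in Genkit? (run in declaration order, unwind in reverse) can be illustrated without any Genkit code. The sketch below is a language-neutral model written in Python, not Genkit's implementation; compose, make_tracer, and fake_model are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List

Handler = Callable[[str], str]
Middleware = Callable[[str, Handler], str]


def _wrap(middleware: Middleware, next_handler: Handler) -> Handler:
    """Bind one middleware to the handler it should call as next()."""
    return lambda request: middleware(request, next_handler)


def compose(middlewares: List[Middleware], handler: Handler) -> Handler:
    """Wrap handler so the first middleware in the list is the outermost layer."""
    for middleware in reversed(middlewares):
        handler = _wrap(middleware, handler)
    return handler


def make_tracer(name: str, events: List[str]) -> Middleware:
    """Hypothetical middleware that records when it is entered and exited."""
    def tracer(request: str, next_handler: Handler) -> str:
        events.append(f"{name}:before")
        response = next_handler(request)
        events.append(f"{name}:after")
        return response
    return tracer


def run_demo() -> List[str]:
    events: List[str] = []

    def fake_model(request: str) -> str:
        # Stand-in for the real model call at the center of the onion.
        events.append("model")
        return request.upper()

    pipeline = compose(
        [make_tracer("logger", events), make_tracer("retry", events)],
        fake_model,
    )
    pipeline("hello")
    return events


if __name__ == "__main__":
    print(run_demo())
```

Because the first entry wraps everything after it, a logger listed first sees only the final outcome, while a logger listed last would sit next to the model and see each individual attempt.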
If you've been watching the open-source LLM space, you've probably noticed it's been a great couple of years. Llama, Mistral, Phi, Qwen: a whole zoo of models you can download and run on your own machine. Google's entry into that zoo is Gemma, and the fourth generation, Gemma 4 (released April 2, 2026), is the biggest leap yet: built from Gemini 3 research, multimodal (text + image + video + audio), 256K context, native function calling, configurable "thinking mode," and, finally, a clean Apache 2.0 license. In this post, we're going to: Understand what Gemma 4 actually is, with an architecture diagram; Get it running on your laptop with Ollama in about 5 minutes; Chat with it from the terminal; Send it an image and ask questions about it; Turn on thinking mode for harder problems; Call it from a Python script like a real API; Build a small project that glues it all together. No GPU rental, no API keys, no telemetry. Let's go. Heads up: This guide assumes zero ML background. If you can install software and run a terminal command, you can do this. What Is Gemma 4? Gemma is Google DeepMind's family of open-weight language models. "Open-weight" means the actual neural network weights (the giant matrices of numbers that make the model work) are freely downloadable. You can run them, modify them, fine-tune them, and ship them in your product. Gemma 4 brings several big changes over Gemma 3: Apache 2.0 license. Earlier Gemma releases used a custom license with a Prohibited Use Policy that made some enterprise legal teams nervous. Gemma 4 is plain Apache 2.0: unlimited commercial use, no MAU caps, no special permissions. This alone is a big deal for production deployments. Mixture-of-Experts. A new 26B MoE variant activates only ~4B parameters per token, giving you 13B-class quality at 4B-class cost. Thinking mode. A configurable reasoning mode where the model thinks step-by-step before answering. Toggle it on for hard problems, off for fast chat. Native function calling.
Built-in support for structured tool use: write an agent without needing prompt engineering hacks. More modalities. Image, video frames, and (on the smaller E2B/E4B models) native audio input. Native system prompt support, too. Bigger context. 128K on the small models, 256K on the larger ones. Model Sizes at a Glance
| Model | Disk (Ollama) | Active params | Total params | Multimodal | Context | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| E2B | ~7.2 GB | ~2B | ~2.3B | text + image + audio | 128K | Phones, edge devices, browser |
| E4B | ~9.6 GB | ~4B | ~4.5B | text + image + audio | 128K | Most laptops (the sweet spot) |
| 26B A4B (MoE) | ~18 GB | ~4B | 26B | text + image | 256K | Consumer GPUs, agentic workloads |
| 31B Dense | ~20 GB | 31B | 31B | text + image | 256K | Workstations, highest-quality answers |
Two naming notes worth understanding: E2B / E4B. The "E" stands for Effective parameters. These are dense edge-first models that use a trick called Per-Layer Embeddings (PLE, more on this below) to do more with fewer active parameters. 26B A4B. This is the Mixture-of-Experts model. 26B parameters total, but only ~4B "activate" per forward pass. Latency and cost behave like a 4B model; quality is closer to a 13B dense model. Caveat: you still need to load all 26B into memory. For most readers on a laptop, E4B is the right starting point. It runs comfortably on a 16 GB Mac or any modern dev machine. Gemma 4 vs. the Rest of the Open-Model Zoo (May 2026)
| Model | Sizes | Multimodal | Context | License |
| --- | --- | --- | --- | --- |
| Gemma 4 | E2B / E4B / 26B MoE / 31B | text + image + video + audio (small) | 128K / 256K | Apache 2.0 |
| Llama 4 | various | text + image | 128K+ | Llama community license |
| Qwen 3.5 | various | text + image | 128K+ | Apache 2.0 |
| DeepSeek V4 Flash | MoE | text | 128K | MIT |
Gemma 4's pitch: the only family that spans phones to servers under Apache 2.0, with multimodal and audio in the same release. The Architecture (in Plain English) You don't need this section to use Gemma 4, so feel free to skip to the install steps. But if you've ever wondered what's actually happening when a multimodal model "sees" and "hears," here it is. A few pieces worth understanding: Three input paths.
Text goes through a SentencePiece tokenizer (shared with Gemini). Images go through a vision encoder that handles variable aspect ratios and resolutions natively (no more square-only inputs like Gemma 3). On the E2B and E4B models, audio goes through a USM-style conformer encoder borrowed from Gemma 3n. All three paths produce tokens that get interleaved in a single stream, so you can freely mix text, images, and audio in any order in one prompt. Alternating local/global attention. Most layers only look at a sliding window of recent tokens (cheap). A subset of layers attends to the full context (expensive but rare). This is the standard trick for keeping the KV cache from blowing up at 256K context. Per-Layer Embeddings (PLE), the small-model secret. In a normal transformer, each token gets one embedding vector at input, and that's all the residual stream has to work with. PLE adds a parallel pathway: for each token, every layer gets its own small conditioning vector from a lookup table. The embedding tables are large (lots of memory), but the "active" parameters per token stay small. That's why a 4-billion-active-parameter E4B can punch above its weight. Mixture-of-Experts (26B A4B). The MoE layer has multiple "expert" feed-forward networks. A small router picks 2 of 8 (or similar) for each token. Total params = 26B (all loaded), active params per token = ~4B (only those fire). Pareto-optimal for quality-per-FLOP. Thinking mode. When you include the special <|think|> token at the start of the system prompt, the model emits internal reasoning between <|channel>thought\n...<channel|> markers before the final answer. Disable it for fast chat; enable it for math, code, and multi-step reasoning. That's most of what's worth knowing. Now let's actually run it. Step 1: Install Ollama There are a few ways to run Gemma 4 locally, but the easiest by a mile is Ollama.
Think of it as "Docker for LLMs": it handles downloading the model, managing memory, GPU acceleration, and exposing a local API. You don't have to think about CUDA versions or PyTorch. Install it: macOS / Windows: Download the installer at ollama.com/download and run it. Linux: Shell curl -fsSL https://ollama.com/install.sh | sh Verify: Shell ollama --version You should see a version number. Gemma 4 requires Ollama v0.20.0 or later. If you're on an older version, update first. Step 2: Pull a Gemma 4 Model Download the default (E4B, ~9.6 GB): Shell ollama pull gemma4 This downloads about 9.6 GB. Grab a coffee. Other sizes, if you want them: Shell ollama pull gemma4:e2b # ~7.2 GB: smallest, for low-RAM machines ollama pull gemma4:e4b # ~9.6 GB: the default; same as `gemma4` ollama pull gemma4:26b # ~18 GB: the MoE; 256K context ollama pull gemma4:31b # ~20 GB: biggest dense model Hardware reality check: On Apple Silicon, 16 GB unified memory handles E4B comfortably. NVIDIA users need the model to fit entirely in VRAM for GPU-accelerated inference. The 26B model fits on 24 GB but leaves very little headroom, so treat it as the ceiling, not the target. List what you've got: Shell ollama list Step 3: Chat With It in the Terminal Easiest possible test: Shell ollama run gemma4 You'll get an interactive prompt: Plain Text >>> Explain what a hash map is, like I'm a junior dev. Hit enter and watch it stream a response. To exit, type /bye. That's it. You're running a state-of-the-art LLM locally with zero cloud dependency. Try: "Write a Python function that finds duplicates in a list, with three different approaches and their tradeoffs." "What's the difference between TCP and UDP? Use an analogy." "Translate 'Where is the nearest train station?' into Japanese, Spanish, and Hindi." Step 4: Send It an Image Gemma 4 can see.
Drop any image file in your current directory, then: Shell ollama run gemma4 >>> Describe what's in this image: ./screenshot.png Ollama loads the image, sends it through the vision encoder, and the model answers. Unlike Gemma 3 (which resized everything to 896×896), Gemma 4 handles variable aspect ratios and resolutions natively, so tall screenshots, wide diagrams, and high-res photos all work without manual cropping. Try: "What error is shown in this screenshot?" (paste a stack trace) "What's the bounding box for the 'submit' button in this UI?" (Gemma 4 will answer in JSON, natively!) "Read the handwriting in this note and transcribe it." Step 5: Turn on Thinking Mode For harder problems (multi-step math, complex code, logic puzzles), turn on thinking mode. Include the <|think|> token at the very start of your system prompt: Shell ollama run gemma4 >>> /set system "<|think|>You are a careful, methodical assistant." >>> Three friends split a $73.42 dinner bill. Alice had a $12 appetizer, Bob had a $9 drink. The rest is shared. What does everyone pay? The model will emit its reasoning in a <|channel>thought\n...<channel|> block before the final answer. For fast chat, leave the token out, and the model answers directly. When to use it: Code generation, math, multi-hop reasoning, agentic planning: yes. Single-turn factual questions, summarization, translation: no, it just adds latency. Step 6: Call Gemma 4 From Python A chat prompt is nice, but you're a developer, so you want to call this thing from code. When Ollama is running, it exposes a local REST API on http://localhost:11434. There's also an official Python client. Install it: Shell pip install ollama Basic Chat Python import ollama response = ollama.chat( model="gemma4", messages=[ {"role": "system", "content": "You are a senior code reviewer. 
Be concise and direct."}, {"role": "user", "content": "Review this code:\n\ndef add(a, b):\n return a+b"}, ], ) print(response["message"]["content"]) Streaming Responses (ChatGPT-Style) Python import ollama stream = ollama.chat( model="gemma4", messages=[{"role": "user", "content": "Write a haiku about debugging."}], stream=True, ) for chunk in stream: print(chunk["message"]["content"], end="", flush=True) Sending an Image Python import ollama response = ollama.chat( model="gemma4", messages=[{ "role": "user", "content": "What's in this image?", "images": ["./my_photo.jpg"], }], ) print(response["message"]["content"]) Thinking Mode + Function Calling (the Agentic Combo) This is where Gemma 4 actually starts feeling like a "real" agent. You declare your tools as JSON schemas, the model decides when to call them, and you execute the call and pass results back. No prompt engineering hacks needed. Python import ollama tools = [{ "type": "function", "function": { "name": "get_weather", "description": "Get the current weather for a city.", "parameters": { "type": "object", "properties": { "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"}, }, "required": ["city"], }, }, }] def get_weather(city: str) -> str: # Pretend this hits a real API. 
return f"{city}: 22°C, partly cloudy" response = ollama.chat( model="gemma4", messages=[ {"role": "system", "content": "<|think|>You are a helpful weather assistant."}, {"role": "user", "content": "Should I bring an umbrella in Tokyo today?"}, ], tools=tools, ) # If the model wants to call a tool, execute it and feed the result back. # `or []` guards against tool_calls being None when no tool was requested. for tool_call in (response["message"].get("tool_calls") or []): name = tool_call["function"]["name"] args = tool_call["function"]["arguments"] if name == "get_weather": result = get_weather(**args) # Send result back for the model to finalize its answer followup = ollama.chat( model="gemma4", messages=[ {"role": "user", "content": "Should I bring an umbrella in Tokyo today?"}, response["message"], {"role": "tool", "content": result, "name": name}, ], ) print(followup["message"]["content"]) Raw HTTP (No Python Client Needed) For any other language: Shell curl http://localhost:11434/api/chat -d '{ "model": "gemma4", "messages": [{"role": "user", "content": "Hello!"}], "stream": false }' The same JSON shape works from Node, Go, Rust, your shell, or anything else that can make an HTTP request. A Small Project: Folder-Watching Image Describer Here's a useful ~30-line script. It watches a folder, and any new image dropped in gets automatically described by Gemma 4. Great for accessibility tools, content moderation prototypes, or just learning. 
Python import os, time import ollama WATCH_DIR = "./inbox" os.makedirs(WATCH_DIR, exist_ok=True) SEEN = set(os.listdir(WATCH_DIR)) print(f"Watching {WATCH_DIR}/: drop an image in to describe it.") print(" (Ctrl+C to stop)\n") IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".webp", ".gif") try: while True: current = set(os.listdir(WATCH_DIR)) new_files = sorted(current - SEEN) for filename in new_files: if not filename.lower().endswith(IMAGE_EXTS): continue path = os.path.join(WATCH_DIR, filename) print(f" New image: {filename}") response = ollama.chat( model="gemma4", messages=[{ "role": "user", "content": ( "Describe this image in 2-3 sentences. " "Mention any visible text. Be specific." ), "images": [path], }], ) print(f" → {response['message']['content']}\n") SEEN = current time.sleep(2) except KeyboardInterrupt: print("\n Stopped.") Run it, drag images into the inbox/ folder, and watch descriptions appear. That's a real, useful, completely local AI tool, written in about 30 lines. Things to Know Before Shipping Anything Serious A few honest caveats:
| Caveat | Why it matters |
| --- | --- |
| Hallucination | Local models still confidently make things up. Don't trust factual claims without verification. Thinking mode reduces this for reasoning tasks, but doesn't eliminate it. |
| CPU latency | Expect 1–3 tokens/sec on a CPU-only laptop with E4B. A GPU gives 3–10× speedup. |
| Context costs RAM | 256K context is real, but actually filling it eats memory. Most use cases need <16K tokens. |
| MoE memory | The 26B MoE runs fast (only 4B active per token), but you still need to load all 26B into RAM. Don't confuse active params with memory footprint. |
| Audio is small-model only | E2B/E4B have native audio input. The 26B and 31B models do not. |
| Apache 2.0 ≠ no responsibilities | The license is permissive, but you're still on the hook for safety, bias, and compliance in whatever you ship. |
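The MoE memory caveat comes down to simple arithmetic. As a rough sketch, assuming weights dominate memory and every parameter is stored at the same bit width (both simplifications, since real model files also carry embedding tables, metadata, and runtime KV cache), the footprint scales with total parameters rather than active ones. The function name and the 4-bit width are illustrative assumptions, not published figures for any Gemma 4 build.

```python
def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a parameter count."""
    return params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9   # 26B A4B: every expert has to be resident in memory
ACTIVE_PARAMS = 4e9   # roughly what participates in one forward pass

if __name__ == "__main__":
    loaded = weight_memory_gb(TOTAL_PARAMS, 4)   # what you must fit in RAM/VRAM
    active = weight_memory_gb(ACTIVE_PARAMS, 4)  # what drives per-token compute
    print(f"loaded: {loaded:.1f} GB, touched per token: {active:.1f} GB")
```

The gap between the two numbers is the point of the caveat: speed follows the active count, while the hardware requirement follows the total.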
References and Further Reading Gemma 4 announcement, Google blog: The launch post (April 2, 2026). Gemma 4 model overview, Google AI for Developers: Official docs covering sizes, capabilities, and hardware requirements. Welcome Gemma 4, Hugging Face blog: Best technical write-up; covers PLE, MoE, USM audio encoder, benchmarks, and code samples. Gemma 4 model card on Hugging Face: E4B instruct model weights and configuration. Gemma 4 Complete Guide 2026, dev.to: Community guide with architecture details and competitor comparisons. SigLIP (Zhai et al., 2023): The vision encoder family Gemma's image path builds on. Mixture-of-Experts (Shazeer et al., 2017): The original sparsely-gated MoE paper. The 26B A4B is a direct descendant. Switch Transformer (Fedus et al., 2021): Modern MoE techniques. Llama 4: Meta's competing open-weight family.
In this article, we will dive deep into actors, nonisolated methods, @MainActor and other global actors, and the concept of actor reentrancy. We will also explore what happens behind the scenes in the Swift concurrency runtime, including jobs, executors, workers, and schedulers, so you can understand not just how to use these tools, but why they work the way they do. Whether you’re already using Swift’s async/await features or just starting to explore concurrency, this guide will give you a solid understanding of the mechanisms that keep your concurrent code safe and efficient. Actors and Isolation in Swift Concurrency If you’ve spent years working with Grand Central Dispatch (GCD), you already know the core problem: shared mutable state. When multiple threads can read and write the same data at the same time, you risk data races: inconsistent reads, lost updates, or crashes that only appear under heavy load. With GCD, we relied on discipline, using serial queues or locks. But discipline fails. One forgotten .sync call and your correctness vanishes. Swift concurrency introduces Actors to make data-race freedom a language-level guarantee. Class vs. Struct vs. Actor
| Type | Semantics | Thread Safety | Mutation Model |
| --- | --- | --- | --- |
| Struct | Value | By-copy safe | Explicit mutating |
| Class | Reference | Unsafe by default | Shared mutable state |
| Actor | Reference | Data-race safe | Serialized access |
Actors sit exactly where classes used to be, but with correctness guarantees. Actor Basics An actor is a reference type that protects its mutable state through isolation. Unlike a class, you cannot accidentally touch an actor’s internal state from multiple threads. 
Swift actor BankStore { private var balance: Int = 0 func deposit(_ amount: Int) { balance += amount } func withdraw(_ amount: Int) -> Bool { guard balance >= amount else { return false } balance -= amount return true } } Key properties of actors: Reference semantics. Only one task at a time can access actor-isolated state. External access requires await. nonisolated: Opting Out of Isolation Sometimes you need functionality that doesn’t touch the actor’s state or needs to be callable synchronously. Use the nonisolated keyword for these “pure” utilities. Swift actor ImageCache { nonisolated static let maxItems = 100 nonisolated func cacheKey(for url: URL) -> String { url.absoluteString } } Rule of thumb: if it reads or writes actor state, it should not be nonisolated. The Actor Model: The Mailbox Mental Model Think of an actor as having a mailbox: Each actor has a queue of pending work. Messages (calls) are enqueued as tasks. The actor processes these one at a time. When you write await store.deposit(50), you aren’t calling a function in the traditional sense. You are sending a message to the actor and suspending your current task (not the underlying thread, which is free to run other work) until the actor finishes processing that message. This is why await is mandatory: the actor might be busy with someone else’s request. Working With @MainActor and Other Global Actors When building scalable iOS applications, managing shared state across isolated domains like UI components, network layers, and local caches becomes a complex puzzle. Swift simplifies this with the @globalActor attribute. A global actor is essentially a singleton actor. It allows you to isolate state and operations globally without needing to pass an actor reference around your entire dependency graph. The most famous of these is, of course, the @MainActor. The @MainActor is uniquely tied to the main thread. Anything marked with this attribute is guaranteed to execute on the main thread, making it the bedrock for all UI updates. 
Swift
import Combine

@MainActor
final class FlashcardViewModel: ObservableObject {
    @Published var currentCard: Card?

    func loadNextCard() async {
        // Safe to update UI state directly; we are isolated to the MainActor.
        self.currentCard = await fetchCard()
    }
}

However, the power of global actors isn’t limited to the main thread. You can define your own global actors to serialize access to highly contested shared resources, such as a centralized local database or an aggressive retry policy manager.

Swift
@globalActor
public actor SyncActor {
    public static let shared = SyncActor()
}

@SyncActor
final class OfflineSyncManager {
    var pendingMutations: [Mutation] = []

    func queue(mutation: Mutation) {
        pendingMutations.append(mutation)
    }
}

By annotating OfflineSyncManager with @SyncActor, you guarantee that all accesses to pendingMutations are serialized on that specific actor’s executor, eliminating data races between different parts of your app trying to queue offline changes simultaneously.

Actor Reentrancy Explained

If you’re coming from the world of Grand Central Dispatch (GCD) and DispatchQueue, actors require a fundamental mental shift. A serial dispatch queue executes tasks strictly one after another. If a task is running, nothing else can run on that queue until it finishes.

Swift actors are different: they are reentrant.

Reentrancy means that while an actor guarantees mutual exclusion for synchronous code execution (only one task can be running inside the actor at a time), it explicitly allows other tasks to interleave at suspension points.

When an actor-isolated method reaches an await, the current task suspends. Crucially, it also gives up its hold on the actor’s executor. During this suspension, the actor is free to pick up and execute other pending tasks. Once the awaited operation finishes, the original task is scheduled to resume on the actor when it’s free again.

This design prevents deadlocks.
If actors weren’t reentrant, two actors awaiting each other would deadlock and freeze your application. However, reentrancy introduces its own subtle class of concurrency bugs.

The Hidden Risks of Suspending Inside Actor Methods

Because the actor unblocks during an await, the state of your actor before the await might not match the state after the await. This is the single biggest trap engineers fall into when adopting Swift concurrency.

Imagine implementing a session manager that fetches a fresh authentication token. If multiple requests fail and trigger a token refresh simultaneously, you might accidentally fire off multiple network requests if you don’t account for reentrancy.

Swift
actor SessionManager {
    private var cachedToken: String?

    func getValidToken() async throws -> String {
        // 1. Check local state
        if let token = cachedToken { return token }

        // 2. Suspend! The actor is now free to process other calls to `getValidToken()`
        let freshToken = try await performNetworkRefresh()

        // 3. State mutation.
        // DANGER: If another task interleaved during step 2, we might overwrite a valid token,
        // or we just unnecessarily performed multiple network requests.
        self.cachedToken = freshToken
        return freshToken
    }
}

To protect against this, you must rethink how you handle in-flight asynchronous operations. Instead of caching just the result, you often need to cache the Task itself.

Swift
actor SessionManager {
    private var cachedToken: String?
    private var refreshTask: Task<String, Error>?

    func getValidToken() async throws -> String {
        if let token = cachedToken { return token }

        // Return the in-flight task if one exists
        if let existingTask = refreshTask {
            return try await existingTask.value
        }

        // Otherwise, create a new task and cache IT immediately
        let task = Task {
            // Clean up on success and on failure, so a failed refresh
            // does not stay cached and fail every later caller.
            defer { self.refreshTask = nil }
            let freshToken = try await performNetworkRefresh()
            self.cachedToken = freshToken
            return freshToken
        }
        self.refreshTask = task
        return try await task.value
    }
}

Always remember: any assumption you made about your actor’s state before an await may no longer hold after it.

Inside the Swift Concurrency Runtime

To truly master structured concurrency, we need to step out of the syntax and into the engine room. Swift’s concurrency model isn’t just syntactic sugar over GCD; it is a bespoke, highly optimized runtime built around a cooperative thread pool.

Understanding Jobs

In the Swift runtime, a Job is the fundamental unit of schedulable work. When you write an async function, the compiler breaks your function down into partial tasks or “continuations” split at every potential suspension point (each await). Each of these segments is wrapped into a Job.

When a task suspends, the current Job finishes. When the awaited result is ready, a new Job is enqueued to resume the remainder of the function. Jobs are lightweight, heavily optimized, and managed entirely by the Swift runtime.

How Executors Work

If Jobs are the work, Executors are the environments where the work is allowed to happen. An executor defines the execution semantics for a set of Jobs.

Every actor has a serial executor. This executor acts as a funnel, ensuring that only one Job associated with that actor runs at any given time. When you call an actor method, you are submitting a Job to that actor’s executor.

Custom Serial Executors (Actor Level)

In the first example, we create a MainQueueExecutor conforming to SerialExecutor.
This is particularly useful when you have a legacy codebase heavily dependent on a specific DispatchQueue and you want to wrap that logic into a modern Actor.

Swift
import Dispatch

final class MainQueueExecutor: SerialExecutor {
    func enqueue(_ job: consuming ExecutorJob) {
        let unownedJob = UnownedJob(job)
        let unownedExecutor = asUnownedSerialExecutor()
        // Run every job for this executor on the (serial) main queue.
        DispatchQueue.main.async {
            unownedJob.runSynchronously(on: unownedExecutor)
        }
    }

    func asUnownedSerialExecutor() -> UnownedSerialExecutor {
        UnownedSerialExecutor(ordinary: self)
    }
}

@globalActor
actor CustomGlobalActor: GlobalActor {
    static let sharedUnownedExecutor = MainQueueExecutor()
    static let shared = CustomGlobalActor()

    nonisolated var unownedExecutor: UnownedSerialExecutor {
        Self.sharedUnownedExecutor.asUnownedSerialExecutor()
    }
}

Task Executors (Task Level)

While a SerialExecutor protects an actor’s state, a TaskExecutor influences the “ambient” environment where a task and its children run. It doesn’t provide serial isolation; it provides a preferred execution location.

Swift
import Dispatch

// Note: this reuses the name MainQueueExecutor from the previous example;
// rename one of the two if both live in the same module.
final class MainQueueExecutor: TaskExecutor {
    func enqueue(_ job: consuming ExecutorJob) {
        let unownedJob = UnownedJob(job)
        self.enqueue(unownedJob)
    }

    func enqueue(_ job: UnownedJob) {
        let unownedExecutor = asUnownedTaskExecutor()
        DispatchQueue.main.async {
            job.runSynchronously(on: unownedExecutor)
        }
    }

    func asUnownedTaskExecutor() -> UnownedTaskExecutor {
        UnownedTaskExecutor(ordinary: self)
    }
}

let executor = MainQueueExecutor()
Task.detached(executorPreference: executor) {
    // TODO: Perform an async operation
}

What Workers Do

Executors don’t magically run code; they need CPU threads. This is where Workers come in.

In Swift concurrency, there is a global, cooperative thread pool. The threads inside this pool are the “workers.” Unlike GCD, which can spawn hundreds of threads, leading to thread explosion and massive memory overhead, the Swift thread pool is limited, generally to the number of active CPU cores.
However, this isn’t a hard-and-fast rule; there are specific cases where the pool may spawn more threads. We took a deep dive into this behavior in the article Swift Concurrency: Part 1.

Workers ask executors for Jobs. When a worker thread picks up a Job from an executor, it executes it until completion or suspension. Because the number of workers is limited, the runtime depends on a strict rule: you must never use blocking APIs (like semaphores or synchronous network calls) inside an async context. If you block a worker thread, you take that thread away from the concurrency runtime for as long as the call blocks, and with only a handful of workers that can stall unrelated tasks.

The Role of Schedulers

The Scheduler is the invisible conductor orchestrating this entire process. It decides which Jobs sit on which Executors, and which Workers get assigned to process them.

The scheduler is highly priority-aware. When you spawn a Task(priority: .userInitiated), the scheduler ensures the resulting job jumps ahead of background jobs in the queue. It handles the complex logic of priority inversion avoidance, waking up worker threads, and balancing the load across the CPU.

Types of Executors and How They’re Chosen

Swift utilizes different types of executors depending on the context of your code:

- The global concurrent executor: If your code is not isolated to any actor (e.g., a detached task or a standalone async function), it runs on the default global concurrent executor. This executor distributes Jobs freely across all available workers in the cooperative thread pool.
- The main actor executor: This is a specialized serial executor permanently bound to the application’s main thread. The scheduler ensures that any Job submitted here is handed off to the main runloop.
- Default serial executors: Every standard actor you create gets its own default serial executor. The runtime dynamically maps this executor to any available worker thread in the pool as needed.
- Custom executors (Swift 5.9+): Advanced use cases might require overriding how an actor executes its jobs.
By implementing the SerialExecutor protocol, you can create custom executors, for instance, to force an actor to run its jobs on a specific, legacy DispatchQueue to interoperate with older C++ or Objective-C codebases.

How the Runtime Chooses an Executor

Understanding that executors exist is one thing; predicting exactly where your code will run is another. When a Job is ready to execute, the Swift runtime evaluates a decision tree to route that workload.

Here is the sequence of questions the runtime works through to select an executor:

Is the method isolated (i.e., is it bound to a specific actor)?

- No (non-isolated): Is there a preferred Task executor?
  - Yes: The task executes on the Preferred Task Executor.
  - No: The task executes on the standard Global Concurrent Executor.
- Yes (actor-isolated): Does the actor provide its own custom executor?
  - Yes: The task executes strictly on the Actor’s Custom Executor.
  - No: Does the current Task have a preferred executor?
    - Yes: The task executes on the Preferred Task Executor (while still strictly upholding the actor’s serial isolation).
    - No: The task executes on the Default Actor Executor.

This cascading logic ensures that actors maintain their state safety while allowing developers to influence the underlying execution environment when necessary.

Inspecting Your Context: The #isolation Macro

When dealing with deep call stacks and complex async boundaries, you might lose track of your current execution context. Swift 6.0 introduced a diagnostic tool to solve this: the #isolation macro (SE-0420).

The macro expands at compile time to an expression that captures the actor isolation of the context in which it appears. Its type is (any Actor)?: the actor you are currently isolated to, or nil if the context is nonisolated. Because the expansion reflects the place where the macro is written, a helper function should accept it as a default argument so that it reports the caller’s isolation rather than its own.

Swift
func debugCurrentContext(isolation: isolated (any Actor)? = #isolation) {
    // Prints the actor the caller is isolated to (like MainActor), or "no isolation".
    print(isolation.map { "\($0)" } ?? "no isolation")
}

Sprinkling this into your logging infrastructure is invaluable when debugging data races or verifying that a heavy computation isn’t accidentally blocking the @MainActor.

Task Executors vs. Actor Executors

With recent advancements in Swift Evolution (specifically SE-0417 and SE-0392), developers now have the ability to provide custom executors. However, to wield this power safely, you must understand the difference between the two primary executor protocols: TaskExecutor and SerialExecutor (the protocol behind actor executors).

What is a Task Executor?

A Task Executor governs the execution environment for a specific Task hierarchy. Crucially, a Task Executor is not expected to be serial: it typically represents a thread pool or a concurrent queue where multiple jobs can be processed simultaneously. When you assign a preferred Task Executor, you are telling the runtime, “Unless an actor says otherwise, run the asynchronous work for this task hierarchy over here.”

What is an Actor Executor?

An Actor Executor (which conforms to the SerialExecutor protocol) governs the execution environment for a specific actor instance. Unlike a Task Executor, an Actor Executor is strictly serial. It processes one job at a time, enforcing the mutual exclusion that makes actors safe from data races.

The Danger of Custom Implementations

Understanding the concurrent nature of Task Executors and the serial nature of Actor Executors is not just trivia; it is a strict runtime invariant.
If you decide to write a custom executor (for example, wrapping an old C++ thread pool or a specific Grand Central Dispatch queue), you carry the burden of upholding these invariants:

- If you implement a SerialExecutor for an actor, but your underlying implementation accidentally allows concurrent execution, you will break the actor’s state isolation and introduce data races that are extremely hard to reproduce.
- Conversely, if you implement a TaskExecutor but back it with a serial queue, you funnel work that the runtime expects to run in parallel through a single lane, which risks bottlenecks and unexpected deadlocks across your async task hierarchies.

The compiler cannot verify these semantic guarantees; it trusts you to maintain them. If you break them, the concurrency model shatters.

Conclusion

Swift concurrency is more than syntactic sugar for asynchronous code. It is a carefully designed execution model that formalizes how work is scheduled, isolated, and resumed. Actors provide safety guarantees, but understanding reentrancy and executor behavior is what allows engineers to reason about concurrency with confidence.

By understanding these low-level mechanics, from when an actor temporarily releases isolation to how the runtime schedules jobs across worker threads, you can build iOS applications that are not only performant but also resilient to the subtle concurrency bugs that once plagued asynchronous systems.
Testing is an essential step in the API development process to ensure that APIs are working correctly. There are multiple HTTP methods in RESTful APIs, including POST, GET, PUT, PATCH, and DELETE. In our earlier articles, we learned how to perform automated testing of POST, PUT, and GET APIs using REST-Assured Java.

In this tutorial article, we will cover the following points:

- What is a PATCH API request?
- How to test PATCH API requests using REST-Assured Java?

What Is a PATCH API Request?

A PATCH request is used to update a resource partially. While it is similar to a PUT request, the key difference is that PUT requires the entire request body to be sent, whereas PATCH allows you to send only the specific fields that need to be updated.

Let’s take an example of the following PATCH API that we’ll be using in this tutorial for demonstration purposes:

PATCH (/partialUpdateOrder/{id})

This API endpoint partially updates the existing order in the system as per the provided Order ID.

To update an existing order, this API requires the order ID as a path parameter so it knows which record to modify. The updated details should be provided in JSON format in the request body. Since this is a PATCH request, there’s no need to send the entire payload. Only the fields that need to be updated should be included in the request body.
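To make the distinction concrete, here is a minimal sketch in plain Java that simulates how a server might apply the two kinds of request to a stored order. The class name PatchVsPutSketch, the helpers applyPatch, applyPut, and sampleOrder, and the stored order values are all hypothetical. A real API may implement PATCH differently (for example, with JSON Merge Patch or JSON Patch), so treat this as an illustration of the idea rather than the behavior of the demo API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PatchVsPutSketch {

    // PUT semantics: the request body replaces the stored resource entirely.
    static Map<String, Object> applyPut(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    // PATCH semantics (simple merge): only keys present in the body are overwritten.
    static Map<String, Object> applyPatch(Map<String, Object> stored, Map<String, Object> body) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(body);
        return merged;
    }

    // Hypothetical stored order, used only for illustration.
    static Map<String, Object> sampleOrder() {
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("id", 2);
        order.put("product_id", "101");
        order.put("product_name", "Old Product");
        order.put("qty", 1);
        return order;
    }

    public static void main(String[] args) {
        // The request body carries a single field.
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("product_name", "Sleek Silk Plate");

        // PATCH keeps id, product_id, and qty; PUT drops everything not in the body.
        System.out.println("PATCH -> " + applyPatch(sampleOrder(), body));
        System.out.println("PUT   -> " + applyPut(sampleOrder(), body));
    }
}
```

Sending the same one-field body through both helpers shows why omitting fields is harmless with PATCH but can lose data with PUT.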
Difference Between PATCH and PUT APIs

The following table shows the difference between the PATCH and PUT APIs:

Criteria | PATCH | PUT
Purpose | Partially updates a resource | Replaces the entire resource
Request Body | Only includes fields that need to be updated | Requires the full resource representation
Data Sent | Only changed fields | Entire data payload
Idempotency | Not always idempotent | Always idempotent
Use Case | Updating specific fields | Replacing an entire record
Risk of Data Loss | Low, as the unchanged fields remain intact | High; if some fields are omitted, they may be overwritten or removed

How to Test PATCH APIs Using REST-Assured Java

Let’s use the PATCH API /partialUpdateOrder/{id} and update an existing order partially in the system.

Test scenario:

Markdown
## Test Scenario
Title: Partially update an existing order in the system.

## Pre-condition:
Valid orders are available in the system

## Test
1. Update the product_name and product_id for the order ID - 2
2. Verify that the Status Code 200 is returned in the response.
3. Assert that the order details have been updated correctly.

Test Implementation

The PATCH API is protected with authentication, so we need an authentication token to access it successfully.

To implement this test scenario, we’ll have to:

1. Write a test to hit the Authorization API, generate and extract the token.
2. Use the token generated in the first step and hit the PATCH API to update the order partially.

Step 1: Write a test to hit the Authorization API, generate and extract the token.

The POST /auth API endpoint should be hit with the following valid credentials to generate the token.
JSON { "username": "admin", "password":"secretPass123" } It returns the following response: JSON { "message": "Authentication Successful!", "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImFkbWluIiwiaWF0IjoxNzc1NjMzMDU5LCJleHAiOjE3NzU2MzY2NTl9.jHQwCyts9IejhwKGAZEm4Uyo9dgu5Kpe4OjTiYw1dm8" } The following test method is created to execute the POST /auth API request and extract the token from the response. Java @Test public void testTokenGeneration () { String requestBody = """ { "username": "admin", "password": "secretPass123" }"""; token = given ().contentType (ContentType.JSON) .when () .body (requestBody) .post ("http://localhost:3004/auth") .then () .statusCode (201) .and () .body ("token", notNullValue ()) .extract () .p The testTokenGeneration() test sends a POST request with login credentials to generate an authentication token using REST Assured. It verifies that the response returns a 201 status code and checks that a token is included in the response. Once the token is received, it’s extracted and stored in a global variable called token, so it can be reused across other test cases. Step 2: Partially updating the record with a PATCH API request. In this step, let’s add a new test method, testPartialUpdateOrder(), that sends a partial update request using the PATCH endpoint. The request body needs to be constructed with the required fields,i.e., product_id and product_name. We’ll use the Google Gson and Datafaker library to generate the request body. Java public class TestPatchRequestExamples { private String token; @Test public void testPartialUpdateOrder () { Faker faker = new Faker (); String productId = String.valueOf (faker.number () .numberBetween (1, 2000)); String productName = faker.commerce () .productName (); JsonObject orderDetail = new JsonObject (); orderDetail.addProperty ("product_id", productId); orderDetail.addProperty ("product_name", productName); //.. 
}

This piece of code uses the Faker class from the Datafaker library and generates a random value for the product_id and product_name fields. The JSON object required for the request body is generated using the JsonObject class of the Google Gson library.

The following request body is generated using this code:

JSON
{"product_id":"702","product_name":"Sleek Silk Plate"}

Next, let’s write the automated test to update the record using the PATCH API endpoint.

Java
@Test
public void testPartialUpdateOrder () {
    int orderId = 2;
    //..
    given ().contentType (ContentType.JSON)
        .header ("Authorization", token)
        .when ()
        .log ()
        .all ()
        .body (orderDetail.toString ())
        .patch ("http://localhost:3004/partialUpdateOrder/" + orderId)
        .then ()
        .log ()
        .all ()
        .statusCode (200)
        .and ()
        .assertThat ()
        .body ("message", equalTo ("Order updated successfully!"), "order.product_id", equalTo (productId),

This test sends a PATCH API request for the order ID 2. The request body we created earlier is included in the request, containing only the fields that need to be updated.

- given().contentType(ContentType.JSON): It specifies that the request body will be in JSON format.
- .header("Authorization", token): It adds the authentication token to the request header, which is required to authorize the API call.
- .when().log().all(): This statement starts the request execution and logs all request details (headers, body, etc.).
- .body(orderDetail.toString()): It sets the request payload. The orderDetail JSON object (created earlier) contains only the fields that need to be updated.
- .patch("http://localhost:3004/partialUpdateOrder/" + orderId): It sends the PATCH request to update the order partially with the specified order ID.
- .then().log().all(): It logs the full response for better visibility of the test execution.
- .statusCode(200): It verifies that the API request was sent and the API responded with a 200 OK status.
- .and().assertThat().body(…): It performs multiple assertions on the response body as shown below:
  - The value of the "message" field should be "Order updated successfully!"
  - The value of the "product_id" and "product_name" fields in the order object should be the same as supplied in the request.

Using a dynamic approach to generate the request body with Datafaker helps eliminate hard-coded test data and promotes better reusability across test cases.

Check out this tutorial for more information related to response verification.

Test Execution

As we discussed in the earlier tutorial on testing PUT API requests with REST Assured, we need to follow the same approach: generate the token first, then use it to hit the PATCH API request.

Let’s create the following testng.xml file for executing the tests sequentially:

XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Restful ECommerce Test Suite">
    <test name="Restful ECommerce End to End tests">
        <classes>
            <class name="restfulecommerce.tutorial.TestPatchRequestExamples">
                <methods>
                    <include name="testTokenGeneration"/>
                    <include name="testPartialUpdateOrder"/>
                </methods>
            </class>
        </classes>
    </test>
</suite>

The following screenshot of test execution shows that the tests were executed successfully and the order was partially updated.
The following log was printed in the console after test execution, showing the request and the response details: Plain Text Request method: PATCH Request URI: http://localhost:3004/partialUpdateOrder/2 Proxy: <none> Request params: <none> Query params: <none> Form params: <none> Path params: <none> Headers: Authorization=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImFkbWluIiwiaWF0IjoxNzc1NjMyMDEzLCJleHAiOjE3NzU2MzU2MTN9.A10-amp24LKDrDKrRJ6BW1KKtkVLQ-QK71U_Jl1ctDs Accept=*/* Content-Type=application/json Cookies: <none> Multiparts: <none> Body: { "product_id": "702", "product_name": "Sleek Silk Plate" } HTTP/1.1 200 OK X-Powered-By: Express Content-Type: application/json; charset=utf-8 Content-Length: 188 ETag: W/"bc-NGDglqodj+ZJKoZsbosa9746aT0" Date: Wed, 08 Apr 2026 07:06:54 GMT Connection: keep-alive Keep-Alive: timeout=5 { "message": "Order updated successfully!", "order": { "id": 2, "user_id": "1", "product_id": "702", "product_name": "Sleek Silk Plate", "product_amount": 750, "qty": 1, "tax_amt": 7.99, "total_amt": 757.99 } } It can be seen that the two fields product_id and product_name have randomly generated values and are sent in the request along with the order ID 2. In the response, a 200 OK status code is returned along with the full response body showing the same product_id and product_name. These details, as well as the assertions used in the tests, confirm that the order was successfully updated. Summary Effectively testing PATCH APIs in automation involves validating partial updates by sending only the required fields and verifying that unchanged data remains intact. Using a dynamic approach such as DataFaker, Google Gson, and constructing a request body with POJOs, Builder Pattern, JSON files, or Java Map helps generate fresh test data, reducing duplication issues and easing maintenance. 
Following best practices such as proper authentication handling, validating response body and status code, and keeping tests reusable and maintainable ensures robust and scalable API test automation. Happy testing!
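The point about verifying that unchanged data remains intact can be sketched in plain Java. The example below assumes the order has been read into a Map before and after the PATCH call (for instance, from two GET responses, which this tutorial does not show). The class UnchangedFieldsCheck and the helper changedKeys are hypothetical names, and the "before" values for product_id and product_name are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class UnchangedFieldsCheck {

    // Returns the keys whose values differ between the two snapshots,
    // including keys that exist in only one of them.
    static Set<String> changedKeys(Map<String, Object> before, Map<String, Object> after) {
        Set<String> allKeys = new TreeSet<>(before.keySet());
        allKeys.addAll(after.keySet());
        Set<String> changed = new TreeSet<>();
        for (String key : allKeys) {
            if (!Objects.equals(before.get(key), after.get(key))) {
                changed.add(key);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Hypothetical snapshot of the order before the PATCH request.
        Map<String, Object> before = new LinkedHashMap<>();
        before.put("id", 2);
        before.put("product_id", "101");
        before.put("product_name", "Old Product");
        before.put("qty", 1);
        before.put("total_amt", 757.99);

        // Snapshot after the PATCH: only the two patched fields differ.
        Map<String, Object> after = new LinkedHashMap<>(before);
        after.put("product_id", "702");
        after.put("product_name", "Sleek Silk Plate");

        Set<String> expected = Set.of("product_id", "product_name");
        Set<String> actual = changedKeys(before, after);
        if (!actual.equals(expected)) {
            throw new AssertionError("Unexpected fields changed: " + actual);
        }
        System.out.println("Only the patched fields changed: " + actual);
    }
}
```

Comparing the set of changed keys against the set of fields sent in the request catches both a field that failed to update and an untouched field that was silently overwritten.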
TL;DR: Understand the Claude Desktop Architecture and Save Time You configured Claude in Claude Desktop, wrote instructions, uploaded reference files, and set your preferences. Then you clicked the Cowork tab. Unfortunately, Claude had no memory of what you just did. Your instructions were gone, as were your files and preferences. You assumed this was a bug, but it is a feature: You switched applications. The Claude Desktop App Hosts 3 Separate Applications The tabs at the top of Claude Desktop (Chat, Cowork, Code) appear to be views of the same product. They are not. For example, Anthropic’s own documentation describes Cowork as using “the same agentic architecture that powers Claude Code”. However, in practice, each tab runs on a different execution layer with its own sandbox, memory system, and instruction hierarchy. The architectural split matters: Cowork and Code share an engine. Chat is a separate system entirely. A useful functional shorthand is as follows: Chat is for thinking: It runs in the cloud on Anthropic’s servers. It cannot access files on your machine; you have to provide those. In Chat, you converse, you reason, you get answers.Cowork is for doing: It runs inside a sandboxed Linux virtual machine, or VM in short, on your local computer. It reads and writes files in folders you mount, works autonomously in the background, and wipes the VM after every session. (Which is also, as you may imagine, the reason that Cowork does not remember previous sessions: The previously used VM is gone.)Code is for building: It runs natively in your terminal with full system access and no sandbox. It is made for engineers. So, there is an architectural reason why the instructions you just spent 20 minutes writing do not follow you when you move between tabs. Let’s see what crosses the tab boundary and what does not: The Word “Project” Means 3 Different Things This is the collision that wastes the most time. Which of these three did you configure last week? 
The Cowork Projects documentation confirms that Cowork projects live locally on your desktop, separate from Chat Projects. Your Chat Project knowledge base is invisible to Cowork and Code. When Cowork says “choose a project,” it offers three options: start from scratch (a new folder), import from a Chat Project (a one-way snapshot, not a live link, not future synchronization between the two either), or use an existing folder on your hard drive. The word “Project” appears three times on that screen, referring to different things. Memory, Artifacts, and Instructions Collide, too Given the current architectural state of three different Claude apps, posing as one, this pattern repeats across every shared term. Memory: Chat auto-summarizes your conversations in the cloud. Cowork has project-scoped memory only (Note: that refers to “projects” listed in the sidebar.) Standalone Cowork sessions without a project remember nothing, because the VM that ran the session is wiped when it ends. Code uses CLAUDE.md files, plus an auto-memory system.Artifacts: In Chat, an artifact is a rendered preview in a side panel (HTML, React, SVG). In Cowork, the same word means a real file on your disk (.docx, .xlsx, .pdf) or a Live Artifact (a persistent interactive dashboard that survives session restarts).Instructions: Chat has two instruction locations (Profile Preferences and Project Instructions) plus a Styles selector for writing tone. Cowork has three different locations (Global, Folder, Project). Code has a five-tier hierarchy: managed policies, CLI flags, .claude/settings.local.json, .claude/settings.json, and ~/.claude/settings.json, plus CLAUDE.md files at user, project, and local levels. None of the instructions syncs across tabs. Count the instruction locations you have configured. Now count the ones you assumed were active in a different tab. That is the gap. 
Watch Out When Working With Claude Desktop: Back Up Your Folder Before Your First Cowork Session Cowork’s sandbox prevents access to files outside your mounted folder. Inside that folder, Cowork has full read and write access. It does not archive. It does not move files to a trash folder. When it deletes, the files are gone. On the day Cowork launched in January 2026, a user recorded their first session on video. They asked Cowork to “clean up” a folder. Cowork ran an rm -rf command inside the autonomous Linux VM and permanently deleted 11 GB of files. The video went viral on Hacker News. Anthropic has since added a deletion confirmation prompt that requires explicit permission before Cowork permanently deletes any files. The underlying access model has not changed: inside your mounted folder, Cowork can do anything. As of May 2026, these actions leave no audit trail. Anthropic states this directly: “Do not use Cowork for regulated workloads.” If you work in a regulated industry, that sentence applies to you. If it is gone, it is gone. Back up every folder you mount to Cowork, that’s non-negotiable. Obviously, Anthropic Knows About the Tab Isolation Dispatch, available as a research preview for Pro and Max plans, lets you send tasks from your phone to a Cowork session running on your desktop. It is a mobile-to-desktop bridge. The isolation between Chat, Cowork, and Code remains. Dispatch signals where the product is heading. 2 Documents So You Do Not Have to Discover This The Hard Way I put together three companion documents for the introductory module of my upcoming Claude Cowork Online Course. They cover the architecture, the terminology collisions, and the practical setup steps. 
I am sharing two of them here because the confusion they address is real and widespread, and nobody should have to discover these things by losing work: The Quick Reference Card maps Chat, Cowork, and Code across nine dimensions: environment, file access, sandbox, execution model, project type, memory, output type, extensions, and instruction locations. Pin it to your wall or keep it open during your first week with Cowork: Working with Cowork: Quick Reference Card.The Terminology Collision Glossary maps eight terms (Project, Memory, Artifacts, Instructions, Workspace, Session, Tool, Agent) across four surfaces (Chat, Cowork, Code, API). The “Project” row alone will save you thirty minutes of confusion: Working with Cowork: Terminology Glossary. Conclusion: Before You Start With Claude Desktop and Cowork, Take 4 Steps in 5 Minutes If you are about to use Cowork for the first time, do these four things: Create a dedicated folder for Cowork. Not your Documents folder. Not your Desktop. A purpose-built folder with a clear name within your existing local file system.Set up backups for that folder before you mount it. Time Machine on macOS. File History on Windows. Git if you prefer. Do this before you give Cowork access.Open Cowork, create a project by choosing Project from the sidebar and clicking “New Project”, and point it at that folder. Write one sentence of instructions describing what you use this workspace for. (You can iterate on the instructions later.)Switch between all three tabs. Verify for yourself that your Project, your instructions, and your memory do not follow you. Invest five minutes of your time, and these four steps prevent the mistakes that cost people hours. Once you stop fighting Claude Desktop’s architecture and start working with it, Cowork becomes a different tool entirely. That is what the rest of the course is about.
Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing
May 21, 2026