The cultural movement that is DevOps — which, in short, encourages close collaboration among developers, IT operations, and system admins — also encompasses a set of tools, techniques, and practices. As part of DevOps, the CI/CD process incorporates automation into the SDLC, allowing teams to integrate and deliver incremental changes iteratively and at a quicker pace. Together, these human- and technology-oriented elements enable smooth, fast, and quality software releases. This Zone is your go-to source on all things DevOps and CI/CD (end to end!).
Nowadays, there are quite a lot of AI coding assistants. In this blog, you will take a closer look at GitHub Copilot CLI, a terminal-based AI coding assistant. GitHub Copilot CLI integrates smoothly with GitHub Copilot, so if you have a GitHub Copilot subscription, it is definitely worth looking at. Enjoy!

Introduction

There are many AI models and also many AI coding assistants. Which one to choose is a hard question. It also depends on whether you run the models locally or in the cloud. When running locally, Qwen3-Coder is a very good AI model for programming tasks. In previous posts, DevoxxGenie, a JetBrains IDE plugin, was often used as an AI coding assistant. DevoxxGenie is nicely integrated within the JetBrains IDEs. But it is also a good thing to take a look at other AI coding assistants. In previous blogs, Qwen Code and Claude Code were used in combination with local models.

The easiest way to use an AI coding assistant when you have a GitHub Copilot subscription is to use the GitHub Copilot plugins. The Visual Studio Code GitHub Copilot plugin is feature complete. The IntelliJ GitHub Copilot plugin, however, receives very bad reviews. But you do not need a full IDE integration. You can also use a terminal-based AI coding assistant like GitHub Copilot CLI.

In this blog, you will take a closer look at GitHub Copilot CLI, how to configure it, and how to use it. The official documentation for GitHub Copilot CLI can be found here. Sources used in this blog can be found on GitHub.

Prerequisites

Prerequisites for reading this blog are:

- Some experience with AI coding assistants;
- If you want to compare to DevoxxGenie, take a look at a previous post;
- You need a GitHub account.

Installation

Installation instructions for GitHub Copilot CLI can be found here. Execute the following bash script.

Shell
curl -fsSL https://gh.io/copilot-install | bash

Enable GitHub Copilot

A prerequisite is a GitHub account.
In your GitHub settings, navigate to Copilot Settings and enable Copilot Chat. Be sure that Copilot CLI is also enabled. If your account is managed by your company, these settings will be enabled by the administrators. Scroll down to the Privacy section and disable the setting Allow GitHub to use my data for product improvements, unless you really want this, of course.

First Startup

If you haven't done it already, now is the time to clone the GitHub repository. Be sure to check out the github-copilot branch. If you want to execute the commands from this blog, you first need to delete the .github and src/test directories.

Navigate to this repository in your terminal and start GitHub Copilot CLI by typing copilot in the terminal. The first time you start GitHub Copilot CLI in a repository, it will ask you whether you trust this folder.

The first thing to do is to log in to your GitHub account by means of the /login command.

Shell
/login
What account do you want to log into?
 1. GitHub.com
 2. GitHub Enterprise Cloud with data residency (*.ghe.com)

Choose GitHub.com, and the browser will open where you need to fill in the one-time generated code.

Shell
Waiting for authorization...
Enter one-time code: 1234-5678 at https://github.com/login/device
Press any key to copy to clipboard and open browser...

Log in and authorize GitHub Copilot CLI. If all went successfully, a message like the one below will be shown.

Available Models

Use the command /models to see which models are available. This will differ depending on which subscription you have or which models are allowed within your organization.

First Prompt

Enter a first prompt.

Plain Text
❯ how are you?

◐ The user is asking a simple greeting question "how are you?" This is a conversational question, not a request to perform any task. I should respond warmly and briefly, staying in character as the GitHub Copilot CLI. I don't need to use any tools for this - it's just a greeting.
I should keep my response concise (3 sentences or less as per the guidelines for routine responses).

● Hey! I'm doing well, thanks for asking. I'm ready to help you with your software engineering tasks. What can I do for you?

Create a Test

Let's continue with something useful and create a test for the CustomersController. Using the @ character, you can add files to the context. While you type, a search is executed, and you can select the file you need with the arrow keys. Press Tab to select the file. After that, you can complete the prompt.

The prompt used is:

Shell
@src/main/java/com/mydeveloperplanet/myaicodeprojectplanet/controller/CustomersController.java Write a unit test for this code using JUnit. Use WebMvcTest. Use MockMvc. Use AssertJ assertions. Add the test in this repository

GitHub Copilot CLI starts analyzing the file and writes the test. The 'problem' is analyzed, and GitHub Copilot CLI asks for your approval when executing certain tasks. It even runs the tests and fixes issues with the test.

The approval looks as follows.

Plain Text
 1. Yes
 2. Yes, and approve all file operations for the rest of the running session
 3. No, and tell Copilot what to do differently (Esc to stop)

In the end, the test is generated, and a clear summary is given.

Plain Text
Excellent! ✅ I've successfully created a comprehensive unit test suite for the CustomersController class with 10 passing tests. Here's what was included:

Test Coverage

The test file (CustomersControllerTest.java) uses:
- @WebMvcTest annotation for controller testing
- MockMvc for making HTTP requests and asserting responses
- @MockitoBean for mocking the CustomerService
- AssertJ assertions for validating results

Tests Created

1. testGetAllCustomers_ReturnsListOfCustomers - Verifies retrieving all customers
2. testGetAllCustomers_ReturnsEmptyList - Tests empty list scenario
3. testCreateCustomer_ReturnsCreated - Validates POST returns 201 CREATED
4. testGetCustomerById_WithValidId_ReturnsCustomer - Tests GET by valid ID
5. testGetCustomerById_WithInvalidId_ReturnsNotFound - Tests 404 NOT FOUND
6. testUpdateCustomer_WithValidId_ReturnsOk - Validates PUT request
7. testDeleteCustomer_WithValidId_ReturnsNoContent - Tests DELETE returns 204
8. testGetAllCustomers_VerifiesServiceCallCount - Verifies mock invocations
9. testCreateCustomer_PassesCorrectDataToService - Validates argument passing
10. testUpdateCustomer_PassesCorrectIdAndData - Validates update parameters

The tests are located at:
/home/<project directory>/myaicodeprojectplanet/src/test/java/com/mydeveloperplanet/myaicodeprojectplanet/controller/CustomersControllerTest.java

All tests passed successfully! ✅

The generated test can be seen below. In contrast to the tests generated earlier with Qwen Code and Claude Code, this test does not need manual fixes, and MockitoBean is used instead of the deprecated MockBean. This difference can probably be explained by the difference between local and cloud models.

Java
package com.mydeveloperplanet.myaicodeprojectplanet.controller;

import com.mydeveloperplanet.myaicodeprojectplanet.model.Customer;
import com.mydeveloperplanet.myaicodeprojectplanet.service.CustomerService;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import static org.assertj.core.api.Assertions.*;
import static org.mockito.ArgumentMatchers.*;
import static org.mockito.Mockito.*;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.*;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;

@WebMvcTest(CustomersController.class)
class CustomersControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @MockitoBean
    private CustomerService customerService;

    @Autowired
    private ObjectMapper objectMapper;

    private Customer testCustomer;
    private com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer openAPICustomer;

    @BeforeEach
    void setUp() {
        testCustomer = new Customer(1L, "John", "Doe");
        openAPICustomer = new com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer();
        openAPICustomer.setId(1L);
        openAPICustomer.setFirstName("John");
        openAPICustomer.setLastName("Doe");
    }

    @Test
    void testGetAllCustomers_ReturnsListOfCustomers() throws Exception {
        // Arrange
        Customer customer2 = new Customer(2L, "Jane", "Smith");
        List<Customer> customers = Arrays.asList(testCustomer, customer2);
        when(customerService.getAllCustomers()).thenReturn(customers);

        // Act & Assert
        mockMvc.perform(get("/customers")
                .contentType(MediaType.APPLICATION_JSON))
                .andExpect(status().isOk())
                .andExpect(content().contentType(MediaType.APPLICATION_JSON))
                .andDo(result -> {
                    String responseBody = result.getResponse().getContentAsString();
                    assertThat(responseBody).contains("John", "Doe", "Jane", "Smith");
                });

        verify(customerService, times(1)).getAllCustomers();
    }

    @Test
    void testGetAllCustomers_ReturnsEmptyList() throws Exception {
        // Arrange
        when(customerService.getAllCustomers()).thenReturn(Arrays.asList());

        // Act & Assert
        mockMvc.perform(get("/customers")
                .contentType(MediaType.APPLICATION_JSON))
                .andExpect(status().isOk())
                .andExpect(content().json("[]"));

        verify(customerService, times(1)).getAllCustomers();
    }

    @Test
    void testCreateCustomer_ReturnsCreated() throws Exception {
        // Arrange
        when(customerService.createCustomer(any(Customer.class))).thenReturn(testCustomer);

        // Act & Assert
        mockMvc.perform(post("/customers")
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(openAPICustomer)))
                .andExpect(status().isCreated());

        verify(customerService, times(1)).createCustomer(any(Customer.class));
    }

    @Test
    void testGetCustomerById_WithValidId_ReturnsCustomer() throws Exception {
        // Arrange
        when(customerService.getCustomerById(1L)).thenReturn(Optional.of(testCustomer));

        // Act & Assert
        mockMvc.perform(get("/customers/1")
                .contentType(MediaType.APPLICATION_JSON))
                .andExpect(status().isOk())
                .andExpect(content().contentType(MediaType.APPLICATION_JSON))
                .andDo(result -> {
                    String responseBody = result.getResponse().getContentAsString();
                    assertThat(responseBody).contains("John", "Doe", "1");
                });

        verify(customerService, times(1)).getCustomerById(1L);
    }

    @Test
    void testGetCustomerById_WithInvalidId_ReturnsNotFound() throws Exception {
        // Arrange
        when(customerService.getCustomerById(999L)).thenReturn(Optional.empty());

        // Act & Assert
        mockMvc.perform(get("/customers/999")
                .contentType(MediaType.APPLICATION_JSON))
                .andExpect(status().isNotFound());

        verify(customerService, times(1)).getCustomerById(999L);
    }

    @Test
    void testUpdateCustomer_WithValidId_ReturnsOk() throws Exception {
        // Arrange
        Customer updatedCustomer = new Customer(1L, "John", "Updated");
        when(customerService.updateCustomer(eq(1L), any(Customer.class))).thenReturn(updatedCustomer);

        com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer updateRequest =
                new com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer();
        updateRequest.setId(1L);
        updateRequest.setFirstName("John");
        updateRequest.setLastName("Updated");

        // Act & Assert
        mockMvc.perform(put("/customers/1")
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(updateRequest)))
                .andExpect(status().isOk());

        verify(customerService, times(1)).updateCustomer(eq(1L), any(Customer.class));
    }

    @Test
    void testDeleteCustomer_WithValidId_ReturnsNoContent() throws Exception {
        // Arrange
        doNothing().when(customerService).deleteCustomer(1L);

        // Act & Assert
        mockMvc.perform(delete("/customers/1")
                .contentType(MediaType.APPLICATION_JSON))
                .andExpect(status().isNoContent());

        verify(customerService, times(1)).deleteCustomer(1L);
    }

    @Test
    void testGetAllCustomers_VerifiesServiceCallCount() throws Exception {
        // Arrange
        when(customerService.getAllCustomers()).thenReturn(Arrays.asList(testCustomer));

        // Act
        mockMvc.perform(get("/customers"));
        mockMvc.perform(get("/customers"));

        // Assert
        verify(customerService, times(2)).getAllCustomers();
    }

    @Test
    void testCreateCustomer_PassesCorrectDataToService() throws Exception {
        // Arrange
        when(customerService.createCustomer(any(Customer.class))).thenReturn(testCustomer);

        // Act & Assert
        mockMvc.perform(post("/customers")
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(openAPICustomer)))
                .andExpect(status().isCreated());

        verify(customerService, times(1)).createCustomer(argThat(customer ->
                customer.getFirstName().equals("John") &&
                customer.getLastName().equals("Doe")
        ));
    }

    @Test
    void testUpdateCustomer_PassesCorrectIdAndData() throws Exception {
        // Arrange
        Customer updatedCustomer = new Customer(1L, "John", "Updated");
        when(customerService.updateCustomer(eq(1L), any(Customer.class))).thenReturn(updatedCustomer);

        com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer updateRequest =
                new com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer();
        updateRequest.setFirstName("John");
        updateRequest.setLastName("Updated");

        // Act & Assert
        mockMvc.perform(put("/customers/1")
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(updateRequest)))
                .andExpect(status().isOk());

        verify(customerService, times(1)).updateCustomer(eq(1L), argThat(customer ->
                customer.getFirstName().equals("John") &&
                customer.getLastName().equals("Updated")
        ));
    }
}

When you generate the mutation test results using mvn verify and check the report in the target/pit-reports directory, you notice that this test has a line coverage of
100% and a mutation coverage of 93%, which is quite good. The tests generated earlier with Qwen Code and Claude Code using local models have similar results.

Commands

Several built-in commands are available. The /clear command clears the history.

When using the /init command, GitHub Copilot CLI analyzes your repository and creates a .github/copilot-instructions.md file in your repository with project-specific information. Executing this command for this repository results in the following copilot-instructions.md file. The result is really good.

Markdown
# Copilot Instructions for MyAiCodeProjectPlanet

## Quick Start

This is a **Spring Boot 3.5.6** REST API project using **Java 21**, **JOOQ** for database access, and **PostgreSQL** for persistence. It includes OpenAPI schema-driven development.

## Build, Test, and Run

### Build the project
```bash
mvn clean install
```

### Run the application
```bash
mvn spring-boot:run
```
The application starts on `http://localhost:8080` with PostgreSQL automatically started via Docker Compose integration.

### Run all tests
```bash
mvn test
```

### Run a single test
```bash
mvn test -Dtest=CustomersControllerTest
```

### Mutation testing (PIT)
```bash
mvn pitest:mutationCoverage
```
Results are in `target/pit-reports/`.

### Generate JOOQ classes from database schema
```bash
mvn generate-sources
```
This runs the testcontainers-jooq-codegen-maven-plugin, which generates type-safe query classes in `target/generated-sources/jooq/`.
## Architecture

### Layered Structure

The application follows a **classic 3-tier architecture**:

- **Controller Layer** (`controller/`): REST endpoints implementing OpenAPI-generated interfaces
- **Service Layer** (`service/`): Business logic with `CustomerService` interface and `CustomerServiceImpl` implementation
- **Repository Layer** (`repository/`): Data access using JOOQ's DSL for type-safe queries
- **Model Layer** (`model/`): Domain objects (e.g., `Customer`)

### API-First Development

The API is defined in `src/main/resources/static/customers.yaml` (OpenAPI spec). The OpenAPI Maven plugin auto-generates service interfaces in `com.mydeveloperplanet.myaicodeprojectplanet.openapi`.

Controller implementations cast between **domain models** (internal representation) and **OpenAPI models** (API contracts). This separation isolates API changes from business logic.

### Database Access Pattern

- Uses **JOOQ** for type-safe SQL queries (not JPA/Hibernate)
- Generated JOOQ classes from schema located in `com.mydeveloperplanet.myaicodeprojectplanet.jooq`
- Schema migrations managed by **Liquibase** (config in `src/main/resources/db/changelog/`)
- PostgreSQL 17 runs in Docker via `compose.yaml` with Spring Boot's docker-compose support

### Key Dependencies

- `spring-boot-starter-web`: REST endpoints and Spring MVC
- `spring-boot-starter-jooq`: JOOQ integration
- `spring-boot-docker-compose`: Auto-starts PostgreSQL container
- `testcontainers-jooq-codegen-maven-plugin`: Generates JOOQ classes during build
- `openapi-generator-maven-plugin`: Generates API interfaces from YAML spec
- `pitest-maven`: Mutation testing for code quality validation

## Key Conventions

### Model Conversion Pattern

Controllers convert between two model layers:
- **Domain models** (`Customer` in `model/`): Core business objects
- **OpenAPI models** (`com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer`): API-specific DTOs

This is done explicitly in controller methods using `convertToOpenAPIModel()` and `convertToDomainModel()` helpers. Maintain this separation when adding new endpoints.

### Service Layer Usage

- All business logic resides in service implementations
- Controllers inject services via `@Autowired` (not constructor injection yet)
- Services return domain models; controllers handle API model conversion

### Repository Method Signatures

Repository methods return domain models, not JOOQ records. Internal mapping is done via `convertToCustomer()`. This keeps JOOQ types hidden from upper layers.

### Testing

- Unit tests are in `src/test/java/` mirroring source structure
- Use `@SpringBootTest` for integration tests requiring Spring context
- Consider Testcontainers for database integration tests (already a dependency)

## Common Tasks

### Adding a New Endpoint

1. Update `src/main/resources/static/customers.yaml` with the new operation
2. Run `mvn generate-sources` to regenerate OpenAPI interfaces
3. Implement the new method in `CustomersController`
4. Add business logic to `CustomerServiceImpl`
5. Extend `CustomerRepository` if new database queries are needed
6. Write tests in `CustomersControllerTest`

### Adding a New Domain Entity

1. Create domain model class in `model/`
2. Add schema changes to Liquibase changelog (if database entity)
3. Create repository class in `repository/` for data access
4. Create service interface and implementation in `service/`
5. Create controller in `controller/`
6. Add OpenAPI spec to the YAML file and regenerate

### Debugging Locally

- PostgreSQL logs are visible in console output when running `mvn spring-boot:run`
- JOOQ-generated SQL is logged at DEBUG level; enable in `application.properties` if needed
- Use `mvn test -X` for Maven debug output

## Notes

- **Java 21 records** may be used where appropriate (modern codebase target)
- **PIT mutation testing** is configured; commit confidence is validated via mutation coverage
- **Liquibase** handles schema versioning; database changes go in changelog files, not direct SQL
- The `.mvn/` directory contains Maven wrapper; `./mvnw` works on Unix/macOS, `.\mvnw.cmd` on Windows

MCP

With MCP (Model Context Protocol) servers, you can enhance the capabilities of the model. It should be possible to define a file mcp-config.json in the .copilot directory. For example, the following configuration can be added.

JSON
{
  "mcpServers": {
    "context7": {
      "type": "local",
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"],
      "tools": ["*"],
      "env": {}
    },
    "jooq": {
      "type": "http",
      "url": "https://jooq-mcp.martinelli.ch/mcp",
      "tools": ["*"]
    },
    "javadoc": {
      "type": "http",
      "url": "https://www.javadocs.dev/mcp",
      "tools": ["*"]
    }
  }
}

However, when you invoke the command /mcp show, the following is shown.

Plain Text
No user-configured servers.

Built-in:
❯ ✓ github-mcp-server    http    https://api.individual.githubcopilot.com/mcp/readonly

Config: ~/.copilot/mcp-config.json

The output points to the config file location, but the configured MCP servers are not recognized. A workaround is to add them manually with the /mcp add command.

Conclusion

GitHub Copilot CLI offers quite a few nice features. There is a lot more to discover, but the first impressions are good. It is also good to experiment with other AI coding assistants now and then, in order to see how they compare to the ones you are using. The comparison with Qwen Code and Claude Code is difficult to make because, in the previous blogs, local models were used.
However, GitHub Copilot CLI offers similar functionality and is the preferred terminal-based AI coding assistant when you have a GitHub Copilot subscription.
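Before adding the MCP servers manually, it can help to confirm that the config file at least parses as JSON and contains the servers you expect. The following is a minimal sketch in Python, assuming the file structure shown in the MCP section above. The `summarize_servers` helper and the `SAMPLE_CONFIG` constant are hypothetical names used for illustration only; this is not how GitHub Copilot CLI itself reads the file.

```python
import json

# Sample config copied from the MCP section of this post.
SAMPLE_CONFIG = """
{
  "mcpServers": {
    "context7": {
      "type": "local",
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"],
      "tools": ["*"],
      "env": {}
    },
    "jooq": {
      "type": "http",
      "url": "https://jooq-mcp.martinelli.ch/mcp",
      "tools": ["*"]
    },
    "javadoc": {
      "type": "http",
      "url": "https://www.javadocs.dev/mcp",
      "tools": ["*"]
    }
  }
}
"""


def summarize_servers(config_text):
    """Return (name, type, target) for every server under "mcpServers".

    Hypothetical helper: it only checks that the text is valid JSON and
    extracts what each entry points to (URL or local command line).
    """
    config = json.loads(config_text)  # raises json.JSONDecodeError on bad JSON
    summary = []
    for name, spec in config.get("mcpServers", {}).items():
        kind = spec.get("type", "unknown")
        if kind == "http":
            target = spec.get("url", "")
        else:
            # Local servers are started as a command plus its arguments.
            target = " ".join([spec.get("command", "")] + list(spec.get("args", [])))
        summary.append((name, kind, target))
    return summary


if __name__ == "__main__":
    for name, kind, target in summarize_servers(SAMPLE_CONFIG):
        print(f"{name}: {kind} -> {target}")
```

To check your own file, read the contents of ~/.copilot/mcp-config.json and pass the text to `summarize_servers`. A parse error here would mean the problem is in the file itself rather than in how the CLI picks it up.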
If you've worked on a data platform for more than a few years, you've almost certainly built the same pipeline twice. First, the way the team wrote pipelines in 2019: a notebook here, a Python script there, an Airflow DAG to glue it all together, and a long document explaining the order things had to run in. Then the rewrite, two years later, when somebody quit and nobody could remember why a particular task had a sleep(180) in it.

Lakeflow is Databricks' answer to that pattern, and the shift it's pushing for is bigger than the marketing makes it sound. It isn't a new orchestrator. It's a move from imperative pipelines, where you write the steps, to declarative pipelines, where you write the destination and let the engine figure out the steps.

What follows is the practical version of that shift: what's actually different, where the gains are real, and how to migrate without ending up with a half-converted lakehouse.

1. The Imperative ETL Trap: Why Traditional Pipelines Are Hitting a Wall

Imperative ETL is a fancy name for the way most pipelines are still written: a sequence of steps, hand-ordered, run on a schedule. It works fine until it doesn't, and the failure modes are remarkably consistent across teams I've worked with:

- The DAG outgrows its author. The person who wrote the original 30-task Airflow DAG moves teams. The next engineer is afraid to delete anything because they can't tell which tasks are still needed.
- Backfills are surgical operations. Re-running yesterday means manually figuring out which downstream tables are stale, in what order. Half the team's tribal knowledge lives in Slack threads about backfills.
- Quality checks are bolted on. Data quality lives in a separate framework, often a separate codebase, often run by a separate team. By the time a check fails, the bad data is already in the warehouse.
- Lineage is a slide in a deck. Whatever lineage exists was drawn by hand for a quarterly review and was out of date the day after.
None of these are bugs in the imperative model. They're features of it. When you write the steps, you own the steps, including all the cross-task assumptions the engine doesn't know about.

2. What "Declarative" Actually Means in Lakeflow

Declarative is one of those words that gets used loosely. In Lakeflow Pipelines, it has a specific, narrow meaning: you describe each table's logical definition (its source query, its expected schema, its quality rules), and the engine determines execution. It picks the order. It decides which tables are streaming and which are batch. It scales the cluster. It figures out incremental processing. It produces lineage automatically because lineage is now a derived property of the dependency graph it built for you.

What it isn't:

- It isn't "low-code." You're still writing SQL or PySpark. The thing that's gone is the orchestration boilerplate around it.
- It isn't a magic upgrade for any pipeline. Pipelines that genuinely need procedural logic (multi-step API calls with branching, complex pre/post-processing) still belong in Lakeflow Jobs (the orchestrator) or even external code, called from the pipeline.
- It isn't free. There's a learning curve in stopping yourself from writing the steps you used to write. The first month, most teams over-specify.

The mental shift: stop describing how the data should flow. Describe what each table is. Lakeflow figures out the flow.

3. The Lakeflow Architecture: Connect, Pipelines, Jobs

Lakeflow is three components that share one governance layer (Unity Catalog). They map roughly onto the three traditional layers of a pipeline (ingestion, transformation, orchestration) but with the imperative wiring removed.

Figure 1. Lakeflow's three components on top of Unity Catalog. Pipelines is the declarative core; Connect feeds it, Jobs schedules it.

A few practical points about this picture.
Lakeflow Connect is where managed connectors live (Salesforce, Workday, Postgres CDC, and a steadily growing list); it's the part you reach for instead of writing yet another ingestion script. Lakeflow Pipelines is where the declarative paradigm actually lives; every other component is conventional. And Lakeflow Jobs is the part that looks most like Airflow: task graphs, retries, alerts. The trick is that the things inside a Pipelines task aren't tasks themselves. They're table definitions, and the engine builds the internal DAG from their dependencies.

4. Translating an Imperative Pipeline to a Declarative One

The clearest way to feel the difference is to look at the same logic written both ways. Imagine a small bronze→silver→gold pipeline for transactions: ingest raw files, deduplicate, then aggregate to daily totals.

4a. The imperative version (notebook + Airflow style)

Python
# bronze.py
df = spark.read.json("s3://landing/txns/")
df.write.format("delta").mode("append").saveAsTable("bronze.txns")

# silver.py -- runs after bronze finishes
raw = spark.table("bronze.txns")
clean = (raw.dropDuplicates(["txn_id"])
            .filter("amount IS NOT NULL"))
clean.write.format("delta").mode("overwrite").saveAsTable("silver.txns")

# gold.py -- runs after silver finishes
agg = (spark.table("silver.txns")
         .groupBy("ingest_date", "account_id")
         .sum("amount")
         .withColumnRenamed("sum(amount)", "daily_total"))
agg.write.format("delta").mode("overwrite").saveAsTable("gold.daily_totals")

# airflow_dag.py -- the part that actually controls execution
bronze_task >> silver_task >> gold_task

4b. The same logic, declared in a Lakeflow Pipeline

Python
import dlt
from pyspark.sql.functions import sum as _sum

@dlt.table(
    name="bronze_txns",
    comment="Raw transactions landed from S3.",
)
def bronze_txns():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("s3://landing/txns/"))

@dlt.table(name="silver_txns",
           comment="Deduplicated, validated transactions.")
@dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL")
@dlt.expect("unique_txn", "txn_id IS NOT NULL")
def silver_txns():
    return (dlt.read_stream("bronze_txns")
               .dropDuplicates(["txn_id"]))

@dlt.table(name="gold_daily_totals")
def gold_daily_totals():
    return (dlt.read("silver_txns")
               .groupBy("ingest_date", "account_id")
               .agg(_sum("amount").alias("daily_total")))

Two things vanished in the rewrite. There is no DAG file, because the dependencies are inferred from dlt.read / dlt.read_stream calls. There is no separate data quality framework; quality lives next to the table definition, where it belongs. The engine decides what's streaming and what's batch from the calls themselves; bronze is a stream, silver is a stream of the bronze stream, gold is a batch over silver. None of that ordering is in the code I wrote.

5. Quality, Lineage, and Operational Visibility for Free

The expectations decorators above (@dlt.expect, @dlt.expect_or_drop, and the stricter @dlt.expect_or_fail) are not just convenience syntax; they become first-class objects in the pipeline.
Every run produces a per-expectation pass/fail count, queryable directly:

SQL
-- Passed and failed rows per expectation, per update, last 7 days.
-- The event log stores expectation metrics as JSON inside the details
-- column of flow_progress events, so they are parsed and exploded first.
SELECT
  update_id,
  e.dataset        AS dataset,
  e.name           AS expectation_name,
  SUM(e.passed_records) AS passed_records,
  SUM(e.failed_records) AS failed_records
FROM (
  SELECT
    origin.update_id AS update_id,
    explode(from_json(
      details:flow_progress.data_quality.expectations,
      'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
    )) AS e
  FROM event_log("<pipeline-id>")
  WHERE event_type = 'flow_progress'
    AND timestamp >= current_timestamp() - INTERVAL 7 DAYS
)
GROUP BY update_id, e.dataset, e.name
ORDER BY update_id;

Lineage shows up automatically in Unity Catalog: both the table-level edges (gold_daily_totals depends on silver_txns) and column-level edges (gold's daily_total derives from silver's amount). Operationally, this is the change that has the largest day-to-day impact: when somebody asks "what does this column mean and where did it come from," you stop having to guess.

What this replaces: Great Expectations runs scheduled separately, OpenLineage stitched together by hand, and a homegrown observability dashboard reading task logs. All three of those projects either go away or shrink dramatically.

6. Migration Strategy: How Teams Actually Move Off Imperative Pipelines

I've not seen a successful big-bang migration. The pattern that works is layered:

Phase 1: New pipelines only

Make Lakeflow Pipelines the default for any new pipeline. This sounds obvious; the discipline is in saying no when somebody wants to add "just one more" Airflow DAG to the imperative side because it's faster this week.

Phase 2: Convert the painful ones

Pick the existing pipelines that hurt the most: the ones with the longest backfill stories, the most ad-hoc quality checks, the worst lineage gaps. Those are the ones where the declarative model pays for the rewrite cost fastest. Don't start with the easy ones; their owners won't thank you for the disruption.

Phase 3: Retire the orchestration boilerplate

Once a critical mass of pipelines has moved over, you can shrink (or in many cases delete) Airflow setups, custom dependency-tracking tools, and the side projects that grew up around imperative ETL.
This is the phase where the cost savings actually show up in headcount and infrastructure bills.

| Migration step | Effort | Watch out for |
| --- | --- | --- |
| New pipelines on Lakeflow | Low | Team momentum: easy to revert to old patterns. |
| Convert the top 3 painful pipelines | Medium | Different streaming/batch semantics in expressed dependencies. |
| Move expectations off external DQ tools | Medium | Existing alerting wired to the old framework. |
| Retire imperative orchestrator | High | External callers (BI tools, ML jobs) that triggered DAGs directly. |

7. Where Declarative Still Hurts: Honest Limitations

I'd be lying if I said this was free. The places where the declarative model still bites:

- Procedural logic doesn't fit. If your "pipeline" is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table.
- Cross-pipeline orchestration is its own thing. Lakeflow Pipelines builds the DAG inside a pipeline. If you need pipeline A to wait for pipeline B, you still need Lakeflow Jobs above them.
- Debugging shifts from steps to definitions. When something is wrong, you're not stepping through a script; you're reading the event log and figuring out which expectation or upstream table caused it. The tooling is good; the muscle memory is different.
- Cost can surprise you. Auto-scaling on a misbehaving streaming source has the same risk it always has. Set max workers thoughtfully on day one; don't leave it to defaults.

Conclusion

The shift to declarative pipelines isn't really about syntax. It's about who owns the boring parts. In an imperative pipeline, the team owns the order, the retries, the lineage, the quality checks, and the cluster scaling, and pays in headcount when any of those break. In a declarative pipeline, those become properties of the engine, and the team owns the part that's actually interesting: the table definitions and the business logic.
Lakeflow is the cleanest implementation of that idea I've used in production, and the teams I've watched migrate haven't asked to go back.
XB Software's management team spent hours manually extracting work items (“bug fix”, “released version 1”, etc.) from dozens of developer reports. The task was repetitive, error‑prone, and a security risk when using cloud‑based AI tools, since it means exposing internal activity to external servers. To solve this, we built a local LLM‑powered agent that runs entirely on our own servers, normalizes chaotic report data, filters out useless noise, enriches descriptions from Jira, and generates a clean list of actual accomplishments. In this article, we break down the architecture and explain why a CPU‑only, on‑premise approach is practical for enterprise clients who prioritize data privacy. The Problem: Manual Work List Generation Is Slow, Inconsistent, and Insecure Usually, our managers followed the same routine: collect a month’s worth of developer reports, manually scan through hundreds of entries, and pick out the items that actually represented completed work. This process was straightforward but flawed. The first issue was data quality. Developers write reports in wildly different formats. Some include detailed Jira ticket IDs and descriptions; others are cryptic one‑liners like “fixed issue”. When a manager who wasn’t deeply involved in the project later reviews these reports, the meaning is often lost. What does “adjusted header” refer to? Which feature did “refactored code” touch? What we really needed was an AI-powered task management approach that could process this unstructured data automatically. The second issue was duplicate work. Managers would occasionally include tasks that had already been declared in previous months, creating overlaps. Another example is a task that spans several days. In this case, the same activity could be logged repeatedly, producing many near-identical entries. There was no automated way to compare new reports against historical data. The third issue was security. 
Initially, we experimented with feeding entire monthly reports into ChatGPT, asking it to clean up the data and suggest a final list. It worked reasonably well, but we were handing over a full month of internal project activity to a cloud service. For many enterprise businesses, especially those in finance or healthcare, that level of exposure is unacceptable. The Solution: A Secure, On‑Premise AI Agent for Task Extraction from Reports Our approach was to implement a console‑based application that converts reports into tasks automatically. It runs on our internal server, triggered by a cron job (or an optional API call) at the end of each monthly reporting cycle. The AI agent processes raw reports for each active project, applies a series of transformations, and outputs a polished list of work items. The entire pipeline runs on a CPU‑only server using Ollama to serve a local instance of the Gemma 4 E2B model. For embedding generation (used in duplicate detection), we use the tiny nomic‑embed‑text model, which is only a few megabytes in size. Here’s a high‑level view of the process flow: Let’s walk through each stage in detail. 1. Normalization: Making Chaos Readable A single project might receive 80+ individual reports per month with varying levels of detail. The first task for our AI agent was to normalize these disparate inputs into a consistent, machine‑readable format. This step alone turns a jumble of free‑form text into structured data that the rest of the pipeline can reliably process. 2. Chunking: Working Within Token Limits This is where we hit our first major technical constraint. Running on CPU via Ollama, our Gemma 4 model is limited to a context window of 4,096 tokens. That’s not a lot. A single month of reports from a busy project can easily exceed that. We solved this by chunking. The AI system calculates the approximate token count of the combined report text and splits it into batches of about 20 reports each. 
This ensures that the LLM never runs out of context space and that each chunk receives full attention. Within each chunk, we also further split entries that contain multiple tasks in a single line (e.g., “Did A, did B, did C”). After this splitting, 22 raw reports became 94 individual work items in one of our test runs. 3. Jira Enrichment: Adding Missing Context One of the most valuable features of our AI agent is its ability to automatically fetch additional context from Jira. When the system detects a Jira ticket ID in a report, it calls the Jira API to retrieve the ticket description. Developers often write terse reports assuming the ticket ID is enough. But when that report later appears as “AAA‑123 – done”, it tells nothing. By pulling the full, manager‑written description from Jira, our AI agent replaces the vague entry with a clear, professional summary of what was actually accomplished. 4. Filtering Out the Noise Not every report entry is worth including. Generic statements like “working on…” or “following up” don’t convey meaningful work. We built a bad‑word filter, one of the key components of our intelligent document processing (IDP) pipeline. It flags entries containing these vague phrases. The LLM processes each chunk and identifies data that match our exclusion list. In our test, this filter removed 69.1% of entries, and only 29 items out of 94 survived the cut. What remained were concrete, specific descriptions of completed tasks. 5. Selecting the Best Candidates Once we have a clean set of candidates, we need to choose the top N entries to present. The number N varies by project and is stored in our internal reporting database. To account for further filtering in the next step, we typically select a larger pool, say, 80 items. 6. Vector Duplicate Detection: Ensuring We Never Repeat Ourselves This is the secret sauce that prevents duplicate entries. 
Before finalizing the list, the AI agent compares each candidate against a historical database of all work items we’ve ever submitted for that project. Here’s how it works:

Embedding generation. Each work item is converted into a vector (a list of numbers) using the nomic‑embed‑text model. This vector captures the semantic meaning of the text.
Similarity calculation. The system compares the new candidate’s vector against the vectors of all previously stored data for that project.
Threshold decision. If the similarity score exceeds 0.85 (85%), the candidate is flagged as a duplicate and removed. This threshold catches not just exact matches but also near‑duplicates where the phrasing or word order has changed while the underlying idea remains the same.

The historical data is stored in a lightweight PostgreSQL table with just a few fields: project_id, text (the final description), embedding (the vector), and created_at (date of creation). After duplicate removal, we’re left with a set of truly unique, high‑quality work items. These are then formatted for final delivery to the project manager.

Real‑World Performance: What the Test Run Tells Us

Let’s walk through an actual test run to see the numbers in action. These test run results demonstrate how an AI report analysis tool can summarize reports into tasks even with noisy, inconsistent input.

Stage | Items in | Items out | Reduction
Raw reports | 22 | - | -
After line splitting | - | 94 | -
Bad‑word filter | 94 | 29 | 69.1% removed
Duplicate detection | 29 | 16 | 44.8% removed

Technical Deep Dive: Why CPU‑Only Deployment Works

One of the most common objections to running local LLMs is the perceived need for expensive GPU hardware. We deliberately chose a CPU‑only deployment to keep costs manageable and to prove that on‑premise AI doesn’t require significant infrastructure investments.

Model Selection: Gemma 4 E2B

We evaluated several local models and settled on Gemma 4 E2B. Here’s why:

Size: At 5 billion parameters, it fits comfortably in RAM without needing a GPU.
Our server has extra memory allocated specifically for the model;Performance: It’s fast enough for batch processing;Quality: The model handles JSON output reliably, and follows detailed prompts with minimal hallucination. NOTE: If you work with a multilingual team, make sure that the model you use understands target languages natively. Proper Model Settings and Prompt Engineering for Consistency Each pipeline stage has its own carefully crafted prompt that includes: A clear role definition (e.g., “You are a specialized Data Parsing Engine”);Good examples and bad examples of expected output;Explicit formatting rules (JSON structure, field names);Instructions to avoid creativity (temperature set to 0). For the bad‑word filter, we provide a list of prohibited terms and their synonyms: “working on,” “following up,” “in progress,” “discussed,” etc. The LLM simply acts as a pattern matcher with semantic understanding. It can recognize that “still working on the header” is conceptually similar to “in progress” and flag it accordingly. Also, for data‑processing tasks like this, we always disable “thinking” or “chain‑of‑thought” modes. Those are useful for complex reasoning but introduce unnecessary variability and output length in structured extraction tasks. Extra Challenges We Overcame Challenge 1: LLM unpredictability. Even with the temperature set to 0, LLMs can occasionally produce unexpected output. We added timeout limits to prevent the model from getting stuck in a loop, and we structured our prompts to request strictly formatted JSON that is easy to validate programmatically. Challenge 2: CPU processing speed. Processing 94 items across multiple LLM calls takes time. We solved this by running the AI agent as an overnight cron job, so speed is never a bottleneck. The manager arrives in the morning to a ready‑to‑review list. Why This Approach Matters for Enterprise Clients 1. 
Complete Data Sovereignty When you use on-premise Artificial Intelligence solutions, no data ever leaves your infrastructure. The LLM runs locally, the embedding model runs locally, and the historical database resides on your own PostgreSQL server. 2. No Vendor Lock‑In Cloud AI services change their pricing, deprecate models, or alter their APIs without notice. By using local AI agents and Ollama, you retain full control over the entire stack. Need to switch to a different model tomorrow? Just pull a new one and update the configuration. 3. Predictable Costs The only ongoing cost is the electricity to run the server. There are no per‑token API fees, no monthly subscriptions, and no surprise bills after a particularly busy month of processing. For organizations that process thousands of reports annually, the savings are substantial. 4. Customizable to Your Workflow Because we own the code, we can adapt the pipeline to fit your specific reporting format, integrate with your existing project management tools, and fine‑tune the prompts to match your industry’s terminology. This enables using AI for business process automation across diverse sectors, from construction to healthcare. From Manual Chore to Automated Precision Before, turning chaotic developer notes into clean reports meant choosing between tedious manual work and exposing sensitive data to cloud AI. Our private AI agent for document analysis offers a third way. Namely, secure, on‑premise automation. By combining Gemma 4 on standard CPU hardware with vector‑based duplicate detection and direct Jira enrichment, we’ve turned hours of monthly review into a hands‑off process. The system normalizes vague entries, filters out noise, and guarantees you never repeat a task description.
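The vector duplicate check from step 6 can be illustrated with a short sketch. This is a minimal sketch under stated assumptions, not the production code: the function names are hypothetical, embeddings are assumed to arrive as plain lists of floats, and cosine similarity is assumed as the similarity measure (the article only states the 0.85 threshold).

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold quoted in the article


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(candidate, history, threshold=SIMILARITY_THRESHOLD):
    """True if the candidate vector exceeds the threshold against any stored vector."""
    return any(cosine_similarity(candidate, old) > threshold for old in history)


def filter_duplicates(candidates, history, threshold=SIMILARITY_THRESHOLD):
    """Keep (text, vector) pairs that duplicate neither history nor earlier keeps."""
    kept = []
    seen = list(history)  # copy so the caller's history is not mutated
    for text, vector in candidates:
        if not is_duplicate(vector, seen, threshold):
            kept.append((text, vector))
            seen.append(vector)
    return kept


# Toy 2-dimensional vectors stand in for real embeddings.
history = [[1.0, 0.0]]
candidates = [
    ("near copy of a stored item", [0.99, 0.05]),
    ("new item", [0.0, 1.0]),
    ("near copy of the new item", [0.05, 0.99]),
]
kept = filter_duplicates(candidates, history)
print([text for text, _ in kept])
```

Comparing each candidate against already accepted candidates as well as the stored history is a design choice of this sketch; it also catches repeats inside a single month's batch.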
Search agents have become essential infrastructure for frontier language models, yet their development remains locked behind corporate walls. These systems need to handle a fundamentally difficult problem: given access to tools and a knowledge base, explore systematically, make smart decisions about which paths to pursue, and know when to pivot strategies. Unlike a human researcher who can draw on intuition and common sense, an LLM agent works from what it's learned during training, which means it needs explicit instruction in how to search well.

The practical stakes are high. Search agents power research tools, web-based reasoning systems, and complex information retrieval. But most breakthroughs happen inside companies with unlimited budgets. Academic researchers hit a wall: the techniques that work are proprietary, the datasets are private, and the computational resources required seem astronomical. This creates a frustrating bottleneck where innovation clusters around industrial research labs, leaving the broader research community unable to experiment, iterate, or contribute meaningfully to the field.

Why Industrial Pipelines Felt Inevitable

The prevailing wisdom emerged naturally from how major AI labs approached agent training. They borrowed techniques from large language model development: start with massive pre-training to build foundational knowledge, apply continuous pre-training to adapt that foundation to new domains, fine-tune on supervised examples to teach specific behaviors, then polish everything with reinforcement learning to optimize against reward signals. Each stage supposedly unlocks something the previous stage couldn't reach. The logic seemed bulletproof. If you want frontier-level capabilities, you need frontier-level methods and resources. Pre-training builds knowledge. Continuous pre-training specializes it. Supervised fine-tuning teaches specific skills. Reinforcement learning optimizes for actual performance.
Remove any link in this chain, and you'd expect degradation. This assumption led to a clear conclusion: building state-of-the-art search agents required industrial-scale infrastructure. Tongyi DeepResearch, for example, achieved strong performance through exactly this pipeline, spending enormous computational resources across all four optimization stages. For any academic team or resource-constrained organization, this seemed like an insurmountable barrier. The Dataset Design Revolution Then came a simpler observation: what if the bottleneck wasn't the algorithm, but what data you fed it? The researchers behind OpenSeeker-v2 noticed something crucial. Most work on agent training focused on optimization techniques, assuming the data was a fixed quantity. But what if the data itself could be fundamentally restructured? What if you could take the same training paradigm (simple supervised fine-tuning) and make it exponentially more powerful just by changing which trajectories you used as examples? This insight reframes the entire problem. Instead of asking "how do we squeeze more signal out of expensive optimization," ask "what makes a trajectory worth learning from?" Some trajectories teach the agent to think strategically. Others are lucky guesses that teach nothing. Some expose the agent to decision points where multiple tools could apply. Others are straightforward execution of a predetermined path. The team introduced three modifications to their data synthesis process, each targeting a specific dimension of training data quality. Scaling the knowledge graph means agents encounter richer search spaces during training. Instead of a small, constrained domain, they face larger graphs with more branches and exploration options. This prevents agents from memorizing solutions and forces them to develop genuine decision-making principles. A larger knowledge graph means each training trajectory involves more meaningful choices. 
Expanding the tool set requires agents to learn judgment. When an agent has only a few tools, it can succeed through trial and error on the same limited options. With a larger toolkit, the agent must actually reason about which tool fits which problem. This teaches generalization rather than reflexes. The agent learns principles of tool selection instead of pattern-matching to familiar scenarios. Strict low-step filtering focuses on trajectories that require careful planning rather than lucky guesses. A trajectory solving a problem in two steps teaches little about strategic reasoning. A trajectory requiring eight thoughtful steps teaches the agent to think systematically. By filtering strictly for solutions requiring multiple steps, researchers ensured every example was a lesson in strategic thinking, not an accident. Figure 1: OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on four representative benchmarks, remarkably accomplishing this via simple SFT and outperforming Tongyi DeepResearch that is trained via extensive optimization pipelines The result was deceptively small: 10.6k training examples. This number matters precisely because it seems impossible. A pre-trained language model might use billions of tokens. Industrial fine-tuning typically involves hundreds of thousands of examples. Yet 10.6k examples, when carefully structured around these three principles, proved sufficient to outperform systems trained with vastly more data and computational resources. Figure 2: Comparison of average tool call counts across search-agent training data, showing how OpenSeeker-v2 training forced more extensive exploration than baseline datasets Testing Against the Real Competition Theory means nothing without empirical validation. The team tested OpenSeeker-v2 against standardized benchmarks where it faced comparison with systems trained using industrial pipelines. 
On BrowseComp, a benchmark testing web search and reasoning about real-time information, OpenSeeker-v2 achieved 46.0% accuracy compared to Tongyi DeepResearch's 43.4%. On BrowseComp-ZH, the same benchmark in Chinese, the gap widened to 58.1% versus 46.7%, demonstrating superior generalization across languages. On Humanity's Last Exam, a genuinely difficult benchmark requiring deep reasoning, OpenSeeker-v2 scored 34.6% to Tongyi's 32.9%. On xbench, a comprehensive benchmark of search capabilities, the difference was 78.0% versus 75.0%. These aren't marginal victories achieved through luck or benchmark overfitting. They're consistent wins across diverse evaluation metrics, with particularly striking results on the multilingual benchmark. A 30B model trained only with supervised fine-tuning on 10.6k examples outperformed a system built with "heavy CPT+SFT+RL pipeline," to quote the paper's own comparison. The significance of this finding inverts the conventional hierarchy. In AI development, more resources usually beat fewer resources. Better optimization techniques usually beat simpler ones. Yet here, a system constrained by deliberate dataset curation beat a system built with computational abundance. This suggests the constraint wasn't actually computational or algorithmic at all. It was conceptual, understanding what makes training data actually teach something valuable. Why This Opens Doors for Everyone The deeper implication extends beyond OpenSeeker-v2's specific numbers. The research demonstrates that a different path to frontier capabilities exists. Industrial teams with unlimited budgets can always outspend competitors. But a discovery that "data curation beats computational resources" shifts the entire economic structure of AI development. If you're thoughtful about which 10,000 examples you use, you don't need billion-dollar infrastructure. You need domain expertise, careful thinking, and clear principles about dataset design. 
This is something accessible to academic teams, startups, and researchers in resource-constrained regions. The work also sits in a broader context. Earlier approaches like OpenResearcher explored fully open pipelines for agent research, while Points Seeker examined multimodal search agents. OpenSeeker-v2's contribution is orthogonal: it shows that even within simpler architectures and paradigms, strategic dataset design enables frontier performance. This connects to broader observations about deep information seeking, suggesting that search capability improvements come from better data and clearer reasoning structures, not just more compute. Accessibility matters here because it enables reproducibility. Unlike industrial systems trained with proprietary methods on private data, OpenSeeker-v2 is open-sourced with transparent methodology. The community can examine it, build on it, and improve the dataset design principles. This creates a feedback loop where the field collectively discovers what makes training data valuable. The research also opens new questions. Can these curation principles apply to other domains beyond search agents? Does data quality multiply the efficiency of any LLM training task? Could other research groups develop improved versions of OpenSeeker-v2 by applying fresh insights about trajectory design? These questions now seem answerable rather than theoretical. Most importantly, the work reshapes how the field thinks about scaling. Sometimes the bottleneck in AI development isn't algorithmic innovation or computational power. It's understanding what signal matters most. OpenSeeker-v2 teaches that lesson in a way the broader research community can actually apply, not as a one-off engineering achievement but as a principle about how to think about training data.
We all have that daily routine: opening a dozen browser tabs to check the health and progress of our favorite open-source projects. For me, it’s keeping a close eye on rapidly evolving ecosystems like Docling and the watsonx Agent Development Kit (ADK). Eventually, the manual refreshing had to stop. I decided to build a custom application to automate this workflow, or more accurately, a dedicated Agent.

Before you write off “Agent” as just another industry buzzword, consider this: true agency isn’t just about complex LLM reasoning; it’s about autonomous execution. An agent bridges the gap between manual human effort and automated consistency, stepping in to handle what used to require our click-by-click attention. Here is how I built an automated companion to keep my pulse on the tech stacks that matter: by taking over the repetitive task of repository tracking, this tool operates as a functional agent in my development ecosystem. In this post, I’ll break down how it works and how you can implement it.

Implementation

In the following sections, I’ll walk through the building blocks of the agent.

Building Blocks: The Tech Stack

To keep the footprint light, local, and efficient, the tool is built on a streamlined, minimal-dependency stack:

Python 3: Handles the core application logic, parsing repository data, and orchestrating updates.
SQLite: Acts as a lightweight, serverless database engine to persist repository states and track changes between runs.
Bash: Bridges the application and the operating system, wrapping the execution logic into a clean, reproducible script.
macOS & cron: Leverages native system utilities to handle automation and schedule regular execution intervals without relying on heavy third-party orchestrators.
The Core Application

Markdown
github-check/
├── github_monitor.py          # Main monitoring application
├── web_viewer.py              # Web dashboard application (Flask)
├── github_monitor.db          # SQLite database (auto-created)
├── requirements.txt           # Python dependencies (requests, flask)
├── .gitignore                 # Git ignore rules (filters .env, _* folders)
├── .gitattributes             # Git attributes configuration
├── LICENSE                    # Project license
├── README.md                  # User documentation with diagrams
│
├── Docs/
│   ├── Architecture.md        # This file - Technical architecture
│   └── WebViewer.md           # Web dashboard documentation
│
├── scripts/
│   ├── schedule_monitor.sh    # Cron scheduler script
│   ├── github-push.sh         # Git push automation script
│   ├── killer-port.sh         # Port management utility
│   └── hard-killer-port.sh    # Force kill port utility
│
├── input/
│   └── repositories.txt       # Repository list (owner/repo format)
│
├── output/
│   ├── logs/                  # Execution logs (from cron)
│   │   └── YYYYMMDD_HHMMSS_monitor.log
│   └── YYYYMMDD_HHMMSS_report.txt  # Generated reports
│
├── templates/
│   └── index.html             # Web dashboard HTML template
│
└── static/
    ├── css/
    │   └── style.css          # Dashboard styles (dark theme)
    └── js/
        └── app.js             # Dashboard JavaScript (Chart.js, API calls)

Core Initialization and State Management

The application uses an object-oriented approach via the GitHubMonitor class. Upon instantiation, it handles its own database initialization using sqlite3. It creates two core tables, repositories and updates, with indexes on frequently queried fields (repo_name and update_timestamp) to keep lookups quick as your monitored list grows.

Python
def _init_database(self):
    """Initialize SQLite database with required schema."""
    conn = sqlite3.connect(self.db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS repositories (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            repo_name TEXT UNIQUE NOT NULL,
            first_checked_at TEXT NOT NULL,
            last_checked_at TEXT NOT NULL
        )
    ''')

    # ... updates table creation omitted for brevity ...

    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_repo_name
        ON repositories(repo_name)
    ''')

    conn.commit()
    conn.close()

Resilient API Communication

To interface with GitHub, the application uses a persistent requests.Session(). It works unauthenticated, and when a personal access token (GITHUB_TOKEN) is set in the environment, it adds the token to the session headers to get GitHub's higher authenticated rate limit. It also includes explicit HTTP status error handling (like 403 for rate limits and 404 for missing repos) alongside network timeout guards.

Python
self.github_token = os.getenv('GITHUB_TOKEN')  # Optional: for higher rate limits
self.session = requests.Session()
if self.github_token:
    self.session.headers.update({'Authorization': f'token {self.github_token}'})

# ... Inside _get_repo_info ...
response = self.session.get(url, timeout=10)
if response.status_code == 200:
    return response.json()
elif response.status_code == 403:
    print(f"✗ Rate limit exceeded. Consider using GITHUB_TOKEN environment variable.")
    return None

Delta Detection Logic

The core engine reads target repositories from a flat file (ignoring comments and whitespace) and loops through them. For each repository, it extracts the API’s pushed_at timestamp. It then checks the database to determine if the repository is brand new or if the remote timestamp differs from the last_checked state inside the DB, validating it against a configurable sliding time window (check_days).
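Two helpers mentioned here are not shown in the article's excerpts: reading the flat repository list and the sliding-window check. A minimal sketch of how they might look follows. The function names and signatures are hypothetical, comments are assumed to be whole lines starting with `#`, and `pushed_at` is assumed to be an ISO 8601 UTC string such as `2024-05-01T12:00:00Z`.

```python
from datetime import datetime, timedelta, timezone


def parse_repo_list(lines):
    """Return owner/repo entries, skipping blank lines and '#' comment lines."""
    repos = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            repos.append(entry)
    return repos


def has_recent_update(pushed_at, check_days=1, now=None):
    """True if pushed_at falls within the last check_days days."""
    now = now or datetime.now(timezone.utc)
    # Swap the trailing 'Z' for an explicit offset so fromisoformat()
    # also accepts the value on Python versions older than 3.11.
    pushed = datetime.fromisoformat(pushed_at.replace("Z", "+00:00"))
    return now - pushed <= timedelta(days=check_days)


# Example with a fixed "now" so the result does not depend on the clock.
fixed_now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
repos = parse_repo_list(["# tracked projects", "", "  owner/repo-a  ", "owner/repo-b"])
print(repos)
print(has_recent_update("2024-05-02T00:00:00Z", check_days=1, now=fixed_now))
```

Passing `now` explicitly is only there to make the window check easy to test; a real run would rely on the default.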
Python
# Check if repo is in database
exists, repo_id, last_checked = self._is_repo_in_db(repo_name)

if not exists:
    # First time seeing this repo
    repo_id = self._add_repository(repo_name, pushed_at)
    self._log_update(repo_id, repo_name, pushed_at, is_first_run=True)
else:
    # Check if there's a recent update and if it's a new update since last check
    if self._has_recent_update(pushed_at):
        if pushed_at != last_checked:
            self._log_update(repo_id, repo_name, pushed_at, is_first_run=False)
            print(f" UPDATE DETECTED!")

Automated Auditing and Reporting

Beyond the real-time stdout logs, the application aggregates its tracked state into a clean, markdown-style historical report. It runs a SQL join to count the updates per repository and isolates the latest ten global changes. The system automatically creates a dedicated output/ directory and writes time-stamped files so that snapshots are preserved for long-term auditing.

Python
# Get all repositories with aggregated update counts
cursor.execute('''
    SELECT r.repo_name, r.first_checked_at, r.last_checked_at,
           COUNT(u.id) as update_count
    FROM repositories r
    LEFT JOIN updates u ON r.id = u.repo_id
    GROUP BY r.id
    ORDER BY r.repo_name
''')

# ... Report file generation ...
if output_file:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    output_path = f"output/{timestamp}_{output_file}"
    os.makedirs("output", exist_ok=True)
    with open(output_path, 'w') as f:
        f.write(report)

The Bash Script

Below is the schedule_monitor.sh bash script, which prepares, executes, and maintains the automated tracking application.

Dynamic Path Resolution

Instead of relying on rigid, hardcoded absolute paths, the script begins by dynamically resolving its own location on the filesystem. By using dirname and Bash's BASH_SOURCE variable, it anchors itself securely to the project layout.
This ensures that no matter where the cron daemon triggers the script from, it can always accurately find the target Python application (github_monitor.py) and establish a consistent execution working directory. Automated Logging and Diagnostics Because a background cron job runs without a visual terminal (stdout), tracking down execution errors requires an audit trail. The script handles this by isolating a dedicated logs directory (output/logs) and utilizing a date-and-time string (date +"%Y%m%d_%H%M%S") to generate a unique file for every single runtime iteration. It appends clear timestamp banners marking exactly when a check started and concluded. Environment Validation and Execution Before attempting to launch the monitor, the script safely checks the host machine’s environment for valid runtimes. It runs a quiet check (command -v) to see if python3 or a fallback python command is accessible. If a Python binary is found, it triggers the underlying script, passing down the configurable time-window argument (--days 1) while explicitly routing both standard output and potential error stack traces (2>&1) straight into the active log file. Self-Cleaning Log Retention Running automated tasks indefinitely carries the risk of slowly cluttering local storage with thousands of historical text files. To enforce clean housekeeping, the script concludes its run with an automated garbage-collection routine. It uses the native Unix find command to scan the log directory, isolates any tracking logs older than 30 days (-mtime +30), and automatically purges them from the system. 
Shell
#!/bin/bash
# GitHub Repository Monitor Scheduler
# This script can be used with cron to schedule regular checks

# Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
PYTHON_SCRIPT="$PROJECT_DIR/github_monitor.py"
LOG_DIR="$PROJECT_DIR/output/logs"
CHECK_DAYS=1

# Create log directory if it doesn't exist
mkdir -p "$LOG_DIR"

# Generate timestamp for log file
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/${TIMESTAMP}_monitor.log"

# Run the monitor and log output
echo "=== GitHub Monitor Run: $(date) ===" >> "$LOG_FILE"
cd "$PROJECT_DIR" || exit 1

# Check if Python 3 is available
if command -v python3 &> /dev/null; then
    PYTHON_CMD="python3"
elif command -v python &> /dev/null; then
    PYTHON_CMD="python"
else
    echo "Error: Python not found" >> "$LOG_FILE"
    exit 1
fi

# Run the monitor
$PYTHON_CMD "$PYTHON_SCRIPT" --days "$CHECK_DAYS" >> "$LOG_FILE" 2>&1

# Log completion
echo "=== Completed: $(date) ===" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"

# Optional: Keep only last 30 days of logs
find "$LOG_DIR" -name "*.log" -type f -mtime +30 -delete

exit 0
# Made with Bob

TL;DR: How to Make a Cron Job on a macOS Machine?

There are several ways to do this on macOS (my machine).

The Modern macOS Way (launchd)

launchd uses .plist (XML) files to manage schedules. It feels a bit wordier than cron, but it’s the most reliable method on a Mac.

Create a .plist file: open your terminal or a text editor and create a file in ~/Library/LaunchAgents/. Let's call it com.user.myjob.plist.
Add the configuration: paste the following XML into the file. This example is set to run a script every day at 10:30 PM (22:30).
XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.user.myjob</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/yourusername/scripts/myscript.sh</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>22</integer>
        <key>Minute</key>
        <integer>30</integer>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/myjob.out</string>
    <key>StandardErrorPath</key>
    <string>/tmp/myjob.err</string>
</dict>
</plist>

Load and start the job: in the Terminal, tell macOS to look at the new file and start scheduling it:

Shell
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.user.myjob.plist

If you need to stop or unload the job, run:

launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.user.myjob.plist

The Classic Way (cron)

If you prefer the classic Linux/Unix crontab style because you already know the syntax, macOS can still do it.

Open the crontab editor (in the terminal; you’ll get something like vim):

Shell
crontab -e

Add your cron syntax: add the job using the standard five-field cron format. For example, to run a script every day at midnight:

Shell
0 0 * * * /Users/yourusername/scripts/myscript.sh

Save and exit!

The Crucial macOS Step for Cron

Because of macOS security restrictions, cron will often fail silently because it doesn’t have permission to access your files. You have to grant it access:

Open System Settings > Privacy & Security > Full Disk Access.
Click the + icon.
Press Cmd + Shift + G and type /usr/sbin/cron, then hit enter.
Toggle the switch to On for cron.

Which one should you choose?

Use launchd if you want your job to reliably run even if your MacBook lid was closed/asleep at the exact minute it was scheduled to trigger. Use cron if you just need something quick and familiar for a desktop Mac that is always awake.
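If you would rather generate the launchd job file than hand-edit XML, Python's standard plistlib module can write the same structure. The following is a minimal sketch, assuming you want to schedule the project's schedule_monitor.sh; the label, the script path, and the helper name build_launchd_plist are all hypothetical placeholders.

```python
import plistlib


def build_launchd_plist(label, script_path, hour, minute):
    """Return launchd job XML (as bytes) that runs script_path daily at hour:minute."""
    job = {
        "Label": label,
        "ProgramArguments": [script_path],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
        "StandardOutPath": f"/tmp/{label}.out",
        "StandardErrorPath": f"/tmp/{label}.err",
    }
    return plistlib.dumps(job)  # XML is plistlib's default output format


xml_bytes = build_launchd_plist(
    "com.user.githubmonitor",  # hypothetical label
    "/Users/yourusername/github-check/scripts/schedule_monitor.sh",  # hypothetical path
    22,
    30,
)
print(xml_bytes.decode("utf-8"))
```

Writing xml_bytes to a file under ~/Library/LaunchAgents/ and loading it with launchctl bootstrap, as shown earlier, would complete the setup.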
The Database (SQLite)
The repositories Table
This table acts as the registry for the GitHub repositories you choose to track. It records when a repository was first introduced to the monitor and mirrors its remote state by tracking the latest push timestamp.
id (INTEGER PRIMARY KEY AUTOINCREMENT): Unique internal identifier for each repository, used as the primary key.
repo_name (TEXT UNIQUE NOT NULL): The full GitHub identifier in the owner/repository format (e.g., IBM/watsonx-adk or DSUR/docling). The UNIQUE constraint guarantees that a repository cannot be duplicated in the registry.
first_checked_at (TEXT NOT NULL): An ISO 8601 UTC timestamp capturing the exact moment the repository was first indexed by your application.
last_checked_at (TEXT NOT NULL): Stores the latest pushed_at timestamp fetched from the GitHub API. This field is overwritten whenever a new delta/update is detected, serving as the benchmark for future comparisons.
The updates Table
This table functions as a historical append-only ledger. Every time the tool encounters a change (or indexes a repository for the first time), it appends a record here, creating a reliable audit trail of project activity.
id (INTEGER PRIMARY KEY AUTOINCREMENT): Unique identifier for each specific update record.
repo_id (INTEGER NOT NULL): Foreign key referencing repositories(id), establishing a 1:N relationship (one repository can have many logged updates).
repo_name (TEXT NOT NULL): Denormalized repository name to allow quick querying of logs without mandatory joins.
update_timestamp / pushed_at (TEXT NOT NULL): The pushed_at timestamp provided directly by the GitHub API, indicating when the remote change actually occurred.
check_timestamp (TEXT NOT NULL): An ISO 8601 UTC timestamp capturing when your local agent executed and caught the update.
is_first_run (BOOLEAN NOT NULL): A flag (0 or 1) tracking whether this log entry represents the initial discovery of the repository or a subsequent update.
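To make the two-table design concrete, here is a minimal sketch of the schema and the delta check it supports, using only the standard-library sqlite3 module. It is an illustration under assumptions, not the project's actual code: the function record_push is hypothetical, the column is named update_timestamp (the text above lists it as update_timestamp / pushed_at), and the timestamps are made-up sample values.

```python
import sqlite3
from datetime import datetime, timezone

# Schema reconstructed from the column descriptions above (assumed DDL).
SCHEMA = """
CREATE TABLE IF NOT EXISTS repositories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo_name TEXT UNIQUE NOT NULL,
    first_checked_at TEXT NOT NULL,
    last_checked_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS updates (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo_id INTEGER NOT NULL,
    repo_name TEXT NOT NULL,
    update_timestamp TEXT NOT NULL,
    check_timestamp TEXT NOT NULL,
    is_first_run BOOLEAN NOT NULL,
    FOREIGN KEY (repo_id) REFERENCES repositories(id)
);
"""


def record_push(conn, repo_name, pushed_at):
    """Append a ledger row if pushed_at is new; return True when a row was logged."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT id, last_checked_at FROM repositories WHERE repo_name = ?",
        (repo_name,),
    ).fetchone()
    if row is None:
        # First discovery: register the repository.
        cur = conn.execute(
            "INSERT INTO repositories (repo_name, first_checked_at, last_checked_at) "
            "VALUES (?, ?, ?)",
            (repo_name, now, pushed_at),
        )
        repo_id, first_run = cur.lastrowid, 1
    elif row[1] == pushed_at:
        return False  # nothing new since the last check
    else:
        # Delta detected: overwrite the benchmark timestamp.
        repo_id, first_run = row[0], 0
        conn.execute(
            "UPDATE repositories SET last_checked_at = ? WHERE id = ?",
            (pushed_at, repo_id),
        )
    conn.execute(
        "INSERT INTO updates (repo_id, repo_name, update_timestamp, check_timestamp, is_first_run) "
        "VALUES (?, ?, ?, ?, ?)",
        (repo_id, repo_name, pushed_at, now, first_run),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.executescript(SCHEMA)
print(record_push(conn, "docling-project/docling", "2025-01-01T10:00:00Z"))  # first discovery
print(record_push(conn, "docling-project/docling", "2025-01-01T10:00:00Z"))  # unchanged
print(record_push(conn, "docling-project/docling", "2025-01-02T08:30:00Z"))  # new push
```

The repositories row is overwritten in place while the updates table only ever grows, which is what makes the latter usable as an audit trail.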
Relationship Diagram
The database structure relies on standard relational integrity: each row in repositories (id) is referenced by any number of rows in updates (repo_id), a 1:N relationship.
Optimization Indexes
To prevent execution slowdowns as your tracking history grows over months of automated cron cycles, the database explicitly initializes two performance indexes:
idx_repo_name on repositories(repo_name): Keeps repository names in sorted order. This ensures that when the application calls _is_repo_in_db() to check if a project exists, SQLite performs an O(log n) index lookup instead of an expensive O(n) full-table scan.
idx_update_timestamp on updates(update_timestamp): Optimizes time-series queries, sorting updates by their timestamps to speed up reports or dashboards isolating recent changes.
Data Storage Details
Serverless and Local: Because SQLite is an in-process library, the entire database is stored as a single, ordinary cross-platform file (github_monitor.db) directly within your project directory.
Dynamic Typing (Storage Classes): SQLite uses dynamic type affinity. While the schema declares standard SQL types like TEXT and BOOLEAN, dates are stored as ISO 8601 text strings. SQLite has no separate boolean storage class, so booleans are stored as integers (0 for false, 1 for true).
The User Interface to Monitor the Results and Access the Repositories
Markdown
# web_viewer.py
Flask App
├── Routes
│   ├── index() -> Dashboard HTML
│   ├── get_stats() -> Statistics JSON
│   ├── get_repositories() -> Repositories JSON
│   ├── get_updates() -> Updates JSON
│   ├── get_timeline() -> Timeline JSON
│   └── get_repository_details(id) -> Repository JSON
│
├── Utilities
│   ├── get_db_connection() -> SQLite connection
│   └── format_timestamp() -> Formatted date string
│
└── Configuration
    ├── DB_PATH = 'github_monitor.db'
    ├── HOST = '127.0.0.1'
    └── PORT = 5001
Beyond the headless automation, the application features a clean, intuitive UI that serves as your central command center. This dashboard provides a clear visual overview of every repository currently being tracked by the agent.
Instead of parsing raw database rows, you can audit your entire tech stack at a glance and see exactly what’s under watch. Even better, it collapses the distance between discovery and action: with a single click inside the UI, you can jump directly to any chosen repository on GitHub the moment you want to investigate a new change. Python #!/usr/bin/env python3 """ GitHub Monitor Web Viewer A simple Flask-based web application to visualize SQLite database data. """ from flask import Flask, render_template, jsonify import sqlite3 from datetime import datetime import os app = Flask(__name__) # Configuration DB_PATH = 'github_monitor.db' def get_db_connection(): """Create a database connection.""" conn = sqlite3.connect(DB_PATH) conn.row_factory = sqlite3.Row return conn def format_timestamp(ts_str): """Format ISO timestamp to readable format.""" try: if 'T' in ts_str: dt = datetime.fromisoformat(ts_str.replace('Z', '+00:00')) return dt.strftime('%Y-%m-%d %H:%M:%S UTC') return ts_str except: return ts_str @app.route('/') def index(): """Main dashboard page.""" return render_template('index.html') @app.route('/api/stats') def get_stats(): """Get overall statistics.""" conn = get_db_connection() cursor = conn.cursor() # Total repositories cursor.execute('SELECT COUNT(*) as count FROM repositories') total_repos = cursor.fetchone()['count'] # Total updates cursor.execute('SELECT COUNT(*) as count FROM updates') total_updates = cursor.fetchone()['count'] # Updates today cursor.execute(''' SELECT COUNT(*) as count FROM updates WHERE date(check_timestamp) = date('now') ''') updates_today = cursor.fetchone()['count'] # Most active repository cursor.execute(''' SELECT repo_name, COUNT(*) as update_count FROM updates GROUP BY repo_name ORDER BY update_count DESC LIMIT 1 ''') most_active = cursor.fetchone() conn.close() return jsonify({ 'total_repos': total_repos, 'total_updates': total_updates, 'updates_today': updates_today, 'most_active': dict(most_active) if most_active else 
None }) @app.route('/api/repositories') def get_repositories(): """Get all repositories with their update counts.""" conn = get_db_connection() cursor = conn.cursor() cursor.execute(''' SELECT r.id, r.repo_name, r.first_checked_at, r.last_checked_at, COUNT(u.id) as update_count FROM repositories r LEFT JOIN updates u ON r.id = u.repo_id GROUP BY r.id ORDER BY r.repo_name ''') repos = [] for row in cursor.fetchall(): repos.append({ 'id': row['id'], 'repo_name': row['repo_name'], 'first_checked_at': format_timestamp(row['first_checked_at']), 'last_checked_at': format_timestamp(row['last_checked_at']), 'update_count': row['update_count'] }) conn.close() return jsonify(repos) @app.route('/api/updates') def get_updates(): """Get recent updates.""" limit = 50 conn = get_db_connection() cursor = conn.cursor() cursor.execute(''' SELECT id, repo_name, update_timestamp, check_timestamp, is_first_run FROM updates ORDER BY check_timestamp DESC LIMIT ? ''', (limit,)) updates = [] for row in cursor.fetchall(): updates.append({ 'id': row['id'], 'repo_name': row['repo_name'], 'update_timestamp': format_timestamp(row['update_timestamp']), 'check_timestamp': format_timestamp(row['check_timestamp']), 'is_first_run': bool(row['is_first_run']) }) conn.close() return jsonify(updates) @app.route('/api/repository/<int:repo_id>') def get_repository_details(repo_id): """Get detailed information about a specific repository.""" conn = get_db_connection() cursor = conn.cursor() # Get repository info cursor.execute('SELECT * FROM repositories WHERE id = ?', (repo_id,)) repo = cursor.fetchone() if not repo: conn.close() return jsonify({'error': 'Repository not found'}), 404 # Get updates for this repository cursor.execute(''' SELECT * FROM updates WHERE repo_id = ? 
ORDER BY check_timestamp DESC ''', (repo_id,)) updates = [] for row in cursor.fetchall(): updates.append({ 'id': row['id'], 'update_timestamp': format_timestamp(row['update_timestamp']), 'check_timestamp': format_timestamp(row['check_timestamp']), 'is_first_run': bool(row['is_first_run']) }) conn.close() return jsonify({ 'repository': { 'id': repo['id'], 'repo_name': repo['repo_name'], 'first_checked_at': format_timestamp(repo['first_checked_at']), 'last_checked_at': format_timestamp(repo['last_checked_at']) }, 'updates': updates }) @app.route('/api/timeline') def get_timeline(): """Get update timeline data for visualization.""" conn = get_db_connection() cursor = conn.cursor() cursor.execute(''' SELECT date(check_timestamp) as date, COUNT(*) as count FROM updates GROUP BY date(check_timestamp) ORDER BY date DESC LIMIT 30 ''') timeline = [] for row in cursor.fetchall(): timeline.append({ 'date': row['date'], 'count': row['count'] }) conn.close() return jsonify(timeline) if __name__ == '__main__': if not os.path.exists(DB_PATH): print(f"Error: Database file '{DB_PATH}' not found!") print("Please run github_monitor.py first to create the database.") exit(1) print("=" * 60) print("GitHub Monitor Web Viewer") print("=" * 60) print(f"Database: {DB_PATH}") print("Starting server...") print("Open your browser at: http://localhost:5001") print("Press Ctrl+C to stop") print("=" * 60) # Use port 5001 to avoid the macOS AirPlay Receiver conflict on port 5000 app.run(debug=True, host='127.0.0.1', port=5001) # Made with Bob
So, in the end, we get:
Centralized watchlist: View all monitored repositories instantly in a clean, human-readable dashboard rather than querying the SQLite tables directly.
One-click navigation: Every tracked repository in the UI functions as an active shortcut: clicking a project takes you directly to its GitHub page to review the latest commits or releases.
Configured via Plain Text: Simple and Source-Controlled The repository watchlist is intentionally kept detached from the core code, stored in a flat, human-readable text file named repositories.txt. This design embraces a "configuration-as-code" philosophy: you don't need to write SQL queries or modify Python variables just to change what you track. You simply list the targets in a standard owner/repo format, one per line. The application’s parser is built to be forgiving and clean, automatically skipping empty lines and stripping out any lines prefixed with a #. This allows you to organize your watchlist with custom sections, leave developer notes, or temporarily comment out a project without losing track of it. Markdown # GitHub Repositories to Monitor # Format: owner/repo (one per line) # Lines starting with # are comments and will be ignored # Example repositories for testing: torvalds/linux microsoft/vscode python/cpython # Add your repositories below: docling-project/docling ibm/ibm-watsonx-orchestrate-adk ibm/mcp-context-forge generative-computing/mellea containers/podman podman-desktop/podman-desktop Conclusion: From Concept to Production in 30 Minutes What started as a simple, repetitive kind of daily habit — manually refreshing browser tabs to check for updates on critical frameworks like Docling and the watsonx Agent Development Kit — has been transformed into a fully automated, local developer ecosystem. By decoupling the watchlist into a frictionless, plain-text configuration file and leveraging a robust Python engine paired with an internal SQLite state ledger, the project eliminates human overhead entirely. With an OS-native cron scheduler handling the heavy lifting in the background and a sleek user interface providing one-click navigation to the source, the tool serves as a functional, autonomous agent that keeps my development workflow perfectly synchronized with the open-source world. 
The most remarkable aspect of this project, however, wasn’t just the architecture; it was the velocity. By collaborating with IBM Bob as an AI-driven development partner, the entire lifecycle of this tool moved from ideation to a production-ready implementation in exactly 30 minutes. From initializing the database schemas and crafting resilient API delta logic to wrapping the application in a self-cleaning bash scheduler, Bob industrialized the code creation process seamlessly. It is a powerful testament to how modern, spec-driven prototyping can compress days of development overhead into a single focused, half-hour session, delivering immediate architectural value without the bloat. That’s a wrap!
Links
Blog post code repository: https://github.com/aairom/github-check
IBM Bob: https://bob.ibm.com/
Building scalable data systems often feels like navigating an endless sea of shifting paradigms. Engineers and architects are constantly forced to choose between centralizing data or distributing it, processing in batches or streaming in real time, and enforcing strict compliance or enabling rapid self-service analytics. Without a structured taxonomy, engineering teams risk building fragmented pipelines that accumulate technical debt. The following comprehensive blueprint serves as a definitive Data Patterns and Practices Library to help you align your infrastructure with proven engineering methodologies.
Architectural Patterns
Data lake: A centralized repository that allows storing structured and unstructured data at any scale, enabling raw data storage for various analytics purposes.
Data warehouse: A large, centralized repository for storing and managing structured data, optimized for high-performance analytics and reporting.
Lambda architecture: A data processing architecture that combines batch and stream processing for fault-tolerant, scalable, and real-time data analytics.
Kappa architecture: A data processing architecture that simplifies Lambda Architecture by only using stream processing for both real-time and historical data.
Microservices architecture: A design approach that structures applications as a collection of small, independently deployable services, allowing for greater flexibility and scalability.
Event-driven architecture: A software design pattern that promotes the production, detection, and reaction to events, enabling loose coupling and high scalability in distributed systems.
Polyglot persistence architecture: A data storage strategy that uses multiple types of databases to store and manage data according to its specific needs.
Data mesh: A decentralized approach to data architecture focusing on domain-oriented data ownership, self-serve data infrastructure, and product-oriented data delivery.
Data vault: A hybrid data modeling and storage methodology that combines aspects of 3NF and star schema to create a scalable, flexible, and auditable solution.
Streaming-first: An approach that prioritizes real-time data processing and analysis utilizing event streaming technologies.
Storage Patterns
Sharding: A method of distributing data across multiple database servers to improve performance and scalability.
Partitioning: The process of dividing a large table into smaller, more manageable pieces to improve query performance.
Replication: The process of copying data from one database to another to ensure availability, redundancy, and load balancing.
Federated storage: A storage architecture that integrates multiple storage systems under a unified management framework.
Object storage: A scalable architecture that manages data as objects rather than files or blocks, providing high performance for unstructured data.
Columnar storage: A format that stores data by column rather than row, which is particularly suited for analytics workloads.
Time-series: A specialized storage system designed to handle time-stamped data, such as sensor data or stock prices, efficiently.
Graph storage: A system optimized for storing and querying graph data, representing entities and their relationships in an interconnected structure.
In-memory storage: A storage architecture that stores data in RAM instead of on disk for significantly faster processing.
Hybrid storage: A solution that combines different storage types, such as on-premises and cloud, to optimize cost and performance.
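As a small illustration of the sharding pattern from the list above, the sketch below routes records to one of several shards by hashing the record key. It is a deliberately simplified, assumption-laden example: the function shard_for, the shard count of four, and the customer IDs are all hypothetical, and real systems typically add rebalancing strategies such as consistent hashing.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash.

    SHA-256 is used instead of Python's built-in hash(), which is
    randomized per process for strings and so is not stable across runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4  # hypothetical cluster size
shards = {i: [] for i in range(NUM_SHARDS)}

# Hypothetical record keys; each one always lands on the same shard.
for customer_id in ["cust-001", "cust-002", "cust-003", "cust-004", "cust-005"]:
    shards[shard_for(customer_id, NUM_SHARDS)].append(customer_id)

print(shards)
```

Because the mapping depends on the shard count, changing NUM_SHARDS moves most keys to a different shard, which is the main trade-off of this simple modulo scheme.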
Integration Patterns
Extract, transform, load (ETL): A process of extracting data from source systems, transforming it, and loading it into a target system.
Extract, load, transform (ELT): A variation of ETL where data is first loaded into the target system and then transformed using the target's processing power.
Change data capture (CDC): A technique for capturing and processing changes in source data to enable incremental updates to target systems.
Data federation: A technique for integrating data from disparate sources without physically moving or copying it, providing a unified view.
Data virtualization: An approach that abstracts underlying data sources, allowing users to access and manipulate data without knowing its physical location.
Data replication: The process of copying data from one database to another to ensure data availability and redundancy.
Data synchronization: The process of keeping data in multiple locations consistent and up-to-date by propagating changes.
Data preparation: The process of cleaning, transforming, and enriching data to make it suitable for analysis or processing.
Publish/subscribe: A messaging pattern that decouples data producers and consumers using an intermediary message broker.
Request/reply pattern: A messaging pattern where a data consumer sends a request and waits for a response, allowing for synchronous communication.
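The publish/subscribe pattern from the list above can be sketched with a tiny in-memory broker. This is only a toy illustration of the decoupling idea: the Broker class, the "payments" topic, and the sample message are hypothetical, and a production system would use a real message broker with persistence and delivery guarantees.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory message broker: producers and consumers only share a topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callable that receives every message published to topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver message to all current subscribers; return how many received it."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
audit_log, fraud_queue = [], []

# Two independent consumers subscribe to the same topic.
broker.subscribe("payments", audit_log.append)
broker.subscribe("payments", fraud_queue.append)

# The producer does not know who, if anyone, is listening.
delivered = broker.publish("payments", {"id": 1, "amount": 250})
print(delivered, audit_log, fraud_queue)
```

In contrast to request/reply, the publisher here never waits for an answer; it only hands the message to the broker.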
Data Analytics
Descriptive analytics: The analysis of historical data to understand past events and trends, often presented through reports or dashboards.
Diagnostic analytics: The process of examining data to determine the causes of past events using techniques like data mining or correlations.
Predictive analytics: The use of data, statistical algorithms, and machine learning to predict future events based on historical data.
Prescriptive analytics: The process of recommending actions or decisions based on data analysis using optimization or simulation algorithms.
Real-time analytics: The analysis of data as it is generated or received to provide immediate insights and rapid decision-making.
Batch analytics: The processing and analysis of large volumes of data in batches, often scheduled at regular intervals.
Text analytics: The process of extracting meaningful information from unstructured text using natural language processing.
Geospatial analytics: The analysis of geographically referenced data to interpret spatial relationships and patterns.
Sentiment analytics: A technique using NLP to determine the sentiment or emotion expressed in textual data.
Network analytics: The analysis of network data to uncover patterns and interactions between nodes (entities) in a network.
Data Management
Master data management (MDM): The process of creating a single, authoritative source of truth for critical business data.
Reference data management (RDM): The practice of managing shared data (like codes or categories) used across multiple systems for consistency.
Metadata management: The process of creating and maintaining data about data to facilitate discovery and governance.
Data catalog: A searchable inventory of an organization's data assets, including datasets and reports.
Data lineage: The practice of tracking the flow of data through systems, including its origin and transformations.
Data versioning: The process of tracking and managing changes to data over time for recovery and auditing.
Data provenance: The process of documenting the origin, history, and processing of data to ensure trustworthiness and traceability.
Data lifecycle management: A comprehensive approach to managing data from creation to archival or deletion.
Data virtualization: A technique that abstracts underlying data sources to allow access without knowledge of physical location or structure.
Data profiling: The process of assessing data quality by collecting statistics and identifying patterns or anomalies.
Data Governance
Data stewardship: The practice of overseeing an organization's data to ensure quality, consistency, and compliance.
Data quality management: The process of measuring and improving the accuracy, completeness, and consistency of data.
Data policy management: The development and enforcement of standards and procedures that govern data use.
Data classification: The process of categorizing data based on sensitivity or risk to implement appropriate security measures.
Data retention and archival: Defining policies for storing and disposing of data based on legal and business requirements.
Data privacy compliance: Ensuring data practices adhere to laws and regulations like GDPR or CCPA.
Data lineage and provenance: Tracking the origin and flow of data through systems to ensure accuracy and compliance.
Data cataloging and discovery: Maintaining a searchable repository that provides an inventory of an organization's data assets.
Data risk management: Identifying and mitigating data-related risks such as breaches or corruption.
Data ownership: Assigning accountability for data assets to specific individuals or teams to ensure proper management.
Data Security
Data encryption: Encoding data to protect it from unauthorized access both at rest and in transit.
Data masking: Obscuring sensitive data by replacing it with fictitious data to prevent exposure to unauthorized users.
Data tokenization: Substituting sensitive data with non-sensitive tokens while still enabling some operations and analytics.
Data access control: Defining policies that determine who can access or modify data based on roles and security requirements.
Data auditing: Monitoring and recording data activities to detect unauthorized access or compliance violations.
Data anonymization: Removing personally identifiable information (PII) from datasets to protect individual privacy.
Data pseudonymization: Replacing sensitive data with artificial identifiers to reduce re-identification risk.
Data security monitoring: Continuously analyzing systems and networks for potential security threats or breaches.
Data activity monitoring: Continuous analysis of database transactions to detect unauthorized access or policy violations.
Data loss prevention: Tools and practices designed to protect sensitive data from unauthorized leakage or theft.
Key Use Cases and Architectural Examples
1. Real-Time Distributed Processing for High-Velocity Streams
For platforms requiring immediate analytical insights, minimizing architectural complexity while handling large-scale data streams is a primary challenge.
Core patterns: Kappa Architecture, Streaming-first, and In-Memory Storage.
Production tech stack: Apache Kafka, PySpark Structured Streaming, and Redis.
Specific example: In a high-volume financial transaction system, implementing a Kappa Architecture simplifies the processing pipeline by routing both real-time logs and historical data events through a single stream engine. By prioritizing a streaming-first approach using an Apache Kafka cluster, the platform eliminates the complex dual-pipeline maintenance found in traditional Lambda setups.
A PySpark Structured Streaming application consumes these event streams directly, executing stateful window transformations on the fly. To achieve sub-millisecond latency for immediate fraud lookups, the working state or frequently queried reference tables are held in an In-Memory Storage layer like Redis, ensuring rapid access speeds that disk-based alternatives cannot match.
2. Decentralized Architecture for Enterprise Scaling
Large organizations often face engineering bottlenecks when a single, centralized team manages a massive monolithic data lake.
Core patterns: Data Mesh, Data Governance, and Data Cataloging and Discovery.
Production tech stack: Databricks Unity Catalog, AWS Lake Formation, and Snowflake Data Sharing.
Specific example: A multi-national banking entity transitions to a Data Mesh framework, shifting data asset ownership away from a centralized team to domain-oriented groups, such as Risk Modeling and Retail Analytics, which deliver data as independent products. To maintain unified compliance, the infrastructure relies on strict Data Governance policies managed through Databricks Unity Catalog and AWS Lake Formation, enforcing centralized data stewardship, role-based access control, and automated data classification. These localized datasets are then securely exposed across departments via Snowflake Data Sharing. A centralized Data Catalog runs continuously on top of these endpoints, providing developers across the entire enterprise a single, searchable inventory to securely discover, audit, and consume cross-domain data products.
3. High-Performance Cloud Analytics and Reporting
To optimize modern cloud infrastructure, data pipelines must maximize query performance while containing compute and storage costs.
Core patterns: Extract, Load, Transform (ELT) and Columnar Storage.
Production tech stack: dbt (Data Build Tool), Delta Lake, Snowflake, and Apache Spark.
Specific example: A modern enterprise analytics platform ingests massive volumes of raw operational data into cloud object storage, choosing a flexible ELT pipeline over traditional ETL frameworks. Raw files are loaded directly into a target data platform like Snowflake or Databricks Delta Lake, leveraging cloud elasticity to execute complex transformations post-load using dbt or optimized Spark SQL queries. To maximize business intelligence performance, the underlying files are stored using highly optimized Columnar Storage formats like Parquet. This structures data by column rather than row, ensuring that analytical queries only read the specific columns requested for a report. This optimization cuts down disk I/O operations and speeds up complex calculations across billions of historical records.
Conclusion
Successfully implementing a modern data infrastructure is never about finding a single pattern to solve every corporate challenge. True architectural maturity lies in knowing how to weave these paradigms together. By mapping tactical storage choices directly to overarching governance and integration frameworks, software architects can build resilient environments capable of evolving alongside business demands. Which of these three architectural focus areas aligns best with your current production environment? Let me know in the comments below.
When optimizing Spring Boot integration tests, developers often focus on obvious metrics: total build time, test execution time, CPU usage, memory consumption, or the number of failed tests. These metrics are useful, but they do not always explain why an integration test suite is slow. One of the most important hidden metrics in Spring Boot integration testing is the number of distinct ApplicationContext instances created during the test run (see my other article on this topic). Spring’s TestContext framework can cache and reuse an ApplicationContext between test classes, but only if the effective test configuration is the same. If the configuration differs, Spring has to create another context. In large enterprise applications, this can become expensive very quickly.
How can the number of contexts be interpreted correctly?
If a test suite creates two contexts, is that good?
If it creates six contexts, is that acceptable?
If it creates twenty contexts, is that already a design smell?
And most importantly: where should such a judgment come from?
Spring itself does not define a universal threshold for a “good” or “bad” number of cached ApplicationContext instances. However, the official documentation explicitly points out that a large number of loaded contexts can make a test suite unnecessarily slow. This means the number of contexts is not just an implementation detail. It is a relevant diagnostic signal. This article explains how I derived a practical interpretation table for a real-world Spring Boot integration test suite and why such a table should be understood as a case-study heuristic, not as a universal Spring Framework rule.
Test Grouping Is a Valid Concept
General testing research supports that tests can be grouped by similarity, cost, coverage, or runtime behavior. This is highly relevant for Spring Boot integration tests.
In Spring Boot integration testing, MergedContextConfiguration may be interpreted as one practical grouping dimension: tests with the same effective Spring configuration belong to the same context group. In this case, similarity means shared Spring test configuration. That does not mean all tests should use the same context. It means that tests should not accidentally create different contexts when they are actually testing under the same architectural conditions.
Spring’s Context Cache as a Framework-Specific Grouping Mechanism
Spring Boot integration tests are not plain unit tests. They often require infrastructure such as dependency injection, database configuration, security configuration, web layer configuration, mock infrastructure, external API clients, messaging components, or tenant-specific setup. Spring’s TestContext framework handles this through the ApplicationContext. The framework can reuse a context if the effective configuration is the same. The cache key is based on configuration parameters such as configuration classes, active profiles, property sources, context customizers, initializers, and other test context settings. Spring’s documentation describes this context caching mechanism and explains that contexts can be reused when the same unique context configuration is encountered again.
Let me explain. Two tests may look similar to a developer but still produce different contexts if they use different profiles, properties, mocks, or imported configuration classes. Tests with genuinely different infrastructure requirements should normally produce separate context groups. For example, a database-focused test and a test involving an external OData destination may have different infrastructure requirements. In that case, a separate context is not a problem. It reflects a real test configuration group. The problem arises when every test class introduces a slightly different property, mock, or configuration import without a strong technical reason. Then the number of contexts grows not because the architecture requires it, but because the test suite has configuration drift.
Why Multiple Contexts Can Be Legitimate in Enterprise Applications
Spring Boot itself supports different testing styles. The documentation describes @SpringBootTest for loading the application context through SpringApplication, and it also provides more focused test annotations for specific slices of an application. Spring Boot’s test slices include annotations such as @WebMvcTest, @DataJpaTest, @JsonTest, and others. These annotations intentionally load only selected parts of the application and import different auto-configurations depending on the target slice. Besides the Spring documentation, many community blogs report that different enterprise systems may have separate integration test groups, such as database-focused tests, web/controller tests, security-related tests, and so on. So, the goal should be to minimize unnecessary context fragmentation while preserving justified test configuration groups, instead of forcing the entire integration test suite into one ApplicationContext.
From Test Grouping to a Context-Count Heuristic
Based on this reasoning, I used the following interpretation in a case study:
1-3 application contexts: excellent context reuse.
4-8 application contexts: acceptable if justified.
10+ application contexts: should be investigated, as this may signal a fragmented test configuration.
Let's discuss the numbers.
1-3: Most integration tests share the same effective configuration. For example:
Plain Text
Context 1: default integration test context
Context 2: database-specific context
Context 3: external-system-specific context
Such a structure is usually easy to understand. It suggests that the team has standardized its test profiles, properties, and infrastructure setup.
4-8: This is consistent with broader software-testing research, where test suites are not treated as one homogeneous block.
They are often optimized, selected, prioritized, or clustered according to meaningful technical criteria such as coverage, execution cost, change relevance, or runtime behavior. For example:
Plain Text
Context 1: default SpringBootTest context
Context 2: database-heavy context
Context 3: external API integration context
Context 4: security-specific context
Context 5: multi-tenant context
Context 6: messaging context
Context 7: no-external-destination context
Context 8: migration-specific context
10+: Once the number of contexts reaches double digits, investigation becomes worthwhile. This does not automatically mean the test suite is badly designed. Community articles on Spring test optimization show that a very large enterprise platform with many modules, tenant variants, data stores, messaging systems, and external integrations may legitimately require more contexts. So, the number 10+ is not a firm limit, but it suggests that the risk of accidental fragmentation becomes higher.
Conclusion
Test grouping is a recognized concept in software-testing research. Large test suites are often optimized through minimization, selection, prioritization, and clustering. These techniques are based on the idea that tests have different costs, purposes, coverage, runtime behavior, and relevance. For Spring Boot integration tests, context reuse is a framework-specific grouping criterion. In other words, apply test grouping when deciding which Spring application contexts to create. Tests with the same effective MergedContextConfiguration belong to the same context group and can share the same cached ApplicationContext. Tests with genuinely different infrastructure needs may require different contexts. Therefore, the goal is not to reduce every enterprise test suite to a single context. The goal is to distinguish between justified test configuration groups and accidental configuration fragmentation. The numbers shown are a practical case-study heuristic, not a universal rule.
But the underlying principle is robust: A small number of well-defined context groups is healthy, but a growing number of slightly different contexts is a performance smell. That principle connects Spring’s TestContext cache mechanism with a broader idea from software-testing research: large test suites should be structured intentionally, not allowed to fragment accidentally.
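The context-count heuristic from the case study can be written down as a small classifier. The following is a minimal sketch in plain Java and is not part of Spring: the class name `ContextCountHeuristic`, the enum labels, and the handling of a count of 9 (which the heuristic leaves unassigned) are assumptions made for illustration only.

```java
class ContextCountHeuristic {

    /** Hypothetical labels for the three bands of the case-study heuristic. */
    enum Assessment { EXCELLENT_REUSE, ACCEPTABLE_IF_JUSTIFIED, INVESTIGATE }

    /**
     * Classifies the number of distinct ApplicationContexts created in a test run.
     * Assumption: the heuristic does not cover a count of 9; it is grouped with
     * the 4-8 band here because the text only flags double-digit counts.
     */
    static Assessment assess(int contextCount) {
        if (contextCount < 1) {
            throw new IllegalArgumentException("contextCount must be at least 1");
        }
        if (contextCount <= 3) {
            return Assessment.EXCELLENT_REUSE;
        }
        if (contextCount <= 9) {
            return Assessment.ACCEPTABLE_IF_JUSTIFIED;
        }
        return Assessment.INVESTIGATE;
    }

    public static void main(String[] args) {
        // Example counts, as they might be read from a test run's context-cache statistics.
        for (int count : new int[] {2, 6, 12}) {
            System.out.println(count + " contexts -> " + assess(count));
        }
    }
}
```

A check like this is only a prompt for a closer look; whether a given count is justified still depends on why each context exists.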
This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, contributing to the Fluent Bit project website. This article will be a hands-on exploration of how to get started contributing blog articles to the Fluent Bit project website, something that is very accessible to newcomers and a great way to become part of the community. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Contributing to the Fluent Bit Website? Before diving into the hands-on steps, let's understand why contributing to the Fluent Bit website matters and why it's a great starting point for those of us new to contributing to a CNCF project. The Fluent Bit project website is where the community shares knowledge, tutorials, and updates with the world. Contributing a blog article means you are directly adding value to the project — helping other developers learn, discover use cases, and get hands-on experience with Fluent Bit telemetry pipelines. No deep knowledge of C is required; no need to wrestle with complex build pipelines. Just ideas, a bit of Hugo familiarity, and a willingness to follow the contribution process. As a CNCF graduated project, Fluent Bit has an active and welcoming community. 
Getting your name into the contributor history by writing a blog article is a genuinely meaningful step into open source participation.

Where to Get Started

The Fluent Bit website lives at fluent/fluent-bit-website on GitHub. Before we touch a single line, there are two repositories we need to understand and keep straight in our heads throughout this process. The first is fluent/fluent-bit-website, the upstream canonical repository owned by the Fluent community. We never push directly to this one. The second is your-username/fluent-bit-website-fork, a personal fork of the repository where all our work happens before it goes upstream via a pull request. Note that it has been renamed to add -fork to the repository name, a standard practice to easily identify forked projects. First we fork the repository on GitHub, then we clone our fork locally, and finally we add the upstream remote so we can always sync our local copy with what the community is doing, directly from our command line.

Shell
# Fork the original website using GitHub.
# Check out the fork locally, here using my fork as an example.
$ git clone git@github.com:eschabell/fluent-bit-website-fork.git
# Change into the cloned directory so the next command runs inside the repository.
$ cd fluent-bit-website-fork
# Add the upstream website repo.
$ git remote add upstream https://github.com/fluent/fluent-bit-website.git

It's a habit worth building from day one: always sync from upstream before starting any new contribution. This prevents the diverged-branch headache that would otherwise slow down our pull request later. Here is my example:

Shell
# Fetch any upstream work done by others.
$ git fetch upstream
# Sync local fork with the upstream changes.
$ git rebase upstream/main
# Last step, push to your fork's repository.
$ git push

With our fork cloned, verify the site builds locally. The Fluent Bit website uses Hugo, so install it (an exercise left to the reader), then run the Hugo server and open a browser to confirm the site renders before making any changes.
Shell
# Start your local copy of the website on http://localhost:9999
$ hugo server -D -p 9999
...
Watching for config changes in .../fluent-bit-website-fork/config.toml
Start building sites …
hugo v0.161.1

                   │ EN
───────────────────┼─────
 Pages             │ 337
 Paginator pages   │ 0
 Non-page files    │ 0
 Static files      │ 315
 Processed images  │ 0
 Aliases           │ 1
 Cleaned           │ 0

Built in 319 ms
Environment: "development"
Serving pages from disk
Running in Fast Render Mode.
Web Server is available at //localhost:9999/ (bind address 127.0.0.1)
Press Ctrl+C to stop
...

If all goes well, we should see the Fluent Bit website at http://localhost:9999 on our local machine. Now we are ready to get started with our first change to the Fluent Bit website, maybe by submitting an article?

How to Contribute Our First Article

The Fluent Bit website is a Hugo site, which means every blog article is a Markdown file with a YAML front matter block at the top. This header is not optional: Hugo uses it to generate the page metadata, listing views, author information, dates, and tags. Getting the front matter right is the first practical task. It covers fields like title, date, author, tags, and any project-specific fields the site uses. This is a sample from the Fluent Bit latest release announcement blog, found under content/announcements/v5.0/:

YAML
---
title: 'v5.0.6'
description: 'Next generation Telemetry Agent for Logs, Metrics and Traces.'
url: "/announcements/v5.0.6/"
release_date: 2026-05-21
publishdate: 2026-05-21
ver: v5.0.6
herobg: "/images/[email protected]"
latestVer: true
---

The fastest way to get this right is not to try to write it from memory. Instead, we open the blog content directory (content/posts/) in our local fork checkout and find a recently merged article to use as our reference. Mirror its file naming convention, directory placement, and front matter fields.
Read a few existing articles to understand the tone and content that fit the Fluent Bit blog: tutorials, integration guides, project updates, and community spotlights are all examples of what works well there.

Writing Our Article

With front matter understood, we need to write our article in a Markdown file locally and use the Hugo server to preview it as we work. The audience is the Fluent Bit community (developers, operators, and platform engineers), so assume technical literacy but do not assume deep Fluent Bit expertise, especially for introductory topics.

Submitting Our Changes

Once the article is written and previewed correctly, here is the process to follow to submit your changes to the website project for the maintainer's review. Create a new branch off the synced fork; we never want to work directly on our main branch. Commit your new file with a clear, descriptive commit message. Push to your fork as shown below.

Shell
# Work on a branch.
$ git checkout -b erics_my_new_article
# When ready to submit our changes.
$ git add content/posts/my-new-fb-article.md
# Commit with a GPG signature (-S, assumes GPG is set up) and a
# Signed-off-by trailer (-s), which is what DCO checks look for.
$ git commit -S -s
# Push the changes to our repository.
$ git push --set-upstream origin erics_my_new_article

Now we open a pull request against fluent/fluent-bit-website from the GitHub UI. In the pull request description, explain what the article covers and why it is a good fit for the blog. Then explicitly request a review; do not assume the PR will be picked up automatically, and make sure to tag a reviewer. If for some reason it's not possible to request a review through the UI, then feel free to post a comment after submitting the PR and ask me to review, as I'm always happy to help.
AI Good Habits for Contributors This section is worth paying close attention to because AI tooling is now part of many developers' daily workflow, and if used carelessly in an open-source context, it can create real problems and broken trust with core project maintainers. Here are my personal ground rules for working with AI assistance on open-source projects and how I work with Fluent Bit projects. Use Your Local Fork Filesystem Configure your AI tool to work against your local fork checkouts — not a downloaded copy to /tmp or any other ephemeral scratch directory. This is important because your working directory is already version-controlled. Every change the AI proposes is immediately visible via git diff, which means you always know exactly what changed before you decide to commit anything. It also saves on token usage and bandwidth, speeding up the AI results to your queries. Never Let AI Modify Without Your Approval I always set a personal rule: no line of code or documentation changes without your explicit review and approval, line by line. AI tools should propose changes, and you accept, reject, or modify them. This is not just good open source hygiene — it is how you learn the codebase and the project's conventions. The Fluent Bit docs and website have a specific voice and structure. Furthermore, you are putting your name (signing the commits) on any changes you are pushing, so you might want to make sure you agree with each line that is being modified in your name. Never Let AI Touch Git This one is non-negotiable for me in git interactions with my upstream repositories. AI does not commit, does not push, does not fork, and does not open pull requests in my inner developer loop. I do all of that manually. Commits are attribution. When you sign a commit with your name and email, you are asserting that you wrote or have the right to submit that content. 
In a CNCF project operating under a DCO (Developer Certificate of Origin), this is a legal and community trust matter, not a formality. Keep your hands on the wheel for all git operations, and you will also understand the processes and retain your skills. Check for Tests When Adding New Code This one is more for code-based repositories, but it's good to know as background information here. If your contribution goes beyond a blog article and into actual code — a plugin, a configuration example, a script — check whether the Fluent Bit fork has existing test patterns for that area. Follow them. If you are adding testable behavior, add tests. Ask in the PR or issue if you are unsure what test coverage is expected. Maintainers would rather answer that question up front than request changes after review. Always Provide a Proper Commit Message Every commit should have a clear, structured message. At minimum: a short subject line describing what changed, followed by a list in the body describing why and what specifically was modified. It must be signed, or it will fail on DCO sign-off in the CI/CD process. Check the contributing guide for the expected standard. A good commit message is also your own paper trail — if a maintainer asks why you changed something, your commit history should answer the question. Remember, you are signing this, not your AI tooling. Nice to Have: Open an Issue Before a PR This is more for code repositories than for the website project, but good background information. For anything beyond a trivially obvious contribution, the best practice is to open a GitHub issue first. Describe what you want to write or fix, get a signal from the maintainers that it is welcome, then do the work and open the PR referencing that you are fixing that issue. 
More in the Series In this article, we explored step-by-step what it takes to make our first article contribution to the Fluent Bit project website — from setting up our fork and getting Hugo running locally to configuring a blog article and submitting a pull request. Finally, we tried to help with establishing good habits around AI tooling along the way.
The System Was Broken, and Everyone Knew It Our dashboards refreshed overnight. That was the expectation. Then, one week, they started taking six hours. Then eight. On a bad day, the full 24 hours. Business users would come in on Monday morning and still see Friday's numbers. The data was wrong, too. Not wrong in an obvious way. Wrong in the quiet way where someone in finance notices a number looks off, checks it manually, finds a discrepancy, and then stops trusting the system. That is the worst kind of mistake. Because once trust is gone, you do not just have a technical problem. You have a people problem. Our stack was old. SQL Server feeding an on-premises data warehouse, running through an ETL tool that was older than some of our engineers. It worked fine when data volumes were smaller. As volumes grew, the whole thing started showing cracks. One pipeline failing would back up three others. Dependencies were fragile. Retries were manual. The team spent more time keeping the lights on than actually doing analytics. The most frustrating part was not the downtime. It was watching a report go out with wrong numbers and knowing exactly why it happened, and not having a fast way to fix it. We needed to rebuild, not patch. And we needed something that could handle what we were asking of it. Why Airflow We looked at a few options. We kept coming back to Apache Airflow for three reasons. First, it is Python-based. Our team writes Python. That matters more than people admit. The best orchestration tool is the one your team will use properly. Second, it integrates with everything. We were moving to Databricks and AWS. We were already using Power BI and Tableau. Airflow has native integrations for all of it. We were not going to spend six months building connectors before we could even start. Third, the DAG model forced us to think clearly about dependencies. That was a feature. Our old system had implicit dependencies that nobody fully understood. 
Writing explicit DAGs made us document what we needed the data to do. One more thing that mattered: Airflow is what the rest of the industry uses. When we hired someone new, there was a good chance they already knew it. When something broke, the community had probably seen it before. What We Actually Built The Pipeline Data comes in from transactional sources and lands in AWS S3. Airflow picks it up when it arrives, not on a schedule, using an S3 sensor in deferrable mode. This was one of the better decisions we made. Event-based triggering means the pipeline starts as soon as the data is ready. No more sitting in a queue waiting for a fixed run time. From S3, Airflow kicks off Spark jobs in Databricks. The jobs transform raw data into clean Delta tables. Once that is done, Airflow calls the Power BI and Tableau APIs to trigger dashboard refreshes. Before any refresh hits the dashboard, we validate the data. If something looks wrong, the refresh does not go through, and the team gets an alert. After a successful refresh, stakeholders get a Slack notification and an email. They know exactly when fresh data is available. They stopped asking. 4 Patterns That Made It Work We tried a lot of approaches. These four became our standard: 1. Parameterized DAG Templates We built one template, not fifty DAGs. New data sources get added by updating a config file. This cut development time by around 80% once we had the pattern right. 2. Event-Based Triggers S3 sensor in deferrable mode. The pipeline runs when data arrives, not on a schedule. This alone took significant latency out of the system. 3. SLA Monitoring on Every DAG If a job runs longer than expected, the team gets an alert. We find out before a business deadline is missed, not after. 4. Automatic Retries With Escalation Transient failures retry automatically. Persistent failures send an alert. Engineers deal with real problems, not network hiccups. 
The Numbers After 6 Months

Data refresh went from 24 hours to under 2 hours for all critical processes. Manual intervention dropped 70%. We hit 100% SLA compliance for six straight months.

The number I care most about is the last one: stakeholder trust. After the system stabilized, our finance team stopped verifying dashboard numbers against source data before board meetings. That is not a metric you can put in a dashboard. But it is the one that tells you the work was worth it. When the numbers were finally right, people were genuinely happy. Not just satisfied. Happy. That is what accurate data does to a team that has been burned by wrong numbers.

Beyond the BI team, the impact spread. Executives could see business unit performance in real time. Forecasting got more accurate because the inputs were reliable. The data team stopped being the people who maintain the pipes and started being the people who answer business questions.

What I Learned

1. Start With Less Than You Think You Need

We started with three data sources. Not the whole system. Starting small let us figure out the patterns before we had to scale them. Every team I have seen try to migrate everything at once runs into problems that could have been caught earlier with a smaller scope.

2. DAG Readability Is Not Optional

Someone will read your DAG at 2 am when something is broken. Make it readable. Good names, modular code, comments that explain why, not just what. We paid for skipping this early on.

3. Build Monitoring Before You Need It

We deployed the first version without proper monitoring and spent weeks discovering failures reactively. Build observability first. Everything else can be improved later, but you cannot go back and add monitoring to failures that have already happened.

4. Understand Beyond the System

This is the one I wish someone had told me. Organizations trust their data systems. That trust is good, but it can also mask problems.
Sometimes the pipeline is running fine, and the data is still wrong because of something upstream you did not model. You must go beyond the system to find those problems. Query the source. Check the logic. Do not assume that a green DAG means good data. 5. Treat DAGs Like Production Code Code review. Version control. Testing. If you would not deploy application code without these, do not deploy DAGs without them either. We learned this by breaking things in production that we would have caught with a proper review process. What I Would Do Differently Two things. Data quality checks should have been built into the framework from the start. We added them reactively when problems showed up. Building a proper data quality layer upfront would have caught issues before they became dashboard problems. I would have brought business stakeholders into the DAG design conversations earlier. The engineers know what the data needs to do technically. The business users know what questions the data needs to answer. Those are different things. Getting both perspectives at the design stage produces better pipelines. Where This Goes Next The system works. That is not the end of the story; it is the beginning of what you can actually do when infrastructure stops being the constraint. With reliable data pipelines in place, the team can focus on predictive analytics, anomaly detection, and real-time decision support. The boring infrastructure work unlocks the interesting analytics work. That was always the point. If your team is still dealing with broken pipelines and inaccurate data, the problem is probably not your people. It is the architecture. Airflow will not fix everything, but it will give you the orchestration layer to build something that works. Start with one pipeline. Get it right. Then scale it. The goal was never faster dashboards. The goal was data that people trusted enough to make decisions with. Everything else followed from that.
Enterprise REST integrations rarely fail in a clean, binary way. The dominant failure modes are usually partial and ambiguous: a socket closes after a downstream system commits, a gateway returns a timeout while the target service is still processing, a throttling layer asks for a pause, or a dependency becomes slow enough that waiting callers begin to exhaust threads, connections, and ports. In that environment, simplistic catch-and-retry logic is not resilience. It is uncontrolled traffic generation. Mature error handling starts by accepting that not every failure is retryable, that the HTTP protocol already exposes useful semantics for temporary overload and replay safety, and that retry logic has to cooperate with circuit breaking, fallback paths, and telemetry rather than act on its own. Failure Semantics Before Retry A robust retry policy begins with failure classification, not with a retry counter. Temporary transport failures, selected timeout conditions, and explicit server-side signals such as 503 Service Unavailable and 429 Too Many Requests are fundamentally different from validation, authorization, or contract violations. 503 is explicitly defined as a temporary inability to handle the request, potentially accompanied by Retry-After, while 429 represents rate limiting and may also carry a Retry-After value. By contrast, retrying an invalid request usually only repeats the same defect. Microsoft’s retry guidance makes the same distinction: transient faults are worth retrying after a delay, while non-transient faults should be surfaced and handled as errors. HTTP method semantics also matter more than most retry interceptors admit. RFC 9110 defines safe methods as read-only and idempotent methods as those whose intended effect is the same whether one request arrives or many. 
It explicitly permits automatic retries for idempotent methods after a communication failure, but advises against automatic retries for non-idempotent methods unless the client has another way to know the action is safe to replay or to prove that the original request was never applied. That is the reason payment capture, shipment reservation, and account mutation flows need business idempotency keys or conditional requests, not just a library annotation. For update-heavy integrations, 428 Precondition Required, If-Match, and 412 Precondition Failed provide a standards-based path to prevent lost updates and make recovery from ambiguous failures safer. Timeouts belong in the same discussion because a retry without a timeout is effectively an admission that the caller is willing to hold scarce resources indefinitely. The AWS Builders’ Library notes that long waits tie up memory, threads, connections, ephemeral ports, and other limited resources, and that timeouts set too low can also create cascading retry traffic. In practice, the retry policy and the timeout budget are the same control surface viewed from different angles. If the timeout is unbounded, retries arrive too late to be useful. If retries are unbounded, a timeout only delays the storm. Making HTTP Responses Actionable Once the retry boundary is defined, error payloads need to become machine-actionable. RFC 9457 standardizes the fields that matter: type, title, status, detail, and instance. The specification is especially useful because it separates a human-readable explanation from a machine-readable classification. The detail field is intended to help explain the specific occurrence and is not meant to be parsed for program logic; machine consumers should rely on type and well-defined extension members instead. Spring’s ProblemDetail maps directly to this model and supports non-standard properties through an extension map that can be rendered as top-level JSON. 
That gives upstream services a clean way to expose retry hints, domain error codes, and correlation information without forcing clients to scrape message strings.

That structure belongs at the client boundary, where HTTP details are translated once into domain-specific exceptions. Spring’s synchronous RestClient is well-suited to this because it allows custom status handlers rather than forcing every 4xx into the same exception path.

Java
private ShipmentResponse reserveShipment(ShipmentCommand command) {
    return restClient.post()
        .uri("/shipments/reservations")
        .header("Idempotency-Key", command.requestId())
        .body(command)
        .retrieve()
        // Throttling and temporary unavailability: transient, may carry a backoff hint.
        .onStatus(status -> status.value() == 429 || status.value() == 503 || status.value() == 504,
            (request, response) -> {
                var retryAfter = response.getHeaders().getFirst("Retry-After");
                throw new TransientUpstreamException("shipping-api", retryAfter);
            })
        // Any remaining 4xx is a semantic client error and is terminal.
        .onStatus(HttpStatusCode::is4xxClientError,
            (request, response) -> {
                throw new NonRetryableUpstreamException("shipping-api");
            })
        .body(ShipmentResponse.class);
}

This boundary keeps the retry policy honest. Throttling and temporary unavailability become explicit transient exceptions that can carry backoff hints, while semantic client errors become immediately terminal. The idempotency key on the outbound write does not make every POST automatically safe, but it creates the contract required for the upstream side to deduplicate repeated attempts when replay becomes necessary after a timeout or dropped connection. That is substantially safer than retrying blindly after any exception because the classification is now based on protocol semantics and upstream intent rather than on a generic catch block.

Backoff That Respects the Protocol

After classification comes timing. Fixed-delay retry loops are attractive because they are easy to read, but they are a poor fit for overloaded distributed systems.
Both AWS and Azure recommend pausing between attempts and increasing the delay because immediate retries often land while the dependency is still unhealthy. AWS adds the deeper operational point: when many clients retry in lockstep, recovery traffic becomes a synchronized burst, which is exactly why jitter matters. Azure’s retry-storm guidance makes the operational rule even more direct: retry attempts and total duration have to be limited, and the Retry-After header must be honored when it is sent. Retry-After can be either a relative number of seconds or an absolute HTTP date, so treating it as a magic integer is incomplete protocol handling.

Resilience4j is useful here because its retry model is more expressive than a simple fixed wait. The library supports maxAttempts, waitDuration, retryOnResultPredicate, exception-based selection, and an intervalBiFunction that can compute the next delay from the attempt count and either a result or an exception.

Java
RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(4)
    .retryOnException(ex -> ex instanceof ResourceAccessException
        || ex instanceof TransientUpstreamException)
    .ignoreExceptions(NonRetryableUpstreamException.class, ValidationException.class)
    // The interval function returns the next wait time in milliseconds.
    .intervalBiFunction((attempt, either) -> {
        var ex = either.getLeft();
        // Prefer the server's Retry-After hint when one was captured.
        if (ex instanceof TransientUpstreamException t && t.retryAfter() != null) {
            return t.retryAfterDuration().toMillis();
        }
        // Otherwise: exponential delay capped at 3 seconds, plus random jitter.
        var base = Math.min(200L * (1L << (attempt - 1)), 3000L);
        var jitter = ThreadLocalRandom.current().nextLong(0, 250);
        return base + jitter;
    })
    .failAfterMaxAttempts(true)
    .build();

This pattern does two things that enterprise integrations often miss. First, it respects protocol hints when the server provides them. Second, when the server does not provide them, it falls back to bounded exponential delay with jitter instead of immediate replay. That preserves throughput during brief faults without turning one failed request into a tight loop.
It also keeps business semantics intact by excluding validation failures and other known terminal conditions from the retry path entirely.

Retry With Circuit Breaking and Fallbacks

Retry should never be the only protection layer around a dependency. Azure’s circuit breaker guidance draws the distinction clearly: retry assumes the operation may succeed soon, while a circuit breaker stops calls that are likely to fail and allows the system to probe for recovery later. Resilience4j implements this with count-based or time-based sliding windows and explicit breaker states, which makes the breaker a statistical decision point rather than a hardcoded timeout reaction. In practice, retries belong inside a bounded window, and the circuit breaker decides when that window should close early because the failure is no longer transient.

For annotation-driven Spring services, that composition stays concise as long as the fallback preserves business truth. A fallback should not fabricate success merely to keep the API green. A degraded but truthful state is a better contract than a false positive.

Java
@CircuitBreaker(name = "paymentGateway", fallbackMethod = "deferCapture")
@Retry(name = "paymentGateway")
public PaymentResult capture(PaymentCommand command) {
    return paymentGateway.capture(command);
}

private PaymentResult deferCapture(PaymentCommand command, Exception ex) {
    outbox.save(new PendingCapture(command.paymentId(), command.requestId(), ex.getMessage()));
    return PaymentResult.pending(command.paymentId());
}

The important detail is not the annotation pair itself, but the semantics of the fallback. Writing an outbox record or reconciliation task acknowledges that the payment state is uncertain and that recovery will continue asynchronously. Returning pending instead of captured prevents downstream systems from treating a degraded path as a confirmed business success. That is the difference between fault tolerance and silent data corruption.
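The difference between retrying and circuit breaking is easier to see with the breaker states written out. The following is a deliberately simplified, single-threaded sketch of a count-based sliding window in plain Java. It is not how Resilience4j is implemented, and the class name, method names, and thresholds are hypothetical; a real breaker also moves from open to half-open after a wait duration rather than on an explicit call.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CountBasedBreakerSketch {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;                 // number of most recent calls considered
    private final double failureRateThreshold;    // e.g. 0.5 opens the breaker at 50% failures
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;

    CountBasedBreakerSketch(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    State state() {
        return state;
    }

    /** Calls are rejected only while the breaker is open. */
    boolean allowsCall() {
        return state != State.OPEN;
    }

    /** Stand-in for the wait duration elapsing: lets one trial call through. */
    void allowProbe() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
        }
    }

    /** Records the outcome of a call and updates the state. */
    void record(boolean failed) {
        if (state == State.OPEN) {
            return; // no calls are made while open, so nothing to record
        }
        if (state == State.HALF_OPEN) {
            // The single trial call decides: recover or stay open.
            outcomes.clear();
            state = failed ? State.OPEN : State.CLOSED;
            return;
        }
        outcomes.addLast(failed);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // slide the window
        }
        if (outcomes.size() == windowSize) {
            long failures = outcomes.stream().filter(f -> f).count();
            if ((double) failures / windowSize >= failureRateThreshold) {
                state = State.OPEN;
            }
        }
    }

    public static void main(String[] args) {
        var breaker = new CountBasedBreakerSketch(4, 0.5);
        for (boolean failed : new boolean[] {false, true, false, true}) {
            breaker.record(failed);
        }
        System.out.println("after window fills: " + breaker.state());
        breaker.allowProbe();
        breaker.record(false);
        System.out.println("after successful probe: " + breaker.state());
    }
}
```

The point of the sketch is that the decision is made over a window of outcomes, not per request: a retry policy asks whether this call should be repeated, while the breaker asks whether any call should be attempted at all.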
Reactive Flows and the Hidden Cost of Convenience

Reactive clients make retry composition even easier, which is precisely why strict filtering matters. Spring’s WebClient maps responses with status codes of 400 and above to exceptions by default, and onStatus allows those responses to be reclassified. Reactor then adds a retry DSL where Retry.backoff is preconfigured for exponential backoff with jitter. The result is elegant, but elegance is dangerous when it hides accidental replay of all failures instead of only transient ones.

Java
public Mono<InventorySnapshot> fetchInventory(String sku) {
    return webClient.get()
        .uri("/inventory/{sku}", sku)
        .retrieve()
        .onStatus(status -> status.value() == 429 || status.value() == 503,
            response -> response.bodyToMono(ProblemDetail.class)
                .defaultIfEmpty(ProblemDetail.forStatus(response.statusCode()))
                .map(problem -> new TransientUpstreamException(problem.getDetail())))
        .bodyToMono(InventorySnapshot.class)
        .retryWhen(Retry.backoff(3, Duration.ofMillis(250))
            .filter(TransientUpstreamException.class::isInstance));
}

The critical move in this style is the filter. Without it, every WebClientResponseException becomes retryable, which means malformed requests, unauthorized access, and contract defects start looping through the same pipeline as a temporary overload. With the filter in place, the reactive chain remains expressive without becoming indiscriminate. The same principle applies to result-based retries as well: only states that are explicitly modeled as transient should flow back into the retry companion.

Visibility as Part of the Contract

An enterprise retry policy that cannot be observed is effectively untestable in production. Spring’s observability support is built around Micrometer observations, and Resilience4j provides a Micrometer module for its fault-tolerance primitives. That combination makes it possible to expose retry counts, breaker state, final outcome, and request timing in the same telemetry fabric.
At the protocol level, RFC 9457’s instance field provides a stable error occurrence identifier that can also be propagated into logs and traces. Once those signals exist, a slow integration no longer appears as a single long call; it becomes visible as one business request that triggered multiple upstream attempts before succeeding or degrading. Conclusion Advanced error handling in enterprise REST integrations is not built from retries alone. It is built from protocol-aware classification, explicit replay safety, structured error payloads, bounded backoff with jitter, circuit breaking for persistent faults, truthful fallbacks, and telemetry that exposes every extra attempt. HTTP already provides essential semantics for temporary overload, rate limiting, and conditional updates, while Spring, Reactor, and Resilience4j provide the implementation hooks needed to preserve those semantics in code. When those layers are combined deliberately, retries stop being a reflex and become a controlled recovery strategy that protects both correctness and system stability.
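The backoff example earlier calls a retryAfterDuration() helper without showing it. Because Retry-After may be either a number of seconds or an HTTP date, one possible shape for that conversion is sketched below in plain Java. The class name `RetryAfterParser`, the `parse` method, and the choice to return an empty Optional for unparseable values are assumptions for illustration, not part of Spring or Resilience4j.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class RetryAfterParser {

    /**
     * Converts a Retry-After header value into a wait duration.
     * Returns empty when the header is missing or cannot be parsed, so the
     * caller can fall back to its own backoff calculation.
     */
    static Optional<Duration> parse(String headerValue, Clock clock) {
        if (headerValue == null || headerValue.isBlank()) {
            return Optional.empty();
        }
        String value = headerValue.trim();

        // Form 1: delay in seconds, a non-negative decimal integer.
        if (value.chars().allMatch(c -> c >= '0' && c <= '9')) {
            try {
                return Optional.of(Duration.ofSeconds(Long.parseLong(value)));
            } catch (NumberFormatException tooLarge) {
                return Optional.empty();
            }
        }

        // Form 2: an absolute HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try {
            Instant target = ZonedDateTime
                .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
            Duration delay = Duration.between(clock.instant(), target);
            // A date in the past means the caller may retry immediately.
            return Optional.of(delay.isNegative() ? Duration.ZERO : delay);
        } catch (DateTimeParseException notADate) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A fixed clock keeps the date-based example deterministic.
        Clock fixed = Clock.fixed(Instant.parse("2015-10-21T07:27:30Z"), ZoneOffset.UTC);
        System.out.println(parse("120", fixed));
        System.out.println(parse("Wed, 21 Oct 2015 07:28:00 GMT", fixed));
        System.out.println(parse("soon", fixed));
    }
}
```

Passing the clock in rather than reading the system time is a design choice that keeps the date branch testable. In production code, the parsed value would usually also be capped so that a very large server-provided delay cannot exceed the caller's overall timeout budget.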
Xavier Portilla Edo
Head of Cloud Infrastructure,
Voiceflow
Boris Zaikin
Lead Solution Architect,
CloudAstro GmbH
Sai Sandeep Ogety
Director of Cloud & DevOps Engineering,
Fidelity Investments