In this article, I will discuss a highly available solution built with Spring Boot 3 and Spring Security 6 to address the "centralized authentication" problem frequently seen in modern microservice ecosystems. We are not simply standing up an "authorization service"; we are examining the cache-first pattern, which minimizes database usage, and the Redis Sentinel setup, which keeps the system available when a Redis node fails.

Why a Separate Authentication Service?

While embedding security into each service is an option in microservices, I have always found it more logical to combine a centralized Auth service with an API Gateway.

- DRY (Don't Repeat Yourself): Duplicating token validation logic across many services increases maintenance costs.
- Isolation: Business services focus only on business logic; they don't deal with "is this token valid?" questions.
- Performance: Thanks to the Redis connection, instead of going to the database with every request, we can resolve the validation via the cache in milliseconds.

```text
[Client] ──► [API Gateway] ──► [Auth Service: validate token]
                                          │ (valid)
                                          ▼
                                [Backend Microservices]
```

Cache-Focused Approach: Reducing Database Load

In the classic workflow, every login request puts a load on the DB. With the cache-first approach, a POST /auth/signin request is handled as follows:

1. First, Redis is checked. If there is a valid, unexpired token for the user, it is returned directly.
2. On a cache miss, AuthManager.authenticate() is invoked, a DB query is sent, and a BCrypt check is performed.
3. After a successful login, a token is generated with JJWT (HS256).
4. The token is written to Redis with a TTL (e.g., 24 minutes) and returned in the response.

This shields the main database, especially during brute-force attempts or bursts of login traffic.

```text
POST /auth/signin
        │
        ▼
┌──────────────────────────────┐
│ Token exists in Redis?       │──── YES ──► Return token (0 DB queries)
└──────────────────────────────┘
        │ NO
        ▼
┌──────────────────────────────┐
│ AuthManager.authenticate()   │  (DB query + BCrypt verification)
└──────────────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Generate JWT (JJWT HS256)    │
└──────────────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Write to Redis (TTL: 24 min) │
└──────────────────────────────┘
        │
        ▼
   Return token
```

Implementation Details

User Entity and UserDetails Integration

In most projects, unnecessary mappings are performed between the User entity and the UserDetails objects expected by Spring Security. To reduce complexity, the User entity implements the UserDetails interface directly. This keeps the code cleaner and lets Spring Security work with the entity natively.

```java
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Entity
@Table(name = "T_APP_USER")
public class User implements UserDetails {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seq_user_gen")
    @SequenceGenerator(name = "seq_user_gen", sequenceName = "SEQ_APP_USER", allocationSize = 1)
    @Column(name = "idx")
    private Long idx;

    @Column(name = "firstname")
    private String firstName;

    @Column(name = "lastname")
    private String lastName;

    @Column(unique = true, name = "email")
    private String email;

    @Column(name = "accesskey")
    private String accessKey; // BCrypt-hashed

    @Column(name = "role")
    @Enumerated(EnumType.STRING)
    private Role role;

    @Override
    public Collection<? extends GrantedAuthority> getAuthorities() {
        return List.of(new SimpleGrantedAuthority(role.name()));
    }

    // The email doubles as the Spring Security username
    @Override
    public String getUsername() { return email; }

    @Override
    public String getPassword() { return accessKey; }

    @Override
    public boolean isAccountNonExpired() { return true; }

    @Override
    public boolean isAccountNonLocked() { return true; }

    @Override
    public boolean isCredentialsNonExpired() { return true; }

    @Override
    public boolean isEnabled() { return true; }
}
```

JWT Filter: The Gateway to Security

Every request to the system passes through a OncePerRequestFilter. Here, using JwtAuthenticationFilter, we parse the token in each request and populate the SecurityContext. By using the SecurityFilterChain bean, which replaces the WebSecurityConfigurerAdapter removed in Spring Security 6, we have disabled CSRF and made session management completely stateless.

Token Generation and Validation

```java
public interface JwtService {
    String extractUserName(String token);
    String generateToken(UserDetails userDetails);
    boolean isTokenValid(String token, UserDetails userDetails);
}

@Service
public class JwtServiceImpl implements JwtService {

    @Value("${token.signing.key}")
    private String jwtSigningKey; // Base64-encoded secret key

    @Override
    public String extractUserName(String token) {
        return extractClaim(token, Claims::getSubject);
    }

    @Override
    public String generateToken(UserDetails userDetails) {
        return Jwts.builder()
                .setClaims(new HashMap<>())
                .setSubject(userDetails.getUsername())
                .setIssuedAt(new Date(System.currentTimeMillis()))
                // 1000 ms * 60 * 24 = 24 minutes, matching the Redis TTL
                .setExpiration(new Date(System.currentTimeMillis() + 1000 * 60 * 24))
                .signWith(getSigningKey(), SignatureAlgorithm.HS256)
                .compact();
    }

    @Override
    public boolean isTokenValid(String token, UserDetails userDetails) {
        final String userName = extractUserName(token);
        return userName.equals(userDetails.getUsername()) && !isTokenExpired(token);
    }

    // Parsing verifies the signature before any claim is read
    private <T> T extractClaim(String token, Function<Claims, T> claimsResolver) {
        return claimsResolver.apply(
                Jwts.parserBuilder()
                        .setSigningKey(getSigningKey())
                        .build()
                        .parseClaimsJws(token)
                        .getBody()
        );
    }

    private boolean isTokenExpired(String token) {
        return extractClaim(token, Claims::getExpiration).before(new Date());
    }

    private Key getSigningKey() {
        return Keys.hmacShaKeyFor(Decoders.BASE64.decode(jwtSigningKey));
    }
}
```

High Availability: Redis Sentinel

Using a single Redis instance means that the Auth service has a single point of failure. If Redis crashes, no one can access the system. We mitigated this risk with Redis Sentinel. Thanks to the Sentinel structure:

- If the master node crashes, a replica node is automatically promoted to master via failover.
- On the application side, the Lettuce driver handles these transitions transparently.

Technical Stack and Requirements

Redis Sentinel configuration:

```java
@Configuration
public class RedisConfig {

    @Value("${spring.redis.sentinel.master}")
    private String master;

    @Value("${spring.redis.sentinel.nodes}")
    private String sentinelNodes;

    @Value("${spring.redis.password}")
    private String password;

    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
                .master(master);

        // sentinelNodes is a comma-separated list of host:port pairs
        for (String node : sentinelNodes.split(",")) {
            String[] hostPort = node.split(":");
            sentinelConfig.sentinel(hostPort[0], Integer.parseInt(hostPort[1]));
        }

        sentinelConfig.setPassword(RedisPassword.of(password));
        return new LettuceConnectionFactory(sentinelConfig);
    }
}
```

```yaml
env:
  - name: spring.redis.sentinel.master
    valueFrom:
      secretKeyRef:
        name: redis-user-secret
        key: username
  - name: spring.redis.password
    valueFrom:
      secretKeyRef:
        name: redis-user-secret
        key: password
```

Token cache service:

```java
@Service
public class TokenCacheServiceImpl {

    private final RedisTemplate<String, String> redisTemplate;

    public TokenCacheServiceImpl(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public void cacheToken(String username, String token, long duration, TimeUnit unit) {
        redisTemplate.opsForValue().set(username, token, duration, unit);
    }

    // Reads straight from Redis so the key's TTL is always honored.
    // (A @Cacheable layer on top would keep serving values after Redis expired them.)
    public String getToken(String username) {
        return redisTemplate.opsForValue().get(username);
    }
}
```

Authentication service: signup and signin:

```java
@Service
@RequiredArgsConstructor
public class AuthenticationServiceImpl implements AuthenticationService {

    private final UserRepository userRepository;
    private final PasswordEncoder passwordEncoder;
    private final JwtService jwtService;
    private final AuthenticationManager authenticationManager;
    private final TokenCacheServiceImpl tokenCacheService;

    @Override
    public JwtAuthenticationResponse signup(SignUpRequest request) {
        var user = User.builder()
                .firstName(request.getFirstName())
                .lastName(request.getLastName())
                .email(request.getEmail())
                .accessKey(passwordEncoder.encode(request.getAccessKey())) // BCrypt
                .role(Role.USER)
                .build();
        userRepository.save(user);
        var jwt = jwtService.generateToken(user);
        return JwtAuthenticationResponse.builder().token(jwt).build();
    }

    @Override
    public JwtAuthenticationResponse signin(SigninRequest request) {
        // 1. Check Redis cache first
        String cachedToken = tokenCacheService.getToken(request.getEmail());
        if (cachedToken != null) {
            return JwtAuthenticationResponse.builder().token(cachedToken).build();
        }

        // 2. If not cached, authenticate (DB + BCrypt)
        authenticationManager.authenticate(
                new UsernamePasswordAuthenticationToken(request.getEmail(), request.getAccessKey())
        );
        var user = userRepository.findByEmail(request.getEmail())
                .orElseThrow(() -> new IllegalArgumentException("Invalid credentials."));

        // 3. Generate token and write to Redis (24 min TTL)
        var jwt = jwtService.generateToken(user);
        tokenCacheService.cacheToken(request.getEmail(), jwt, 24, TimeUnit.MINUTES);
        return JwtAuthenticationResponse.builder().token(jwt).build();
    }
}
```

JWT authentication filter:

```java
@Component
@RequiredArgsConstructor
public class JwtAuthenticationFilter extends OncePerRequestFilter {

    private final JwtService jwtService;
    private final UserService userService;

    @Override
    protected void doFilterInternal(
            @NonNull HttpServletRequest request,
            @NonNull HttpServletResponse response,
            @NonNull FilterChain filterChain
    ) throws ServletException, IOException {
        final String authHeader = request.getHeader("Authorization");

        // Pass through if no Authorization header or doesn't start with Bearer
        if (StringUtils.isEmpty(authHeader) || !StringUtils.startsWith(authHeader, "Bearer ")) {
            filterChain.doFilter(request, response);
            return;
        }

        final String jwt = authHeader.substring(7); // strip the "Bearer " prefix
        final String userEmail = jwtService.extractUserName(jwt);

        // Process only if SecurityContext has no authentication yet
        if (StringUtils.isNotEmpty(userEmail)
                && SecurityContextHolder.getContext().getAuthentication() == null) {
            UserDetails userDetails = userService.userDetailsService()
                    .loadUserByUsername(userEmail);

            if (jwtService.isTokenValid(jwt, userDetails)) {
                SecurityContext context = SecurityContextHolder.createEmptyContext();
                UsernamePasswordAuthenticationToken authToken =
                        new UsernamePasswordAuthenticationToken(
                                userDetails, null, userDetails.getAuthorities()
                        );
                authToken.setDetails(new WebAuthenticationDetailsSource().buildDetails(request));
                context.setAuthentication(authToken);
                SecurityContextHolder.setContext(context);
            }
        }
        filterChain.doFilter(request, response);
    }
}
```

Spring Security 6 configuration:

```java
@Configuration
@EnableWebSecurity
@RequiredArgsConstructor
public class SecurityConfiguration {

    private final JwtAuthenticationFilter jwtAuthenticationFilter;
    private final UserService userService;

    @Bean
    public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http
            .csrf(AbstractHttpConfigurer::disable) // Stateless → no CSRF needed
            .authorizeHttpRequests(request -> request
                .requestMatchers("/auth/**").permitAll() // Auth endpoints open to all
                .anyRequest().authenticated()
            )
            .sessionManagement(manager ->
                manager.sessionCreationPolicy(STATELESS) // No server-side session
            )
            .authenticationProvider(authenticationProvider())
            .addFilterBefore(jwtAuthenticationFilter, // JWT filter runs first
                UsernamePasswordAuthenticationFilter.class);
        return http.build();
    }

    @Bean
    public PasswordEncoder passwordEncoder() {
        return new BCryptPasswordEncoder();
    }

    @Bean
    public AuthenticationProvider authenticationProvider() {
        DaoAuthenticationProvider authProvider = new DaoAuthenticationProvider();
        authProvider.setUserDetailsService(userService.userDetailsService());
        authProvider.setPasswordEncoder(passwordEncoder());
        return authProvider;
    }

    @Bean
    public AuthenticationManager authenticationManager(AuthenticationConfiguration config)
            throws Exception {
        return config.getAuthenticationManager();
    }
}
```

Unit tests:

```java
@Test
@DisplayName("Signin: if token is cached, should not query the DB")
void testSignInWithCachedToken() {
    when(tokenCacheService.getToken(TEST_EMAIL)).thenReturn(TEST_TOKEN);

    JwtAuthenticationResponse response = authenticationService.signin(
            SigninRequest.builder().email(TEST_EMAIL).accessKey(TEST_PASSWORD).build()
    );

    assertEquals(TEST_TOKEN, response.getToken());
    verifyNoInteractions(authenticationManager); // No DB + BCrypt call should happen
    verifyNoInteractions(userRepository);
}

// Invalid token test: SecurityContext should remain empty
@Test
@DisplayName("With an invalid token, SecurityContext should remain empty")
void testDoFilterInternalInvalidToken() throws Exception {
    when(request.getHeader("Authorization")).thenReturn("Bearer " + INVALID_TOKEN);
    when(jwtService.extractUserName(INVALID_TOKEN)).thenReturn(TEST_EMAIL);
    when(userService.userDetailsService()).thenReturn(userDetailsService);
    when(userDetailsService.loadUserByUsername(TEST_EMAIL)).thenReturn(userDetails);
    when(jwtService.isTokenValid(INVALID_TOKEN, userDetails)).thenReturn(false);

    jwtAuthenticationFilter.doFilterInternal(request, response, filterChain);

    verify(filterChain).doFilter(request, response);
    assertNull(SecurityContextHolder.getContext().getAuthentication());
}
```

Summary and Conclusion

With this architecture, we have built not just a secure login screen but a system that is highly scalable, relieves database bottlenecks with caching, and meets high availability (HA) requirements. In particular, the modern architecture offered by Spring Boot 3 has made the security layer much more flexible. If you are starting a large-scale microservice project, you can design token management from the outset in this "stateless" and "cached" manner.
Here’s a problem I kept running into: I need a company’s brand assets (their logo, their colors, maybe a hero image), and there’s no API for it.

You’re building a white-label dashboard. Or a proposal generator. Or an integration that sends branded emails on behalf of customers. Every time, you end up on their website, right-clicking “Inspect Element,” eyedropping hex codes, and downloading a pixelated PNG from their footer. It’s tedious, it breaks when they redesign, and it doesn’t scale.

So I built OpenBrand, an open-source library that extracts brand assets from any URL. Give it a website, get back structured JSON with logos, colors, and backdrop images. No API key needed if you run it as a library.

The Problem Is Harder Than It Looks

You might think: “Just scrape the <link rel='icon'> and call it a day.” But favicons are 16x16 pixels. That’s not a logo; that’s a logo for ants.

Real brand extraction needs to handle:

- Logo detection. Companies put their logos in wildly different places. Some use an <svg> in the header. Some use an <img> with a class like .site-logo or .brand. Some only have it as an Open Graph image in their <meta> tags. Some have it nowhere obvious, and you need to check their favicon manifest for higher-resolution variants.
- Color extraction. The brand’s primary color might be in CSS custom properties (--brand-primary), in computed styles on key elements, in their stylesheet as the most-used non-white/non-black color, or embedded in their logo SVG. And you need to distinguish between “the brand color” and “the color they use for body text.”
- Backdrop images. Hero images, background gradients, and Open Graph images are useful for building branded experiences, but they’re scattered across different DOM locations and meta tags.

The point is: there’s no standard for where brands put their assets. Every website is a snowflake.

How OpenBrand Works

OpenBrand uses server-side HTML scraping with Cheerio and image analysis with Sharp. No headless browser, no Puppeteer, just direct HTTP requests and intelligent heuristics.

Here’s the approach:

```javascript
import * as cheerio from 'cheerio';

// Fetch the page HTML with a browser-like User-Agent
const html = await fetch('https://stripe.com', {
  headers: { 'User-Agent': 'Mozilla/5.0 ...' }
}).then(r => r.text());

// Parse with Cheerio (jQuery-like DOM API for Node.js)
const $ = cheerio.load(html);

// Run extraction heuristics across the parsed markup
```

For sites that block direct requests, it falls back to Jina Reader, a service that renders pages and returns clean content.

The extraction pipeline runs in this order:

1. Logos: Check <svg> elements in header/nav, <img> elements with logo-related classes/IDs, <link rel="icon"> manifest for high-res variants, Open Graph/Twitter card images as fallback
2. Colors: Extract theme-color meta tags, parse manifest.json, sample dominant colors from logo images using Sharp
3. Backdrops: Find Open Graph images, hero/banner images, background images on key sections

The library returns structured data:

```typescript
import { extractBrandAssets } from "openbrand";

const result = await extractBrandAssets("https://stripe.com");
if (result.ok) {
  console.log(result.data.brand_name);      // "Stripe"
  console.log(result.data.logos);           // LogoAsset[] - SVGs, PNGs with URLs and dimensions
  console.log(result.data.colors);          // ColorAsset[] - hex values with context
  console.log(result.data.backdrop_images); // BackdropAsset[] - hero images, backgrounds
}
```

Three Ways to Use It

As an npm package (no API key, runs on your server):

```shell
npm add openbrand
```

```typescript
import { extractBrandAssets } from "openbrand";

const result = await extractBrandAssets("https://linear.app");
```

Lightweight and fast, with no browser process to manage. Good for build scripts, CI pipelines, serverless functions, or backend services.

As an API (free API key from openbrand.sh):

```shell
curl "https://openbrand.sh/api/extract?url=https://stripe.com" \
  -H "Authorization: Bearer your_api_key"
```

Good for client-side apps or anywhere you want a simple HTTP call.

As an agent skill (for Claude Code, Cursor, Codex, Gemini CLI):

```shell
npx skills add ethanjyx/openbrand
```

Then just ask your AI agent: “Extract brand assets from linear.app.” This is probably the most interesting distribution channel: 40+ AI coding agents can use it as a tool.

What I Got Wrong (And What I’d Do Differently)

Some honest takes on the tradeoffs:

- Static HTML has limits. We don’t execute JavaScript, which means heavily SPA-dependent sites may not expose all their brand assets in the initial HTML. In practice, this matters less than you’d think: logos, favicons, OG tags, and most brand-relevant markup live in static HTML. For the few sites where it fails, the Jina Reader fallback helps. We chose speed and simplicity over completeness.
- Logo detection is fuzzy. There’s no semantic HTML tag for “this is the company’s logo.” Heuristics work well for ~85% of sites but break on unusual layouts. Some sites put their logo in a <div> with a background image. Some use CSS mask-image. The current approach has a priority-ranked list of strategies, but it’s not perfect.
- Color extraction conflates brand color with design system color. A company might use blue as its brand color but green for its primary CTA buttons. OpenBrand currently returns both without distinguishing between them. This is a known limitation: brand identity and UI design tokens overlap but aren’t identical.
- Rate limiting. If you’re extracting from many URLs, you need to be respectful. The API has rate limits built in, but the npm package doesn’t throttle; that’s your responsibility.
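Since throttling is left to the caller when the package is used as a library, here is a minimal sketch of one way to space out requests. Everything in it (`MinIntervalThrottle`, `throttledMap`, the interval value) is a hypothetical helper written for illustration, not part of OpenBrand; the `worker` argument stands in for whatever extraction call you make.

```typescript
// Fixed-interval throttle: at most one call may start per `minIntervalMs`.
// Hypothetical helper for illustration; not part of OpenBrand's API.
class MinIntervalThrottle {
  private nextFreeAt = 0;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserves the next slot and returns how many milliseconds the caller
  // should wait before starting its request, given the current time.
  reserve(nowMs: number): number {
    const startAt = Math.max(nowMs, this.nextFreeAt);
    this.nextFreeAt = startAt + this.minIntervalMs;
    return startAt - nowMs;
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Runs `worker` over `urls` one at a time, spacing call starts via the throttle.
async function throttledMap<T>(
  urls: string[],
  worker: (url: string) => Promise<T>,
  throttle: MinIntervalThrottle,
): Promise<T[]> {
  const results: T[] = [];
  for (const url of urls) {
    await sleep(throttle.reserve(Date.now()));
    results.push(await worker(url));
  }
  return results;
}
```

A call such as `throttledMap(urls, (u) => extractBrandAssets(u), new MinIntervalThrottle(1000))` would then start at most one extraction per second. The right interval depends on the target sites, so treat the number as a placeholder.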
Where This Is Actually Useful

Real use cases I’ve seen or built:

- White-label SaaS: Automatically theme a customer’s dashboard using their brand colors on first login
- Proposal/invoice generators: Pull the client’s logo and colors to brand documents without asking them to upload assets
- Competitive analysis tools: Track how competitors’ branding evolves over time
- AI agents: Give LLMs the ability to “see” a brand without manual configuration, useful for generating branded content, emails, or presentations
- Design system bootstrapping: Start a new project by extracting the brand’s existing visual language

Try It

The repo is at github.com/ethanjyx/openbrand. MIT licensed.

The fastest way to see if it works for your use case:

```shell
npm add openbrand
node -e "
import('openbrand').then(async ({extractBrandAssets}) => {
  const r = await extractBrandAssets('https://your-target-site.com');
  if (r.ok) console.log(JSON.stringify(r.data, null, 2));
  else console.error(r.error);
});
"
```

If you find sites where the extraction breaks, open an issue; the heuristics improve with every edge case.
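For the white-label theming use case listed above, the extracted colors still have to be narrowed down to a single accent color. The sketch below is one crude way to do that, assuming the colors have already been reduced to plain `#rrggbb` strings. The helper names and the `minChroma` threshold are hypothetical, and this is not how OpenBrand itself ranks colors.

```typescript
type Rgb = { r: number; g: number; b: number };

// Parses "#rrggbb" (leading "#" optional); returns null for any other format.
function parseHex(hex: string): Rgb | null {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (match === null) return null;
  const value = parseInt(match[1] ?? "", 16);
  return { r: (value >> 16) & 0xff, g: (value >> 8) & 0xff, b: value & 0xff };
}

// Channel spread: 0 for white, black, and greys; up to 255 for vivid colors.
function chroma(c: Rgb): number {
  return Math.max(c.r, c.g, c.b) - Math.min(c.r, c.g, c.b);
}

// Picks the most vivid color whose spread reaches `minChroma`, skipping
// near-neutral values that are more likely text or background colors.
function pickThemeColor(hexColors: string[], minChroma = 40): string | null {
  let best: string | null = null;
  let bestChroma = -1;
  for (const hex of hexColors) {
    const rgb = parseHex(hex);
    if (rgb === null) continue;
    const spread = chroma(rgb);
    if (spread >= minChroma && spread > bestChroma) {
      best = hex;
      bestChroma = spread;
    }
  }
  return best;
}
```

Choosing by vividness alone will sometimes pick a call-to-action color over the brand color, which is the same conflation described in the tradeoffs section, so a human override is worth keeping.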
There is a specific kind of silence that falls in a war room after a breach. I've been in two of them. Not as the person responsible, but as the journalist who got the call. The first was at a mid-sized fintech in 2019. The second, more recently, was at a SaaS company that had been operational for less than eighteen months.

In both cases, the root cause wasn't sophisticated. No nation-state actor. No zero-day that nobody had ever seen. In both cases, someone had built an API without thinking seriously about who, or what, would be on the other end of it. And the results were exactly what you'd expect when you hand a loaded system to the world with the safety off.

I think about those rooms a lot when I read the breach reports. Which is often.

The Scale of a Problem We Keep Pretending Is Solvable Later

Let's start with numbers, because the numbers are damning.

In 2025 alone, APIs accounted for 11,053 of the 67,058 published security bulletins, roughly 17% of all reported software vulnerabilities, making them one of the largest single attack surfaces in modern software. That figure has been climbing year over year, and the trajectory shows no signs of flattening. Nearly half of the newly added CISA Known Exploited Vulnerabilities in 2025 (106 of 245, or 43%) were API-related. No other single surface comes close.

Despite this, only 21% of organizations report a high ability to detect attacks at the API layer. And a mere 13% can prevent more than half of incoming API attacks.

Read that again. Thirteen percent. In an era where APIs are the connective tissue of virtually every digital product and service (banking, healthcare, logistics, authentication, payments), the overwhelming majority of organizations cannot stop more than half of the attacks aimed at their most exposed surfaces.

That's not a gap. That's a structural failure. And the reason it persists is not technical. The technology to build secure APIs exists. It has existed for years. The reason it persists is cultural: the industry keeps treating security as a phase of development rather than a dimension of it.

A Brief and Uncomfortable History of Recent Mistakes

To understand why security by design matters, you have to understand what security by neglect actually looks like at scale. The past eighteen months have been instructive.

In February 2024, a leaky API at Spoutible exposed user data, including bcrypt-hashed passwords. In March, nearly 13 million API secrets were exposed through public GitHub repositories, leaving companies vulnerable as attackers exploited the credentials to gain unauthorized access. In April, critical vulnerabilities in PandaBuy's API led to the theft of data affecting 1.3 million users. In May, attackers accessed Dropbox's production environment via compromised API keys, exposing customer data and multi-factor authentication information.

A separate incident that same year involved a buggy API that granted unauthorized access to 650,000 sensitive messages, leaked Office 365 credentials, and allowed a penetration tester to retrieve a trove of confidential communications. A Trello API exposure compromised over 15 million users by linking private email addresses with public Trello account data.

These are not edge cases. They are the mode. The average, repeated, utterly predictable outcome of building fast and securing later.

But the incident I keep returning to, the one that should have been a defining moment of reckoning for how technical teams think about credential management, happened in July 2025. Marko Elez, a 25-year-old DOGE employee with access to sensitive databases at the Social Security Administration, the Treasury and Justice departments, and the Department of Homeland Security, committed a code script to GitHub called "agent.py" that included a private API key for xAI. That single exposed key unlocked access to at least 52 large language models, including one called "grok 4-0709" created just four days before the leak.

Here is the part that matters most: after security researcher Philippe Caturegli of Seralys alerted Elez to the exposure, the GitHub repository was removed, but the API key itself was not revoked, and access to the models remained active. The repo was gone. The damage was still live.

Tom Pohl, Director of Penetration Testing at LMG Security, put it bluntly: "If you can't rotate a key without rebuilding or redeploying code, you don't own the key; it owns you." That sentence deserves to be printed and framed in every engineering office that has ever shipped a credential inside a config file.

Caturegli was even more pointed: "One leak is a mistake. But when the same type of sensitive key gets exposed again and again, it's not just bad luck, it's a sign of deeper negligence and a broken security culture."

And this, right here, is the core problem. It was not the first time a DOGE staffer had leaked an xAI key. It was the second, the first having been discovered in May of the same year, with keys granting access to custom LLMs built on Tesla and SpaceX internal data. Same organization. Same class of mistake. Different month. A broken security culture doesn't produce one incident. It produces a pattern.

What "Security by Design" Actually Means, and What It Doesn't

Security by design is a phrase that has been so thoroughly absorbed into vendor marketing that it has nearly lost all meaning. Every platform claims it. Every white paper invokes it. Most of them are describing something considerably less rigorous than the words suggest.

What it actually means is this: security properties are not features you add to a system. They are constraints under which you build one. The difference is not semantic. It is architectural, and it shows up in every technical decision the team makes from the first commit forward.

There is a startup (cloud-native, public-cloud Kubernetes deployment, handling user profile data and financial transactions) whose build process I've been examining closely. They had six months, a small team, regulatory obligations around data protection and access logging, and a performance mandate that ruled out heavyweight solutions. Exactly the kind of constraints that, in most shops, produce the decision to defer security work until post-launch.

They didn't defer it. What they did instead is worth studying in detail.

Authentication: The 15-Minute Decision That Changes Everything

The team chose short-lived JWT access tokens with a 15-minute expiration window. This sounds minor. It isn't.

A JWT consists of three parts: a header, a payload, and a signature. The signature exists to guarantee that the data transmitted in the token hasn't been tampered with. If signature verification is missing or improperly implemented, an attacker can forge the token entirely, changing the user identifier in the payload to point to a different account and gaining unauthorized access to that user's data. This is not a theoretical attack. It has been the root cause of real production breaches in the past two years.

JWT misuse is consistent: APIs accept unsigned tokens (the so-called "alg=none" vulnerability) or fail to rotate signing keys on any predictable schedule. Both failures extend the window during which a compromised token remains useful to an attacker. A 15-minute expiration collapses that window. It doesn't eliminate the risk of token theft, but it radically limits what theft can accomplish.

The operational cost was real. Building a secure refresh flow and revocation mechanism added engineering complexity the team's timeline didn't easily accommodate. They built it anyway. The logic was simple: a token that expires in 15 minutes is a recoverable problem. A token valid for eight hours, or one with no expiration claim at all, is an open door with a handshake.
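To make the two failure modes above concrete, here is a minimal sketch of the claim checks a service might run on an already-decoded token: reject any algorithm outside an explicit allow-list (which rules out "none"), and reject tokens that are expired, lack an expiry, or were issued with a lifetime beyond 15 minutes. The type and function names are hypothetical and the HS256-only list is an assumption. This does not replace signature verification, which should be left to a maintained JWT library and performed as well.

```typescript
interface JwtHeader {
  alg?: string;
}

interface JwtPayload {
  sub?: string;
  iat?: number; // issued-at, seconds since epoch
  exp?: number; // expiry, seconds since epoch
}

type ClaimCheck = { ok: true } | { ok: false; reason: string };

// Assumption: the service signs with HS256 only; adjust to your own setup.
const ALLOWED_ALGS: string[] = ["HS256"];
const MAX_LIFETIME_SECONDS = 15 * 60;

// Checks header and payload claims of a token whose signature is verified
// separately. `nowSeconds` is injected so the check is easy to test.
function checkTokenClaims(
  header: JwtHeader,
  payload: JwtPayload,
  nowSeconds: number,
): ClaimCheck {
  if (header.alg === undefined || ALLOWED_ALGS.indexOf(header.alg) === -1) {
    return { ok: false, reason: "algorithm not allowed" };
  }
  if (typeof payload.iat !== "number" || typeof payload.exp !== "number") {
    return { ok: false, reason: "missing iat or exp" };
  }
  if (payload.exp - payload.iat > MAX_LIFETIME_SECONDS) {
    return { ok: false, reason: "lifetime exceeds 15 minutes" };
  }
  if (nowSeconds >= payload.exp) {
    return { ok: false, reason: "token expired" };
  }
  return { ok: true };
}
```

Enforcing the maximum lifetime on the verifying side, and not only when issuing, means a misconfigured issuer cannot quietly hand out eight-hour tokens.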
What they also did, which is less commonly discussed, was enforce rate limiting on authentication endpoints specifically. Authentication endpoints with no rate limiting are exactly what credential stuffing campaigns are designed to exploit. Removing that surface isn't complex. It is, however, a decision that has to be made early, because adding it to a live production system that wasn't designed with it creates friction, and friction, in engineering teams under delivery pressure, tends to lose.

Authorization: The Boring Problem That Breaks Everything

If authentication is who you are, authorization is what you're allowed to do. Most security discourse focuses on authentication: it's the dramatic failure mode, the stolen password, the compromised token. Authorization failures are quieter and, in practice, significantly more common.

The startup implemented role-based access control from day one, with authorization checks enforced at every endpoint: not just at the UI layer, not just at the gateway, but at the endpoint.

Authorization checks must happen at every API endpoint. Access should be granted only to permitted resources, based on user roles and the sensitivity of the resource being requested. This sounds like an obvious design principle. It is frequently violated.

Consider what happens when it isn't: a backend API endpoint left unauthenticated generates an OAuth 2.0 app-only access token for Microsoft Graph via the client credentials flow. The token carries high-privilege application permissions (User.Read.All), enabling complete directory enumeration. Since no authentication or caller restrictions were enforced, anyone on the internet could obtain a valid Graph token and directly query Microsoft Graph endpoints, exposing the information of over 50,000 Azure AD users at a single organization.

The misconfigured API in that case wasn't a legacy system running on forgotten infrastructure. It was a modern integration with a modern identity provider, built without authorization checks because nobody on the team had stopped to ask: what happens if someone calls this endpoint who shouldn't be calling it?

The startup asked that question at the beginning. They started with broader roles, refined them incrementally as the product matured, and made least-privilege a principle rather than an optimization. It added policy complexity. It also meant no single compromised credential could traverse the system laterally.

Input Validation: Why Allow-Lists Win

The team chose strict allow-lists for request validation: every field, every endpoint, every time.

The distinction between allow-listing and block-listing matters more than most developers appreciate. Block-listing is intuitive: you identify known bad inputs and reject them. The problem is that the set of known bad inputs is never complete. Attackers have been innovating on injection techniques for decades. Any block-list you write today will have gaps tomorrow.

Allow-listing inverts the logic. You define exactly what is acceptable (specific data types, character sets, length constraints) and reject everything that falls outside those boundaries. It is more rigid to implement and requires more upfront design work. It is also substantially more effective, because it doesn't depend on the defender knowing what the attacker will try.

In 2025, injection attacks dropped from first to second place in API attack volume, but remained in the top two every single quarter. They are particularly relevant as AI-driven APIs pass untrusted input directly into models and downstream pipelines. The migration of business logic into AI-backed APIs hasn't reduced the injection surface. It has expanded it, because an LLM that processes untrusted text is an injection target with additional downstream consequences.

Rate limiting ran alongside validation.
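Before turning to the rate-limiting thresholds, the allow-list idea described above can be sketched as follows. The schema (`PROFILE_RULES`), field names, and patterns are hypothetical examples, not the startup's actual rules. The point is the shape: every accepted field is declared with a character set and a length bound, and anything undeclared or out of bounds is rejected.

```typescript
type FieldRule = { pattern: RegExp; maxLength: number };

// Hypothetical allow-list for a "create profile" request body.
const PROFILE_RULES: Record<string, FieldRule> = {
  username: { pattern: /^[a-z0-9_]+$/, maxLength: 32 },
  displayName: { pattern: /^[A-Za-z0-9 .'-]+$/, maxLength: 64 },
  countryCode: { pattern: /^[A-Z]{2}$/, maxLength: 2 },
};

// Returns a list of violations; an empty list means the body is acceptable.
function validateAllowList(
  body: Record<string, unknown>,
  rules: Record<string, FieldRule>,
): string[] {
  const errors: string[] = [];

  // Reject fields that are not declared in the allow-list at all.
  for (const key of Object.keys(body)) {
    if (!Object.prototype.hasOwnProperty.call(rules, key)) {
      errors.push("unexpected field: " + key);
    }
  }

  // Every declared field must be a string within its length and character set.
  for (const key of Object.keys(rules)) {
    const rule = rules[key];
    if (rule === undefined) continue;
    const value = body[key];
    if (typeof value !== "string") {
      errors.push("missing or non-string field: " + key);
      continue;
    }
    if (value.length === 0 || value.length > rule.maxLength || !rule.pattern.test(value)) {
      errors.push("invalid value for field: " + key);
    }
  }
  return errors;
}
```

Note that nothing here tries to recognize an attack string. An injection payload fails simply because it contains characters or fields the schema never allowed.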
The team set conservative per-user thresholds — tight enough to curb abuse, loose enough not to block legitimate traffic. They accepted minor throughput overhead in exchange for suppressing malicious burst patterns. Insecure resource consumption — driven by automated scraping, enumeration, and denial-of-service patterns — rose from seventh place in 2024 to fourth in 2025 and held that position through the year. Rate limiting is not a performance feature. It is a defense against a threat class that has been growing consistently for two years. Secrets Management: The Problem That Keeps Appearing in Headlines The startup used a managed secrets vault with automatic rotation. No credentials existed in the codebase. No API keys in config files. No database passwords in environment variables committed to version control. This sounds basic. It is, in fact, the single most commonly violated principle in production API security. GitGuardian found more than 10 million secrets exposed in public repositories in a single year. The DOGE/xAI incidents weren't anomalies. They were illustrations of the norm — the everyday practice of developers treating credentials as configuration rather than secrets, embedding them in code because it's convenient, and discovering the cost of that convenience only after something goes wrong. LMG Security's Tom Pohl noted at DEF CON that he's found Apple- and Google-blessed TLS certificates with their private keys embedded in Fortinet firewall firmware — not expired, valid production certificates — by simply unzipping firmware and searching for keywords. Hardcoded admin credentials in network appliances, AES keys in compiled Java JARs, authentication tokens in printer firmware. These aren't advanced techniques to find. They are basic. The startup's architecture made this entire class of exposure impossible by design. The vault handled issuance and rotation. No developer ever touched a raw credential. Initial setup took time. 
Ongoing rotation policies added maintenance overhead. The tradeoff was explicit: accept operational complexity now, or accept the risk of a credential aging quietly in a repository until someone finds it, which, based on the data, will happen. DevSecOps: The Pipeline That Complains Until It Matters The team wired static code analysis, dependency scanning, and container-image checks into the CI/CD pipeline on every commit. The first two weeks, by the lead developer's own account, were genuinely annoying. Builds slowed. False positives fired. Developers had opinions about this. Then the pipeline caught a vulnerable dependency in a third-party authentication library before it reached production. A real vulnerability, in a library the team was actively using, was caught before it became a runtime problem. The complaints stopped. GitLab's 2024 Global DevSecOps Survey found that while 56% of developers release code multiple times daily, only 29% have fully integrated security into their workflows. That gap is where the exposure lives. The velocity of modern development — multiple deployments per day, hundreds of dependencies, automated container builds — creates a surface area that no human review process can cover consistently. Automated scanning doesn't slow development down in any meaningful sense. What it does is enforce a consistent standard at a pace that matches the delivery cadence. The container-image scanning deserves specific attention. Kubernetes deployments in public cloud environments create a supply chain: every image that runs in a pod is either verified or trusted on faith. When an organization integrates a third-party service via an API, it inherits the security posture of that vendor — and vetting that posture is not a one-time event. It requires continuous assurance as the vendor's environment changes. Scanning every image on every commit is the only way to catch the moment when that inherited posture degrades. 
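Of the controls described so far, allow-list validation is the easiest to show in a few lines. The sketch below is illustrative only: the field names, patterns, and limits in `ALLOWED_FIELDS` are hypothetical and are not taken from the startup's codebase. It shows the core idea from the input validation section, which is to define exactly what is acceptable and reject everything else, including fields that were never declared.

```python
import re

# Hypothetical allow-list for a "create user" request: each permitted field
# maps to the exact type, length limit, and character set it may contain.
ALLOWED_FIELDS = {
    "username": {"type": str, "max_len": 32,
                 "pattern": re.compile(r"[a-z0-9_]+")},
    "email": {"type": str, "max_len": 254,
              "pattern": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")},
    "age": {"type": int, "min": 13, "max": 130},
}


def validate(payload: dict) -> list:
    """Return a list of violations; an empty list means the payload is accepted."""
    errors = []
    # Reject any field that is not explicitly declared in the allow-list.
    for name in payload:
        if name not in ALLOWED_FIELDS:
            errors.append(f"unexpected field: {name}")
    for name, rule in ALLOWED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
            continue
        value = payload[name]
        # Exact type check: bool is a subclass of int, so isinstance() would
        # wrongly accept True/False for an integer field.
        if type(value) is not rule["type"]:
            errors.append(f"wrong type for {name}")
            continue
        if rule["type"] is str:
            # fullmatch() requires the whole string to fit the character set.
            if len(value) > rule["max_len"] or not rule["pattern"].fullmatch(value):
                errors.append(f"invalid value for {name}")
        elif not rule["min"] <= value <= rule["max"]:
            errors.append(f"out of range: {name}")
    return errors
```

Note that nothing here tries to recognize an attack string. An injection attempt in `username` fails for the same reason a stray space would: it falls outside the declared character set.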
The Architecture That Doesn't Make Headlines There is something worth acknowledging about this startup's outcome: it is, on its face, unremarkable. The API launched on schedule. No major incidents in production. No breach notification letters. No postmortem was published to a shocked engineering community. The compliance audit found nothing to flag. The system performs within the latency targets the product team required. This is what success looks like in security. Not a dramatic rescue. Not a last-minute patch before a zero-day hit production. Nothing happening — because the conditions for something happening were designed out from the beginning. Only 13% of organizations can prevent more than half of API attacks. The startup is in that 13%, not because they had a larger security budget or a more experienced team. They had six months and a limited headcount. They are in that 13% because they decided, at the beginning, that security was a design constraint rather than a delivery risk. That decision compounded. Short-lived tokens meant that when credentials inevitably cycle through exposure risk — every public API has this surface — the blast radius was bounded by time. RBAC enforced from day one meant no credential, however obtained, could traverse the full system. Allow-list validation meant the injection surface never existed in the first place. Vault-managed secrets meant the DOGE scenario — the credential in the commit, the key that keeps working after the repo comes down — was structurally impossible. These controls did not add up to a sum greater than their parts. They composed. Each one reduced the value of defeating the others. The Debate That Needs to Happen Here is where I want to be direct, because there is a conversation the industry is not quite having, honestly. Security by design is often framed as a best practice — something well-resourced teams do when they have the luxury of time and the maturity to prioritize it. 
The implicit message is that it's an ideal, not an expectation. That startups with six-month timelines and small teams should be forgiven for the security debt they accumulate, because they were moving fast, and the alternative was not shipping. I think this framing is doing serious damage. And I think the damage is not abstract. When the Trello API exposed 15 million users' private email data, those were real people. When the Spoutible breach surfaced bcrypt-hashed passwords, those were real credentials that real attackers ran real cracking attempts against. When a ChatGPT plugin vulnerability sat unpatched for nearly a year while proof-of-concept exploit code was publicly available, and then received over 10,000 exploitation attempts from a single IP address within a single week in March 2025 — those were real API consumers, real integrations, real downstream systems exposed. The cost of retrofitting security is not paid by the engineering team that deferred it. It is paid by the users who trusted the product. IBM's 2024 Cost of a Data Breach report established the global average breach cost at $4.88 million. That number includes incident response, regulatory exposure, reputational damage, and customer churn. It does not include the class action exposure that follows significant PII breaches, the partner contract reviews that get triggered by security incidents, or the months of engineering work that go into rebuilding user trust after a disclosure. The startup in this case study spent engineering hours upfront on refresh token flows, RBAC policies, and vault configuration. I would estimate — generously — a few weeks of additional development time across the team. That is the cost of security by design for a product of this scale. The cost of the alternative is measured in a different currency entirely. What the Next Eighteen Months Will Make Worse There is a dimension to this problem that the industry is only beginning to grapple with seriously. 
Of the 2,185 AI vulnerabilities identified in 2025, 36% also qualified as API vulnerabilities. Among AI-related Known Exploited Vulnerabilities, the overlap was identical — 21 of 58 exploited AI vulnerabilities involved APIs directly. As AI matures, its risks don't shift elsewhere. They still come through APIs. The integration of LLMs into production systems has expanded the API attack surface in a specific and poorly understood way. When a user input reaches an LLM endpoint, it is no longer just a request for data. It is an instruction to a system that generates outputs, triggers downstream actions, and in agentic configurations, executes code. Injection attacks against these endpoints don't just exfiltrate data — they can redirect behavior, manipulate outputs, and compromise the integrity of anything the model produces. The Model Context Protocol, which serves as the control-plane API for autonomous agents, had already accumulated 315 documented vulnerabilities as of 2025, accounting for 14.4% of all AI vulnerabilities. From Q2 to Q3, MCP vulnerabilities increased by 270%. The common failure modes are familiar: over-permissioned tools, direct API access without adequate authentication and authorization, and the absence of runtime enforcement. The same failures that produced the Trello breach. The same failures that produced the DOGE API key incidents. The same failures that have been producing API breaches for a decade, now running on infrastructure that can act autonomously in response to compromised inputs. Security by design is not a practice that AI-era architecture has made optional. It's one that the AI era has made urgent. Five Things That Are True and Worth Arguing About I want to close with positions, not summaries. These are the things I believe the evidence supports, and the things I expect reasonable engineers to push back on. 1. Short token lifetimes are not an operational burden. They are an operational discipline. 
The argument against 15-minute JWTs is always some version of "the refresh flow is complex." The counterargument is what happens when a 24-hour token belonging to an admin user gets harvested from a compromised device. Complexity in the refresh mechanism is a solved engineering problem. A valid admin token circulating in attacker infrastructure for 24 hours is not. 2. DevSecOps scanning is not optional at modern delivery velocities. If your team ships multiple times per day, human review cannot maintain consistent security coverage across that surface. Automation doesn't replace judgment. It enforces the standards that judgment has already established, at the speed the pipeline requires. 3. Secrets in code are not a developer error. They are an architectural failure. If the path of least resistance in a codebase is to put a credential in a config file, the architecture created that path. Pre-commit hooks, automated scanning, and vault integration don't prevent this class of exposure by catching it after the fact. They prevent it by making the wrong path harder than the right one. 4. RBAC granularity and security are not in tension. The argument that fine-grained access controls are too complex to maintain is, in practice, an argument that the team hasn't built tooling to manage them. That's a different problem. Broad permissions aren't simpler — they're deferred complexity that manifests as blast radius during an incident. 5. The industry needs to stop calling security a best practice. Best practices are things you do when you have the resources and culture to do them. Security is a property of the system that either exists or doesn't. If it doesn't exist at launch, the users bear the cost — not the engineering team, not the investor, not the person who made the timeline call. The people who trusted the product. The Unglamorous Conclusion The startup I described in this piece didn't do anything novel. 
There are no proprietary techniques here, no advanced threat modeling frameworks that require external consultants, no six-figure tooling budget. The OWASP API Security Top 10 has documented the dominant failure modes for years. The defenses are known. The implementation patterns are well-established. The engineering patterns — vault-managed secrets, short-lived tokens, RBAC, allow-list validation, CI/CD scanning — are all things that every engineering team working on a production API could implement on a standard startup budget. What this team had was not resources. It was a decision, made early and maintained under pressure, that security was a design constraint and not a delivery variable. They treated every tradeoff explicitly — token lifetime versus convenience, RBAC granularity versus overhead, scan depth versus build speed — and made those tradeoffs in writing, with awareness of what they were accepting in each direction. That is security by design. Not a posture. Not a framework. A decision about what kind of architecture you are building, made before the architecture exists. The alternative — and the industry's dominant practice — is to build the architecture, ship it, and discover what kind of security it has when someone tells you what they found. Brute force attacks moved into the top three API breach methods in 2025. DDoS and fraud remain the most frequent vectors. Injection hasn't left the top two in any quarter of the year. None of this is new intelligence. None of it is surprising to anyone who has been reading the threat reports. The gap isn't knowledge. The gap is will — and sometimes, a concrete model of what it looks like when someone actually closes it. This analysis is grounded in documented case study materials, publicly reported breach data, and open-source threat research. The startup referenced declined attribution. All technical claims are independently sourced and footnoted above.
My data catalog project was the third time in my career that I had led a catalog implementation. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right.

We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us.

Why Two Tools?

Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically-maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually-maintained catalog falls behind the actual state of the data within months.

Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately.

Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview.

The Purview Setup

Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines.
Scans run daily for production data sources and weekly for development.

Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which upstream tables those pipelines read from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset.

Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect (about 85% accurate in our testing), but it surfaces PII-candidate columns that manual review would miss.

Python
# Purview scan configuration (REST API)
import requests


def create_purview_scan(account_name, collection, data_source):
    # get_auth_headers() is not defined in this snippet; it is expected to
    # return a dict of HTTP headers carrying a valid bearer token.
    url = (f"https://{account_name}.purview.azure.com/scan/datasources/"
           f"{data_source}/scans/daily-production-scan")
    body = {
        "kind": "AzureStorageMsi",
        "properties": {
            "scanRulesetName": "custom-pii-ruleset",
            "scanRulesetType": "Custom",
            "collection": {"referenceName": collection},
            "credential": {
                "referenceName": "managed-identity",
                "credentialType": "ManagedIdentity"
            }
        },
        "trigger": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timezone": "UTC"
            }
        }
    }
    resp = requests.put(url, json=body, headers=get_auth_headers())
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return resp.json()


# Custom classifier for internal account numbers
custom_classifier = {
    "kind": "Custom",
    "properties": {
        "classificationName": "INTERNAL_ACCOUNT_NUMBER",
        # The regex below matches "ACC" followed by 9 digits (12 characters in total).
        "description": "Internal 12-character account number format (ACC + 9 digits)",
        "classificationRule": {
            "kind": "Regex",
            "pattern": "^ACC[0-9]{9}$",
            "minimumPercentageMatch": 75
        }
    }
}

The Collibra Integration

We built a nightly sync that reads technical metadata from Purview via its REST API and
creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views.

Building this sync took about six weeks of engineering time. It's the part of the implementation where I most seriously considered using an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance.

Python
# Nightly Purview-to-Collibra metadata sync (Python)
from datetime import datetime, timezone


def sync_purview_to_collibra(purview_client, collibra_client):
    """Sync technical metadata from Purview to Collibra assets."""
    # purview_client and collibra_client are wrapper objects, and
    # get_lineage_summary() is a helper; none of them is defined in this snippet.

    # Fetch all cataloged assets from Purview
    purview_assets = purview_client.discovery.query(
        keywords="*",
        filter={"and": [
            {"entityType": "azure_datalake_gen2_path"},
            {"classification": ["confidential", "restricted"]}
        ]},
        limit=1000
    )

    for asset in purview_assets['value']:
        collibra_asset = collibra_client.find_or_create_asset(
            name=asset['name'],
            domain="Data Lake Assets",
            type="Data Set"
        )
        # Sync technical metadata as attributes
        collibra_client.update_attributes(collibra_asset['id'], {
            "Technical Schema": asset.get('schema', ''),
            "Data Classification": asset.get('classification', []),
            "Purview Asset Link": asset['id'],
            "Last Scanned": asset.get('lastScanTime', ''),
            "Lineage Summary": get_lineage_summary(purview_client, asset['id']),
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "Sync Timestamp": datetime.now(timezone.utc).isoformat()
        })

    return {"synced": len(purview_assets['value']),
            "timestamp": datetime.now(timezone.utc).isoformat()}

What Adoption Looked Like

Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo.
After three months, about 30% of our target user base was actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it.

Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change.

Python
# Collibra asset mapping schema for access request workflow
ASSET_MAPPING_SCHEMA = {
    "asset_type": "Data Set",
    "domain": "Data Lake Assets",
    "attributes": {
        "Technical Name": {"type": "text", "source": "purview"},
        "Business Name": {"type": "text", "source": "steward"},
        "Data Classification": {
            "type": "single_select",
            "values": ["public", "internal", "confidential", "restricted"],
            "source": "purview"
        },
        "Owner Team": {"type": "text", "source": "steward"},
        "PII Columns": {"type": "multi_select", "source": "purview"},
        "Quality Certification": {
            "type": "single_select",
            "values": ["certified", "provisional", "uncertified"],
            "source": "steward"
        },
        "Access Request URL": {
            "type": "url",
            "template": "https://collibra.internal/access/{asset_id}"
        }
    },
    "workflow": {
        "access_request": {
            "approvers": ["asset_owner", "data_governance_lead"],
            "sla_hours": 48,
            "auto_revoke_days": 365
        }
    }
}

The Honest Caveat

A data catalog requires ongoing investment that is easy to underestimate. The automated parts (Purview's scanning and discovery) take care of themselves. The business governance parts (glossary maintenance, stewardship assignments, and quality certifications) require human effort that must be budgeted and owned.
Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.
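Returning to the custom classifier from the Purview setup: the sketch below applies the same regex and threshold to a sample of column values, which is a convenient way to test a pattern before registering it. It is a simplified local model, not Purview's implementation. The function name `matches_classifier` is hypothetical, and computing the percentage over distinct, non-empty values is an assumption made for illustration.

```python
import re

# Same regex as the custom classifier: "ACC" followed by exactly 9 digits.
ACCOUNT_PATTERN = re.compile(r"^ACC[0-9]{9}$")


def matches_classifier(values, pattern=ACCOUNT_PATTERN, minimum_percentage_match=75):
    """Return True if enough distinct, non-empty values match the pattern.

    Assumption: the threshold is evaluated over distinct non-empty values.
    """
    distinct = {v.strip() for v in values if v and v.strip()}
    if not distinct:
        return False
    matched = sum(1 for v in distinct if pattern.match(v))
    return 100 * matched / len(distinct) >= minimum_percentage_match


sample = ["ACC123456789", "ACC000000001", "ACC999999999", "n/a"]
print(matches_classifier(sample))  # 3 of 4 distinct values match, which meets the 75% threshold
```

Running a check like this against a few real columns also gives a cheap way to measure false positives and false negatives before a classifier goes into a production scan ruleset.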
Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance, and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering, introduced with Delta Lake 3.0, goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables.

In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management.

Recap: Partitioning and Z-Order Limitations

Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables:

Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult, and the choice is usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts, and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.
Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.
Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain: it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.
Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning.
Data engineers must monitor query patterns and manually decide how to partition or when to re-run Z-Ordering. This ongoing maintenance is time-consuming and error-prone.

In summary, traditional partitioning and Z-Ordering yield performance benefits, but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution.

What Is Liquid Clustering?

Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name "liquid" signifies flexibility: data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include:

Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.
Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys, or file size tuning; the platform handles optimal file sizing and clustering internally. Even high-cardinality columns, which would be impractical as partition keys, can be used effectively.
Flexibility to Change Keys (No Rewrites): Perhaps the most significant aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys.
There’s no massive upfront cost of re-partitioning the entire dataset; past data doesn’t need an immediate rewrite.
Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.
Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks.

Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by offering "set it and forget it" clustering. In fact, thousands of customers have already adopted it: as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables.

Liquid Clustering vs Traditional Methods

Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways:

No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.
Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data.
Maintenance operations like OPTIMIZE still play a role, but they can operate more efficiently since the incoming data is already clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE+ZORDER: it only rearranges data that isn’t well clustered yet, rather than always rewriting everything.
Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.
Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.
Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset was 2.5× faster to cluster with Liquid Clustering than with Z-Ordering, and yielded significantly better query performance than either partitioning or Z-Order.
In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly.

How Liquid Clustering Works (Under the Hood)

At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms:

Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.
Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files; it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.
Data Skipping with Statistics: Delta Lake maintains per-file statistics, such as the minimum and maximum value of each column, that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by keeping those min/max ranges narrow and aligned with query filters.

Enabling Automatic Clustering

To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations).
Many new Databricks accounts have had this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward:

SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering:

SQL
-- Creating a Unity Catalog managed table with Automatic Liquid Clustering
CREATE TABLE main.analytics.user_events (
  user_id STRING,
  event_type STRING,
  event_date DATE,
  details STRING
)
CLUSTER BY AUTO; -- enables automatic liquid clustering on this table

To enable it on an existing table instead:

SQL
ALTER TABLE main.analytics.user_events CLUSTER BY AUTO;

This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.)

PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark:

Python
# df is a DataFrame we want to save as a Delta table with auto clustering
(df.write.format("delta")
    .option("clusterByAuto", "true")
    .mode("overwrite")
    .saveAsTable("main.analytics.user_events_auto"))

The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required; the system will choose keys on its own if you leave them unspecified.)

Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on serverless compute in the background.
The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times.
Using Manual Liquid Clustering (Custom Clustering Keys)
In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it:
Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example:
SQL
-- Create a Delta table clustered by specific columns (manual clustering)
CREATE OR REPLACE TABLE main.analytics.sales_data (
  sale_id BIGINT,
  region STRING,
  product STRING,
  sale_date DATE,
  amount DECIMAL(10,2)
)
CLUSTER BY (region, sale_date);
In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date.
Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance:
SQL
ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date);
This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize.
Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example:
SQL
OPTIMIZE main.analytics.sales_data;
The above will compact small files and cluster data incrementally.
If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data FULL. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time.
PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example:
Python
# Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys
df.write.format("delta") \
    .mode("append") \
    .clusterBy("region", "sale_date") \
    .saveAsTable("main.analytics.sales_data")
Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys.
Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version, after which older clients can no longer read the table. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries.
Conclusion
Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidity of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For Data Engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further: clustering becomes a self-driving process, leveraging query insights to continuously improve performance.
In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start: it can simplify your architecture and ensure your tables automatically stay optimized as your data (and its use cases) grow.
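To make the data-skipping argument concrete, here is a minimal, engine-independent sketch in plain Python. It is not how Delta Lake is implemented; the `FileStats` class, the `files_to_scan` helper, and the file names and region values are all hypothetical and exist only for illustration. It compares how many files a filter on a single region value must touch when every file spans the whole value range (unclustered) versus when each file holds a narrow range (clustered).

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    name: str
    min_value: str
    max_value: str


def files_to_scan(files, value):
    """Return the files whose [min, max] range could contain `value`.

    Any file whose range excludes the value can be skipped without reading it.
    """
    return [f for f in files if f.min_value <= value <= f.max_value]


# Unclustered layout: every file contains rows from all regions,
# so each file's min/max range covers the whole domain.
unclustered = [
    FileStats("part-0", "APAC", "US"),
    FileStats("part-1", "APAC", "US"),
    FileStats("part-2", "APAC", "US"),
    FileStats("part-3", "APAC", "US"),
]

# Clustered layout: each file holds a narrow range of region values.
clustered = [
    FileStats("part-0", "APAC", "APAC"),
    FileStats("part-1", "EMEA", "EMEA"),
    FileStats("part-2", "LATAM", "LATAM"),
    FileStats("part-3", "US", "US"),
]

unclustered_hits = files_to_scan(unclustered, "EMEA")
clustered_hits = files_to_scan(clustered, "EMEA")

print(len(unclustered_hits))  # every file's range contains "EMEA"
print(len(clustered_hits))    # only the file covering "EMEA" qualifies
```

The statistics are the same kind in both layouts; what changes is how tight the ranges are, which is the part clustering is responsible for.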
EMR platforms are unique software beasts. They must live longer than most online apps due to regulatory constraints. A startup may reinvent its primary product every three years, but an EMR system must retain data integrity and workflow consistency for decades. This lifespan is difficult. How do you change a healthcare system without violating strict compliance rules? API-first thinking is the answer. This method goes beyond data endpoint exposure. The issue is architectural survival. In a business where "move fast and break things" is unacceptable, architects may offer modular development, safer changes, and long-term stability by prioritizing the API. The Unique Constraints of EMR Architecture EMRs are not typical CRUD applications. In a standard business app, updating a record might just mean overwriting a row in a database. In healthcare, that simple update triggers a cascade of regulatory realities. Every change requires an audit trail. Data retention policies dictate that information cannot simply vanish. Clinical decisions are based on the history of that data, meaning immutability is often more important than mutability. Furthermore, healthcare workflows are long-lived. A patient's treatment plan might span months or years. An architecture built around short-lived features will crumble under the weight of these persistent workflows. You cannot refactor a database schema overnight if it breaks the continuity of a patient's care record. This is why stability is the paramount quality attribute of any EMR. What “API-First” Really Means in Regulated Systems In the context of regulated systems, API-first means designing contracts before writing a single line of implementation code. It requires treating your APIs as long-term public interfaces, even if the only consumer initially is your own frontend team. They are not internal shortcuts; they are binding agreements. This approach forces you to separate clinical workflows from user interface concerns. 
A button click on a screen is transient; the clinical action it represents is permanent. By defining the API first, you establish a boundary that encapsulates compliance logic. The API becomes the gatekeeper. It enables regulatory compliance regardless of data access via mobile app, web portal, or third-party integration. Contract Stability as a Core Architectural Principle Breaking an API contract in an EMR is far costlier than breaking a UI component. If a button breaks, a user complains. If an API contract breaks, integrations fail, data synchronization stops, and patient care can be impacted. Therefore, request and response models must be designed to survive years of change. Architects must avoid overfitting contracts to current UI needs. Just because a specific screen needs a patient's name and their last three blood pressure readings doesn't mean you should create an endpoint specifically for that view. Instead, design resources that represent the domain accurately. This decoupling protects the backend from the volatility of frontend trends. Backward Compatibility Without Freezing Innovation The fear of breaking existing clients often paralyzes development teams. However, API-first design provides a path to evolve without stagnation. The key is distinguishing between additive changes and destructive changes. Adding a new field to a response is generally safe; removing one or renaming one is not. In .NET Web APIs, versioning strategies are critical. You can support legacy consumers while enabling new features for modern clients. This transforms deprecation from a sudden emergency into a managed process. You provide a sunset period for old versions, giving consumers time to migrate without disruption. In regulated systems, versioning is not a technical afterthought. Explicit versioned routes allow EMR platforms to evolve safely, giving downstream systems time to migrate without disrupting clinical workflows. 
```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/v1/encounters")]
public class EncountersV1Controller : ControllerBase
{
    // IEncounterService is an assumed application service, supplied via dependency injection.
    private readonly IEncounterService _encounterService;

    public EncountersV1Controller(IEncounterService encounterService) =>
        _encounterService = encounterService;

    [HttpPost("{id}/sign")]
    public IActionResult SignEncounter(Guid id)
    {
        // Business rule: encounter must be complete before signing
        _encounterService.Sign(id);
        return Ok();
    }
}
```
Modeling Regulated Workflows Through APIs
Your API should encode business rules and compliance constraints directly. It is dangerous to rely on the UI to validate clinical workflows. If a doctor must sign a note before billing can occur, that rule belongs in the API layer, not in the JavaScript of the frontend.
Consistency: Business rules enforced at the API level apply to every consumer, preventing "workflow drift" between the web portal and mobile apps.
Security: Bypassing the UI via a direct API call (e.g., using Postman) should not allow a user to bypass compliance checks.
Clarity: The API endpoints should reflect real-world clinical states (e.g., POST /encounters/{id}/sign) rather than generic database operations.
API-First and Modular EMR Growth
Monolithic EMRs eventually become unmaintainable. Decoupling large domains like scheduling, assessments, reporting, and case management is possible with API-first design. Well-defined interfaces allow you to upgrade the scheduling engine without affecting the billing module. This modularity supports parallel development. Different teams can work on different modules simultaneously without constant merge conflicts or integration friction. It also lays the foundation for extensibility. If a client needs a custom integration for a specific device, your public-facing API is already robust enough to handle it because it’s the same API you use internally.
.NET-Specific Considerations for API-First EMRs
ASP.NET Core is an excellent framework for building long-lived API platforms. Its middleware pipeline allows you to handle cross-cutting concerns like logging and validation globally. However, structuring your solution requires discipline.
Controllers should be thin, delegating logic to service layers that handle the heavy lifting. Using Data Transfer Objects (DTOs) is non-negotiable. Never give API consumers access to internal domain entities or Entity Framework models. DTOs act as a buffer that lets you refactor the database schema without breaking the public contract. Your architecture should treat validation, authorization, and detailed auditing as first-class concerns, not afterthoughts.
DTO boundaries are a compliance safeguard. They allow internal schema evolution while preserving external contracts, which is critical for EMR platforms that must retain compatibility over decades.
```csharp
// Entity (internal, mutable, persistence-focused)
public class EncounterEntity
{
    public Guid Id { get; set; }
    // Null until the encounter has been signed
    public DateTime? SignedAt { get; set; }
    public string InternalNotes { get; set; } = string.Empty;
}

// DTO (public, stable, contract-focused)
public class EncounterDto
{
    public Guid Id { get; set; }
    public bool IsSigned { get; set; }
}
```
Security, Authorization, and Role-Based Access
Authorization in healthcare is complex. It is rarely a simple binary of "admin" vs. "user." You have doctors, nurses, auditors, billing specialists, and patients, all with overlapping permissions. This complexity cannot be delegated to the UI.
Scope: Design APIs around granular scopes and responsibilities, ensuring a nurse can view a chart but only a doctor can sign an order.
Context: Authorization logic must understand the context. A doctor may be allowed to see only the patients in their own ward.
Enforcement: Use .NET policies to enforce these restrictions at the controller or action level so that every request is checked.
Lessons Learned From Long-Term EMR Ownership
Looking back at years of EMR development, the cost of early shortcuts is evident. Every time we bypassed the API to hack a feature directly into the database or coupled the UI too tightly to the backend, we paid for it with interest later. The API-first approach drastically reduced risk during major platform changes.
When we needed to rewrite our entire frontend framework, the backend remained stable. We didn't have to reinvent our compliance logic because it was safely encapsulated behind our API contracts. If I started over, I would make contract design reviews stricter. Taking time to design the interface right is more important than coding speed.
Final Thoughts: Building EMRs That Outlast Trends
Technology trends fade. JavaScript frameworks rise and fall. But medical records must persist. An EMR system must survive multiple generations of UI rewrites and shifting regulatory landscapes. API-first design is the strategy for this longevity. It separates your system's volatile portions from its compliance-heavy core. Architects in this field must deliver features while maintaining system integrity over time. By investing in solid, well-designed APIs today, you help ensure your platform's longevity.
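The additive-versus-destructive distinction from the backward-compatibility discussion can be checked mechanically. The sketch below is written in Python and is an illustration under stated assumptions: contracts are modeled as plain dictionaries mapping field name to type name, and `diff_contract` and the sample field names are hypothetical. It is not part of ASP.NET Core or of any versioning library.

```python
def diff_contract(old_fields, new_fields):
    """Classify changes between two versions of a response contract.

    Both arguments map field name -> type name. Added fields are treated
    as additive (safe). Removed fields and changed types are treated as
    breaking. A rename shows up as one removal plus one addition.
    """
    added = sorted(set(new_fields) - set(old_fields))
    removed = sorted(set(old_fields) - set(new_fields))
    retyped = sorted(
        name
        for name in set(old_fields) & set(new_fields)
        if old_fields[name] != new_fields[name]
    )
    return {
        "additive": added,
        "breaking": removed + retyped,
        "compatible": not removed and not retyped,
    }


# Hypothetical v1 contract for an encounter resource.
v1 = {"id": "guid", "isSigned": "bool"}

# Adding a field keeps existing consumers working.
v1_plus = {"id": "guid", "isSigned": "bool", "signedBy": "string"}

# Renaming isSigned to signed breaks consumers that still read isSigned.
v1_renamed = {"id": "guid", "signed": "bool"}

safe = diff_contract(v1, v1_plus)
unsafe = diff_contract(v1, v1_renamed)

print(safe["compatible"], safe["additive"])
print(unsafe["compatible"], unsafe["breaking"])
```

A check along these lines could run in a build pipeline against the published contract, so that a breaking change forces a new versioned route instead of slipping into an existing one.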
In data analytics, the way tables are organized has a direct effect on how well a system performs and how easily it can grow. Open Table Formats have become a core part of how data systems are built today because they make data usable across many systems and tools. This post explains what Open Table Formats are, looks at some popular examples, and offers guidance on choosing the one that best fits your data analytics needs.
What Are Open Table Formats?
Open Table Formats are open specifications for keeping data organized in tables. No single vendor owns them, so they are designed to work with many tools and systems. The goal is to make data easy to share and use, so that teams can work together smoothly regardless of the engine or platform they run on.
Popular Open Table Formats
Here are some of the most widely used Open Table Formats:
Apache Iceberg
Apache Iceberg is a table format for working with large datasets in a controlled way. It provides ACID transactions, which guarantee that changes to a table are applied correctly. It also provides isolation, so you can read data without worrying about other users changing it at the same time.
Apache Iceberg allows for schema evolution, which means you can change how a table is organized without starting over. This makes it especially useful for teams that manage large datasets in data lakes.
Advantages: Ensures data consistency, supports queries, and enables schema evolution without data loss.
Use Cases: Ideal for data lakes requiring transactional guarantees and schema flexibility.
Delta Lake
Delta Lake is an open-source storage layer that makes data in a data lake reliable. It uses ACID transactions to keep data consistent when many users work with it at the same time, keeps track of extensive metadata about the data, and handles both streaming data and batch data in the same tables.
Advantages: Ensures data reliability, lets you go back and look at earlier versions of the data, and works well with the tools that process the data.
Use Cases: Suitable for data lakes requiring reliability, data versioning, and unified data processing.
Apache Hudi
Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development on data lakes. It helps you add new data to the data you already have, which is especially useful when a data lake holds a lot of data.
Advantages: Supports processing data a little at a time, keeps track of versions of the data, and makes it easy to ingest and query data.
Use Cases: Ideal for data lakes requiring incremental data processing and data pipeline management.
Choosing the Right Open Table Format
Picking an Open Table Format for your data analytics needs means weighing several factors: what you will use it for, what your data looks like, whether it works with the systems you already use, and how well it needs to perform. Here are the main things to consider:
Use Cases and Workloads
When you need safe transactions and consistent data, consider formats like Apache Iceberg or Delta Lake. Both provide ACID transactions, which guarantee that changes are applied completely and correctly.
If transactional guarantees are the priority, Apache Iceberg and Delta Lake are strong choices. For incremental data processing and data pipeline management, Apache Hudi is an option to consider.
Data Characteristics
Think about how much data you will have to store and process. Some formats handle large volumes of data better than others, and a poor fit becomes a real problem as data volume grows.
Data Complexity
Assess how complex your data is, including the types of data you have and whether you will need to change how the data is organized. Formats such as Apache Iceberg and Delta Lake are helpful here because they let you make such changes easily.
Ecosystem Compatibility
When you choose an Open Table Format, make sure it works well with the data processing frameworks and tools you already use. For example, Delta Lake works with Apache Spark.
The format should work smoothly with the tools you already have.
Cloud Platforms: Check whether the Open Table Format (OTF) is compatible with the cloud platform you want to use, and whether it also works with any on-premises infrastructure you have.
Performance Requirements
Query Performance: Evaluate how well the OTF handles queries for your analytical workloads. Test it against your own workloads to see whether it returns the results you need quickly enough; that is what decides whether the format suits the kind of work you do.
Data Ingestion: Evaluate how well data ingestion works, especially if you need real-time or near-real-time analytics. Check how ingestion behaves with large amounts of data and how fast it can process them.
Open Table Formats are an important part of working with data today because they make data easy to use across systems. If you understand what distinguishes Apache Iceberg, Delta Lake, and Apache Hudi, you can pick the format that best fits your organization. Consider what your data looks like, what you want to do with it, which tools you use with it, and how you expect your data needs to develop, and choose accordingly.
TL;DR A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine. The Problem We Kept Hitting We’ve been building Ingero — an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well. But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening. We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works? What We Shipped in v0.9.1 Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports. 1. Node Identity Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback: Shell sudo ingero trace --node gpu-node-01 Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) — no extra configuration needed. 2. Fleet Fan-Out Queries Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). 
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.
The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:
YAML
fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080
A full example config is in configs/ingero.yaml.
Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:
Shell
$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"
node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400   # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200
8 rows from 4 node(s)
Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
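The same comparison can be made programmatically once the rows are in hand. The following Python sketch takes the host-event averages (source 3) from the query output above and compares each node with the median of its peers. The `flag_stragglers` helper and the 3x threshold are assumptions made for illustration; they are not part of Ingero.

```python
from statistics import median

# avg_us for host events (source 3), copied from the query output above.
host_avg_us = {
    "gpu-node-01": 18400,
    "gpu-node-02": 2100,
    "gpu-node-03": 1900,
    "gpu-node-04": 2200,
}


def flag_stragglers(avg_by_node, threshold=3.0):
    """Return {node: ratio} for nodes at least `threshold` times slower than peers.

    Each node is compared with the median of the *other* nodes, so a single
    outlier does not inflate its own baseline.
    """
    flagged = {}
    for node, value in avg_by_node.items():
        peers = [v for n, v in avg_by_node.items() if n != node]
        ratio = value / median(peers)
        if ratio >= threshold:
            flagged[node] = round(ratio, 1)
    return flagged


stragglers = flag_stragglers(host_avg_us)
print(stragglers)  # {'gpu-node-01': 8.8}, the roughly 9x gap noted in the output
```

Comparing against the peer median rather than the overall mean keeps the baseline stable even when one node is far out of line, which is the situation a straggler hunt starts from.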
One more command to see the causal chains:
Shell
$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080
FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)
[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs
[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset
Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause is CPU contention from block I/O: checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”
3. Offline Merge and Perfetto Export
Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, and compliance constraints are all real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file:
Shell
# 1. Collect traces from each node
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db
# 2. Merge and analyze
ingero merge node-01.db node-02.db -o cluster.db
ingero explain -d cluster.db
Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.
For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
Why We Built It This Way The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path. We deliberately avoided that. No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet. Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity. Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself. Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query. Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. 
The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. 
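The client-side fan-out with first-class partial failure is small enough to sketch. Ingero's fleet client is written in Go; the Python version below only illustrates the pattern, and `query_node`, `fan_out`, the row shape, and the simulated failure are hypothetical stand-ins rather than Ingero's API. A real client would issue an HTTPS request where `query_node` fakes a response.

```python
from concurrent.futures import ThreadPoolExecutor


def query_node(node):
    """Stand-in for an HTTPS query to one node (hypothetical).

    Returns a list of result rows and raises if the node is unreachable.
    """
    if node == "gpu-node-03:8080":  # simulate one unreachable node
        raise ConnectionError("connection refused")
    return [{"source": 3, "cnt": 400}]


def fan_out(nodes, query_fn):
    """Query all nodes concurrently and merge rows with a node column.

    A failing node is recorded as a warning instead of aborting the query.
    """
    rows, warnings = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {node: pool.submit(query_fn, node) for node in nodes}
        for node, future in futures.items():
            try:
                for row in future.result(timeout=5):
                    rows.append({"node": node, **row})  # node column first
            except Exception as exc:
                warnings.append(f"{node}: {exc}")
    return rows, warnings


nodes = [
    "gpu-node-01:8080",
    "gpu-node-02:8080",
    "gpu-node-03:8080",
    "gpu-node-04:8080",
]
rows, warnings = fan_out(nodes, query_node)

print(f"{len(rows)} rows from {len(nodes) - len(warnings)} node(s)")
for warning in warnings:
    print("warning:", warning)
```

There is no coordinator and no shared state: each request is independent, so adding nodes only adds concurrent requests, and the list of nodes that failed comes back alongside the data.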
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:
sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)
GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design. If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together. Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
Related Reading
GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends
11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
GPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story
Database schema management is one of the most challenging aspects of modern DevOps practice. Liquibase provides an open-source, database-independent solution for tracking, versioning, and deploying database changes across environments. This comprehensive guide explores Liquibase's architecture, implementation patterns, and automation strategies for CI/CD pipelines, with practical examples for enterprise deployment scenarios.
What is Liquibase?
Applying database schema changes with traditional SQL scripts is mostly manual, error-prone, and very hard to track. These scripts typically lack version control, which makes it difficult to manage changes across every environment on the way to production. Liquibase solves these problems by providing an open-source tool that standardizes how developers define, version, and deploy schema changes using simple configuration files. It brings consistency to change management with built-in rollback, change tracking, and support for multiple database systems.
Core Concepts
Changelog: The master file containing all database changes, organized sequentially. It can be written in XML, YAML, JSON, or SQL format.
ChangeSet: An atomic unit of change with a unique identifier (id + author). Each changeset executes once and is tracked in the DATABASECHANGELOG table.
Preconditions: Conditional checks that must pass before changesets execute, ensuring safe deployments.
Contexts and Labels: Filtering mechanisms that control which changesets execute in specific environments (dev, certification/integration, prod).
Key Features
Liquibase supports over 30 databases, including Oracle, MySQL, PostgreSQL, SQL Server, and DB2. This allows teams to work across different environments without compatibility issues.
Changelogs integrate directly with version control tools like Git and SVN, so they are stored in the application code repository. This approach ensures that changes are tracked and are part of the development cycle.
The platform provides robust rollback capabilities.
This helps maintain system stability during updates.
All database changes are tracked using MD5 checksums, which prevent duplicate executions and detect unauthorized modifications. This ensures the integrity of database changes across environments.
The system has the ability to compare database schemas across multiple environments and generate changelogs. This helps keep environments in sync and reduces manual effort for engineering teams.
Liquibase Architecture
Figure 1: Liquibase's layered architecture showing how changelog files are processed through the core engine, abstracted to database-specific implementations, and tracked in specialized tables.
Implementing Liquibase
Basic Changelog Structure
A typical YAML changelog follows this structure:
YAML
databaseChangeLog:
  - changeSet:
      id: create-payment-table
      author: devops.team
      changes:
        - createTable:
            tableName: payment_transaction
            columns:
              - column:
                  name: id
                  type: varchar(50)
                  constraints:
                    primaryKey: true
                    nullable: false
              - column:
                  name: amount
                  type: decimal(15,2)
                  constraints:
                    nullable: false
              - column:
                  name: currency_code
                  type: char(3)
                  constraints:
                    nullable: false
              - column:
                  name: transaction_date
                  type: timestamp
                  defaultValueComputed: CURRENT_TIMESTAMP
              - column:
                  name: status
                  type: varchar(20)
                  constraints:
                    nullable: false
              - column:
                  name: created_at
                  type: timestamp
                  defaultValueComputed: CURRENT_TIMESTAMP
              - column:
                  name: updated_at
                  type: timestamp
      rollback:
        - dropTable:
            tableName: payment_transaction
Configuration Properties
Configure the database connection in a properties file, for example liquibase-dev.properties:
Properties files
# Development Environment
changeLogFile=db/changelog.yaml
url=jdbc:oracle:thin:@//localhost:1521/DEVDB
username=dev_user
password=dev_pass
contexts=dev,test
defaultSchemaName=DEV_SCHEMA
liquibase.dropFirst=true
liquibase.shouldRun=true
logLevel=DEBUG
Preconditions for Safe Deployment
Preconditions make sure the database state is correct before applying changes.
For example, checking whether a column exists before adding it prevents errors:
SQL
--liquibase formatted sql
--changeset devops.team:add-merchant-id
--preconditions onFail:MARK_RAN onError:HALT
--precondition-sql-check expectedResult:0 SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'payment_transaction' AND column_name = 'merchant_id'
ALTER TABLE payment_transaction ADD COLUMN merchant_id VARCHAR(50);
--rollback ALTER TABLE payment_transaction DROP COLUMN merchant_id;
Automating Database Deployments
Integrating Liquibase into CI/CD pipelines enables automated, consistent database deployments across all environments. Modern deployment strategies include Jenkins pipelines, GitLab CI/CD, and Kubernetes init containers.
CI/CD Pipeline Integration
Figure 2: Complete CI/CD workflow showing validation, staging deployment with automated testing, and production release with manual approval gates and automatic rollback on failure.
Jenkins Pipeline Example
A declarative Jenkins pipeline automates the entire deployment workflow:
Groovy
pipeline {
    agent any
    stages {
        stage('Validate') {
            steps { sh 'liquibase validate' }
        }
        stage('Deploy Staging') {
            steps { sh 'liquibase --contexts=staging update' }
        }
        stage('Deploy Production') {
            when { branch 'main' }
            steps {
                input 'Deploy to Production?'
                sh 'liquibase --contexts=prod update'
            }
        }
    }
    post {
        failure { sh 'liquibase rollbackCount 1' }
    }
}
Kubernetes Init Container Pattern
For cloud-native deployments, Liquibase runs as an init container, which ensures schema updates complete before application startup:
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      initContainers:
        - name: liquibase-migration
          image: liquibase/liquibase:4.25.0
          command:
            - liquibase
            - --url=jdbc:postgresql://$(DB_HOST):5432/$(DB_NAME)
            - --username=$(DB_USER)
            - --password=$(DB_PASSWORD)
            - update
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
      containers:
        - name: payment-service
          image: payment-service:latest
          ports:
            - containerPort: 8080
              protocol: TCP
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
Deployment Workflow Details
Figure 3: Detailed deployment workflow showing lock acquisition, MD5 validation, precondition checking, transaction management, and metrics emission throughout the deployment lifecycle.
Best Practices and Enterprise Patterns
Organizational Strategies
Use a master changelog that includes versioned sub-changelogs organized by release or sprint for better maintainability.
Keep changesets focused on single logical changes to enable granular rollback and easier troubleshooting.
Never modify executed changesets; create new ones for corrections. MD5 checksums enforce this automatically.
Always provide explicit rollback definitions, even for operations that can be reversed automatically, to keep deployments predictable.
Security and Compliance
Secret Management: Never commit credentials to version control (the application code repository).
Design the application to retrieve database credentials from HashiCorp Vault, AWS Secrets Manager, or Kubernetes Secrets.
Least Privilege Access: Never grant the Liquibase user DBA privileges. Grant it only the DDL permissions required for schema changes.
Audit Trail Integration: Enable database audit logging and integrate with SIEM systems for compliance requirements (SOC 2, PCI-DSS).
Mandatory Peer Review: Require a pull request review before any changelog modification is merged, so that a second pair of eyes checks every change.
Performance Optimization
Index Strategy: Create indexes after bulk data loads rather than before to minimize I/O during migration.
Database-Specific Optimization: Use native SQL with the dbms attribute for performance-critical operations (e.g., Oracle PARALLEL hints).
Lock Timeout Configuration: Set appropriate liquibase.lockWaitTime values to prevent indefinite waits in high-concurrency environments.
Testing Pyramid for Database Changes
Implement comprehensive testing at multiple levels:
Developers run liquibase update against local containers before committing changes.
Use Testcontainers to spin up ephemeral databases, apply migrations, and verify schema correctness.
Deploy to a production-like environment with realistic data volumes to identify performance issues.
Test rollback procedures in staging before production deployment to ensure recovery capability.
Monitoring and Observability
Effective monitoring of database deployments enables rapid issue detection and continuous improvement. Key metrics include deployment duration, lock contention events, checksum validation failures, and rollback frequency.
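The rule that executed changesets must never be modified, and the checksum validation failures it produces, can be illustrated with a small sketch. It assumes a simplified model: a checksum of each changeset's text is recorded on first execution and compared on every later run. The dictionary standing in for the DATABASECHANGELOG table, the function names, and the hashing of whitespace-normalized text are illustrative assumptions, not Liquibase's actual checksum algorithm.

```python
import hashlib

def checksum(changeset_text: str) -> str:
    """MD5 of the changeset text with whitespace collapsed (simplified model)."""
    normalized = " ".join(changeset_text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def apply_changesets(changesets, executed):
    """Run changesets that have not been executed yet.

    changesets: list of (changeset_id, text) in changelog order.
    executed:   dict of changeset_id -> recorded checksum
                (a stand-in for the tracking table).
    Returns the ids executed during this run.
    """
    ran = []
    for cs_id, text in changesets:
        current = checksum(text)
        recorded = executed.get(cs_id)
        if recorded is None:
            executed[cs_id] = current      # first execution: record the checksum
            ran.append(cs_id)
        elif recorded != current:
            # An already-executed changeset was edited: stop the deployment.
            raise ValueError(f"checksum mismatch for changeset {cs_id}")
        # Otherwise the changeset already ran unchanged, so it is skipped.
    return ran

tracking = {}
changelog = [("create-payment-table", "CREATE TABLE payment_transaction (id varchar(50))")]
print(apply_changesets(changelog, tracking))   # first run executes the changeset
print(apply_changesets(changelog, tracking))   # second run skips it
```

Editing the text of an executed changeset and running again raises the mismatch error, which is the event the checksum-failure metric is meant to count.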
Critical Metrics to Track
Deployment Duration: Track execution time per changeset to identify performance bottlenecks and optimize slow operations.
Lock Contention: Monitor the DATABASECHANGELOGLOCK table for concurrent deployment conflicts that block releases.
Checksum Failures: Alert on MD5 mismatches indicating unauthorized manual changes to executed changesets.
Rollback Rate: Track rollback events as key indicators of deployment quality and testing effectiveness.
Prometheus Integration Example
Export Liquibase metrics to Prometheus for visualization in Grafana dashboards:
Properties files
# Deployment success counter
liquibase_deployment_success_total{environment="prod"} 145
# Changeset execution duration histogram
liquibase_changeset_duration_seconds_bucket{id="1",database="prod",le="0.1"} 45
liquibase_changeset_duration_seconds_bucket{id="1",database="prod",le="0.5"} 120
liquibase_changeset_duration_seconds_bucket{id="1",database="prod",le="1.0"} 142
liquibase_changeset_duration_seconds_bucket{id="1",database="prod",le="+Inf"} 145
liquibase_changeset_duration_seconds_sum{id="1",database="prod"} 52.3
liquibase_changeset_duration_seconds_count{id="1",database="prod"} 145
# Rollback counter for failure tracking
liquibase_rollback_total{environment="prod"} 3
# Additional operational metrics
liquibase_deployment_failure_total{environment="prod",reason="validation_error"} 2
liquibase_deployment_failure_total{environment="prod",reason="timeout"} 1
# Deployment success rate (calculated metric)
# success_rate = liquibase_deployment_success_total / (liquibase_deployment_success_total + sum(liquibase_deployment_failure_total))
# Result: 145 / (145 + 3) = 98.0%
Conclusion
Liquibase provides enterprise-grade database change management with seamless DevOps integration. Treating schemas as code enables consistency, traceability, and automation for deployment pipelines.
For those organizations that are managing critical payment infrastructure, Liquibase's rollback capabilities, precondition validation, and change tracking provide essential safeguards. With CI/CD integration, observability tooling, and security controls, teams deploy database changes with the same confidence and velocity as application code.
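The success-rate arithmetic from the Prometheus example can be checked with a short script. This is a sketch only: it reuses the metric names and sample values from that example (which are themselves illustrative) and uses a deliberately naive line parser instead of a Prometheus client library.

```python
def parse_samples(text):
    """Parse 'name{labels} value' lines, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        name = name_and_labels.split("{", 1)[0]   # drop the label set
        samples.append((name, float(value)))
    return samples

def success_rate(text):
    """successes / (successes + failures), summing failures across all reasons."""
    totals = {}
    for name, value in parse_samples(text):
        totals[name] = totals.get(name, 0.0) + value
    ok = totals.get("liquibase_deployment_success_total", 0.0)
    failed = totals.get("liquibase_deployment_failure_total", 0.0)
    return ok / (ok + failed) if ok + failed else None

EXPOSITION = """\
# Deployment success counter
liquibase_deployment_success_total{environment="prod"} 145
liquibase_deployment_failure_total{environment="prod",reason="validation_error"} 2
liquibase_deployment_failure_total{environment="prod",reason="timeout"} 1
"""

print(f"{success_rate(EXPOSITION):.1%}")   # prints 98.0%
```

In a real setup the same ratio would normally be computed by the monitoring system from the scraped counters rather than by parsing text by hand.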
Modern API-led architectures are built for resilience. We add:
Retries for transient failures
Replication for durability
Autoscaling for elasticity
Circuit breakers for isolation
Each mechanism improves availability. Under stress, their interaction can bring the system down. Most enterprise outages aren’t caused by missing fault tolerance. They’re caused by unbounded fault-tolerance mechanisms reacting simultaneously. Let’s break down how this happens, and how to design bounded reliability instead.
1. Retry Storms: When Resilience Multiplies Traffic
Retries are meant to protect against temporary failures. But retries multiply load. This is a simplified version of what we often see in service-to-service retry logic:
Python
import time
import random

def downstream_service():
    # Simulate a backend that is slow roughly one call in three
    latency = random.choice([0.1, 0.2, 0.8])
    time.sleep(latency)
    if latency > 0.7:
        raise TimeoutError("Slow response")
    return "OK"

def call_with_retries(max_attempts=3):
    # Naive retry: immediate, no backoff, no awareness of system load
    for attempt in range(max_attempts):
        try:
            return downstream_service()
        except TimeoutError:
            print(f"Retry {attempt+1}")
    raise Exception("Failed after retries")
Under normal conditions: Works fine.
Under load:
Latency increases.
Timeouts trigger.
Each request retries 3 times.
Traffic triples.
Backend slows further.
More retries fire.
That’s a retry storm. Now imagine this inside an API-led architecture:
Gateway → Experience API → Process API → System APIs → ERP/DB
If each layer retries independently, load amplification becomes multiplicative. In one system I worked on, we saw a single downstream slowdown take out three upstream APIs within minutes because each layer had its own retry logic.
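The multiplicative effect is easy to quantify. A minimal sketch, assuming every layer makes the same maximum number of attempts and that every attempt fails (the worst case); the function name and the layer counts are illustrative:

```python
def worst_case_calls(attempts_per_layer: int, retrying_layers: int) -> int:
    """Calls that reach the bottom of the chain for one client request
    when every layer independently makes `attempts_per_layer` attempts."""
    return attempts_per_layer ** retrying_layers

# One to four stacked layers, each making up to 3 attempts.
for layers in range(1, 5):
    print(f"{layers} layer(s): {worst_case_calls(3, layers)} calls at the bottom")
```

With four retrying layers in front of the ERP/DB, as in the chain above, a single client request can turn into 81 backend calls in the worst case, which is why retry limits need to be set for the whole chain rather than per layer.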
Bounded Retry Pattern (Production-Safe)
Retries must be:
Limited
Backed off exponentially
Jittered
Disabled under system stress
Safer version:
Python
def call_with_bounded_retries(max_attempts=2, system_load=0.5):
    if system_load > 0.75:
        return None  # fail fast when under stress
    for attempt in range(max_attempts):
        try:
            return downstream_service()
        except TimeoutError:
            if attempt < max_attempts - 1:  # no point sleeping after the last attempt
                backoff = 0.2 * (2 ** attempt)  # exponential backoff
                time.sleep(backoff + random.uniform(0, 0.1))  # plus jitter
    return None
Key differences:
Retry ceiling reduced
Exponential backoff
Jitter prevents synchronized waves
Load-aware short-circuit
Retries should dampen instability, not amplify it.
2. Replication Fan-Out and Coordination Collapse
Replication improves durability. But synchronous replication increases coordination cost. Example:
Python
import time

def simulate_write():
    time.sleep(0.2)  # one replica acknowledging the write

def write_to_replicas(data, replicas=3):
    # Synchronous fan-out: latency grows with the number of replicas
    for _ in range(replicas):
        simulate_write()
Under surge traffic:
Write volume increases.
Each write fans out to 3 replicas.
Replica lag grows.
Clients retry writes.
Effective write load doubles.
Durability turned into a bottleneck. In enterprise integration systems (order processing, billing, reconciliation), this pattern causes throughput collapse, not because data was lost, but because coordination overwhelmed the system.
Tiered Durability Strategy
Not all writes need identical guarantees.
Python
def write(data, critical=True):
    if critical:
        write_to_replicas(data, replicas=3)
    else:
        write_to_replicas(data, replicas=1)
Separate:
Critical transactions → strong durability
Non-critical logs/events → reduced coordination
Reliability must be scoped, not maximized blindly.
3. Autoscaling Feedback Loops
Autoscaling reacts to traffic metrics. But traffic metrics may be artificial.
If retries inflate request counts:
Python
def autoscale(request_rate):
    if request_rate > 100:
        print("Scaling up")
Scaling triggers:
New instances initialize.
Initialization hits shared DB/cache.
Backend latency increases.
More timeouts occur.
Retry rate rises.
Autoscaling accelerated instability.
Safer Scaling Signals
Scale on:
Sustained demand (not spikes)
Latency distribution trends
Organic RPS (excluding retries)
Queue growth rate
Example:
Python
def autoscale_safe(request_rate, sustained_load):
    # Scale only when the higher rate has persisted, not on a momentary spike
    if sustained_load and request_rate > 120:
        print("Scaling safely")
Autoscaling should respond to organic demand, not retry amplification.
4. The Real Problem: Correlated Reactions
Retries respond to latency.
Replication responds to writes.
Autoscaling responds to traffic.
Circuit breakers respond to error rates.
Under stress, they react to the same signal. That correlation creates cascading failure. Distributed systems behave like feedback systems. Unbounded feedback loops destabilize them.
Real-World Scenario: Payment Reconciliation API
Consider a payment reconciliation service:
Gateway → Process API → Billing → ERP → Database
What happens during a minor ERP slowdown?
ERP latency increases to 700ms.
Billing times out at 500ms.
Billing retries 3 times.
Process API retries orchestration.
Gateway retries client request.
Autoscaling reacts to spike.
DB replication lag increases.
DLQ starts growing.
Within minutes, a small slowdown becomes a platform-wide incident. Root cause: unbounded reaction.
5. Guardrails for Bounded Reliability in API Systems
1. Retry Budgets
Effective Load = Incoming RPS × Retry Count
If RPS = 1,000 and retries = 3: Effective load = 3,000
Cap retries per request and per service.
2. Failure Classification
Not all errors are retriable.
Error Type      Retry?   Action
CONNECTIVITY    Yes      Bounded retry
TIMEOUT         Yes      Backoff
VALIDATION      No       Fail fast
AUTH            No       Alert
Blind retries are architectural debt.
3. Idempotency Enforcement
Retries without idempotency cause corruption.
Unsafe:
Python
import uuid

transaction_id = str(uuid.uuid4())  # a new ID on every attempt, so retries look like new transactions
Safe:
Python
# Reuse the caller-supplied ID so every retry maps to the same transaction
transaction_id = payload.get("transaction_id") or request.headers["correlation-id"]
Every retry must produce the same logical result.
4. DLQ With Observability
Track:
Retry percentage
Timeout frequency
DLQ growth velocity
P95 latency shifts
These are early warning signals. None of these controls are free. Reducing retries can increase error rates in some scenarios, and limiting replication can affect durability guarantees. The goal isn’t to eliminate these mechanisms, but to apply them intentionally based on system behavior.
5. Design for Stability, Not Perfection
The goal of distributed reliability isn’t maximum redundancy. It’s controlled degradation under stress. Bound retries. Scope replication. Dampen scaling reactions. Enforce idempotency. Monitor feedback loops. Minor latency should not become a cascading outage. Reliability is not about adding mechanisms. It’s about controlling how they interact.
Final Thoughts
Retry storms don’t start with catastrophic failure. They start with:
A small latency increase
A few timeouts
A handful of retries
Then fault-tolerance mechanisms react together.
Retries multiply traffic.
Replication increases coordination pressure.
Autoscaling amplifies backend load.
Within minutes, a minor slowdown becomes a cascading outage. Reliability in API-led distributed systems is not about adding more safety nets. It’s about bounding how those safety nets behave under stress.
Limit retries.
Classify failures.
Enforce idempotency.
Scale on sustained demand, not noise.
Monitor feedback loops before they spiral.
The difference between a resilient platform and a cascading failure often comes down to one thing: whether your reliability mechanisms are controlled or uncontrolled. Design for stability under stress. Not perfection under ideal conditions.
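The retry-budget guardrail can be sketched as a per-service counter that grants retries only while they stay below a fixed share of the requests seen so far. The class name, the 10% default, and the use of plain counters without a time window are assumptions made for illustration; a production version would typically track a sliding window.

```python
class RetryBudget:
    """Allow retries only up to a fixed percentage of observed requests."""

    def __init__(self, retry_percent: int = 10, min_retries: int = 1):
        self.retry_percent = retry_percent
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def can_retry(self) -> bool:
        """Spend one unit of budget if available; otherwise the caller fails fast."""
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False

budget = RetryBudget(retry_percent=10)
for _ in range(1000):
    budget.record_request()

# Even if every request wants to retry, only 10% of them are allowed to.
granted = sum(budget.can_retry() for _ in range(1000))
print(granted, "retries granted for 1000 requests")
```

Under this assumed 10% budget, 1,000 incoming requests produce at most 1,100 downstream calls, compared with 3,000 in the unbounded example from the retry-budget guardrail.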