Databases Resources

DZone's Featured Databases Resources

12 Factor Framework for Building Secure and Compliant Cloud Applications

By Josephine Eskaline Joyce

It began with a late-night alert. A critical cloud application, serving thousands of users, had just been flagged for a security violation. No "hack" had occurred; nothing was obviously broken. What appeared to be a minor misconfiguration had quietly exposed sensitive data. The system was still running. The business was still operating. But compliance? Already compromised.

The team scrambled. Was it an identity issue? A pipeline gap? A missing policy? Every layer seemed secure in isolation, but together, something had slipped through.

That night revealed a hard truth: security and compliance aren't features you add; they are properties you design into every layer of a cloud application. This is where a structured approach becomes essential: a way to think systematically about building applications that are not just scalable and observable but inherently secure and compliant by design. This blog explores a 12-factor security framework to do exactly that.

What Does "Secure and Compliant by Design" Mean?

"Secure and compliant by design" means that security and compliance are built into the foundation of a cloud application, not added later as patches, tools, or audit activities.

Traditionally, teams would:

- Build the application first
- Test functionality
- Add security checks before release
- Prepare compliance evidence only during audits

This approach creates gaps because security becomes reactive and compliance becomes periodic. "Secure and compliant by design" flips this model and introduces three key shifts:

- Shift left: Security and compliance should start early.
  - Secure coding practices
  - Dependency scanning in development
  - Policy checks in CI/CD pipelines
  - Outcome: Issues are prevented rather than fixed later.
- Continuous, not periodic: Compliance is no longer an annual or quarterly exercise.
  - Policies are enforced automatically
  - Systems are continuously validated
  - Drift is detected in real time
  - Outcome: You're always audit-ready.
- Embedded across layers: Security and compliance are enforced at every layer of the system.
  - Application layer: secure code, input validation
  - Infrastructure layer: hardened configurations
  - Identity layer: strict access controls
  - Runtime layer: monitoring and threat detection
  - Outcome: No single point of failure.

The 12 Factors Overview

Security and compliance are not a single layer; they are a system of interconnected controls surrounding and protecting the application at every stage.

The proposed 12 factors are organized across five architectural pillars:

| Category | Objective | Associated Factors |
| Application Foundations | Establish secure, consistent, and portable application design principles | Codebase, Dependencies, Configuration |
| Identity, Trust, and Security Controls | Protect identities, secrets, and trust boundaries across the application lifecycle | Credentials & Secrets Management, Identity and Access Control |
| Runtime and Delivery Architecture | Govern application packaging, deployment, and runtime execution behavior | Build–Release–Run, Processes, Port Binding |
| Observability, Governance, and Compliance | Enable monitoring, auditability, policy enforcement, and operational visibility | Logs, Admin Processes |
| Operational Resilience and Scalability | Improve elasticity, fault tolerance, and operational continuity | Concurrency, Disposability, Dev/Prod Parity |

The architecture diagram below shows the proposed structure of the 12 factors for secure and compliant cloud applications; the factors are grouped into five capability domains. Rather than functioning as isolated practices, these domains collectively establish a secure-by-design, resilient, scalable, and compliance-aware cloud-native architecture that supports both technical and business outcomes.

Note: Operational resilience is not represented by a single control but emerges from the combined implementation of incident response, observability, workload protection, and robust infrastructure practices.

Operationalizing the 12 Factors

Modern cloud applications cannot rely on siloed security controls or on compliance checks that only come into play late in the development process. Security and compliance should be built into the development lifecycle and applied consistently across architecture, deployment workflows, runtime environments, and operational processes.

The 12-factor framework organizes security and compliance practices into five interlinked layers: Application Foundation, Identity and Trust, Runtime and Delivery, Operational Resilience, and Observability and Governance. Each layer addresses a specific objective, and together they form a secure-by-design, compliant-by-default architecture.

Application Foundation

This layer builds the baseline structure and security posture of the application. It focuses on ensuring that application configurations, dependencies, and code artifacts remain consistent, reproducible, and externally managed.

Key considerations include:

- Externalizing configurations and secrets
- Managing dependencies through controlled mechanisms
- Maintaining immutable and version-controlled artifacts
- Standardizing application packaging and deployment patterns

A good foundation reduces configuration drift, minimizes hidden dependencies, and creates predictable application behavior across environments.

Identity and Trust

Identity becomes the primary security boundary in cloud-native systems where applications, services, and workloads communicate dynamically.

This layer focuses on:

- Strong workload and service identities
- Secure authentication and authorization mechanisms
- Principle of least privilege access
- Secret lifecycle and credential management

The objective is to establish trusted interactions between users, applications, services, and infrastructure resources.

Runtime and Delivery

Applications continuously evolve through deployment pipelines and operational updates. Secure runtime execution and delivery processes ensure that changes can be introduced without compromising reliability or compliance.

Key areas include:

- Secure CI/CD pipelines
- Immutable deployment patterns
- Controlled rollout strategies
- Container and workload security enforcement
- Policy-driven deployment validation

This layer enables rapid delivery while preserving operational safety.

Observability and Governance

Visibility and governance provide continuous assurance that systems operate within expected security and compliance boundaries.

This layer includes:

- Metrics, logs, and distributed tracing
- Continuous compliance monitoring
- Policy-as-Code enforcement
- Audit evidence collection
- Security posture assessment and reporting

Effective observability transforms operational signals into actionable insights while supporting governance requirements.

Operational Resilience

Security and compliance also depend on maintaining application availability and handling failures gracefully.

Important capabilities include:

- Self-healing mechanisms
- Controlled failure handling
- High availability strategies
- Backup and recovery procedures
- Automated incident response

Resilience mechanisms reduce operational risk and help maintain service continuity under adverse conditions.

These five layers build a comprehensive defense architecture where security, compliance, operational reliability, and governance are not discrete activities but rather integrated functions of the application.

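To make the idea of layered, policy-driven validation a little more concrete, here is a minimal sketch in plain Python. It is not part of the framework and does not use OPA or Kyverno; the descriptor fields (`image`, `iam_role`, `env`), the `secretref:` prefix, and the three rules are all hypothetical, chosen only to show how checks from the Application Foundation, Identity and Trust, and Runtime and Delivery layers might be evaluated together before a deployment is admitted.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical deployment descriptor: a plain dict, for illustration only.
Deployment = Dict[str, Any]
Rule = Callable[[Deployment], Optional[str]]

SECRET_HINTS = ("PASSWORD", "TOKEN", "SECRET")


def no_inline_secrets(d: Deployment) -> Optional[str]:
    """Application Foundation: secrets are referenced, never embedded."""
    for key, value in d.get("env", {}).items():
        looks_secret = any(hint in key.upper() for hint in SECRET_HINTS)
        if looks_secret and not str(value).startswith("secretref:"):
            return f"env var {key} embeds a secret value"
    return None


def least_privilege(d: Deployment) -> Optional[str]:
    """Identity and Trust: reject missing, admin, or wildcard roles."""
    role = str(d.get("iam_role", ""))
    if role in ("", "admin") or "*" in role:
        return f"role {role!r} is broader than the workload needs"
    return None


def pinned_image(d: Deployment) -> Optional[str]:
    """Runtime and Delivery: images must be pinned by digest."""
    if "@sha256:" not in str(d.get("image", "")):
        return "image is not pinned by digest"
    return None


RULES: List[Rule] = [no_inline_secrets, least_privilege, pinned_image]


def evaluate(d: Deployment) -> List[str]:
    """Run every rule; an empty list means the deployment is admitted."""
    violations = []
    for rule in RULES:
        message = rule(d)
        if message is not None:
            violations.append(message)
    return violations


risky = {
    "image": "registry.example/app:latest",
    "iam_role": "admin",
    "env": {"DB_PASSWORD": "hunter2"},
}
for violation in evaluate(risky):
    print("DENY:", violation)
```

Keeping each rule as a small function means a new control can be added without touching the others, which is the same property policy-as-code engines aim for at larger scale.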
The subsequent sections describe each of the twelve factors in detail and explain their practical implementation within cloud-native environments.

Architectural Anti-Patterns in Cloud-Native Security and Compliance

Although many organizations invest in cloud security tools and compliance frameworks, most failures cannot be attributed to technology. They stem from recurring anti-patterns: habits and decisions that unintentionally introduce risk. Understanding these pitfalls is key to developing systems that are truly secure and compliant by design. Below are some of the most common anti-patterns:

- Hard-coded secrets and configuration: Credentials, API keys, or environment-specific settings are embedded directly in the source code.
  Impact: Increased risk of credential exposure, security breaches, and configuration drift.
- Over-privileged access and shared identities: Users and services receive permissions beyond operational requirements.
  Impact: Expands the attack surface and increases the blast radius of compromised workloads.
- Security as a late-stage activity: Security validation occurs after development and deployment activities are completed.
  Impact: Delayed remediation, higher operational cost, and inconsistent policy enforcement.
- Mutable infrastructure and manual changes: Direct modifications are applied to running environments without controlled deployment processes.
  Impact: Creates configuration drift and reduces reproducibility.
- Limited observability and reactive monitoring: Insufficient metrics, logs, and traces limit operational visibility.
  Impact: Slower incident detection and longer recovery times.
- Siloed governance and compliance processes: Governance activities operate independently from engineering workflows.
  Impact: Compliance gaps, duplicated effort, and reduced delivery efficiency.
- Ignoring runtime security controls: Security controls focus only on build-time validation and neglect runtime monitoring.
  Impact: Undetected threats and reduced visibility into active workloads.
- Missing continuous feedback loops: Application metrics, security events, operational incidents, and compliance findings are not continuously integrated back into development and operational workflows.
  Impact: Repeated failures, delayed remediation, limited learning from incidents, and slower improvement of security and operational practices.

Aligning With Industry Standards

The framework aligns with global security and compliance standards. Rather than treating compliance as a distinct validation exercise, it embeds governance, access control, observability, and resilience practices directly into the software lifecycle. The table below shows how the 12-factor framework aligns with common industry security and compliance standards.

| Standard / Framework | Primary Focus | How the 12-Factor Framework Supports It |
| NIST Cybersecurity Framework | Identify, Protect, Detect, Respond, Recover | Supports policy enforcement, monitoring, identity controls, and resilience practices |
| SOC 2 | Security, availability, processing integrity | Improves auditability, access management, and operational monitoring |
| ISO 27001 | Information security management | Encourages risk-based controls, governance processes, and secure operational practices |
| CIS Benchmarks | Secure system and workload configuration | Reinforces secure configurations and standardized deployment practices |
| Zero Trust Architecture | Continuous verification and least privilege | Strengthens workload identity, authentication, and access controls |
| HITRUST | Security and compliance for regulated data | Enhances governance, audit controls, and protection of sensitive information |

Getting Started: A Practical Roadmap

Adopting a secure and compliant cloud application framework is not a one-time effort; it is a progressive journey. To succeed, treat it as a phased transformation with continuous improvement.

Phase 1: Assess and Baseline

Before implementing controls, it is critical to understand your current posture.

Focus areas:
- Inventory applications, services, and dependencies
- Evaluate current security practices across the lifecycle
- Identify gaps in identity, configuration, and observability
- Map existing controls to compliance requirements (e.g., SOC 2, ISO 27001)

Outcome:
- Clear visibility into risk exposure and compliance gaps
- A prioritized list of areas needing attention

Phase 2: Establish Secure Foundations

Build the baseline capabilities that enforce security by default.

Focus areas:
- Implement secure CI/CD pipelines with integrated scanning
- Centralize secrets management and eliminate hardcoded credentials
- Enforce least-privilege IAM policies
- Define secure configuration baselines (IaC templates, guardrails)

Outcome:
- Strong foundation layer aligned with the Application Foundation and Identity pillars
- Reduced risk from common vulnerabilities

Phase 3: Automate Security and Compliance

Manual processes do not scale in cloud environments; automation is essential.

Focus areas:
- Introduce policy-as-code (OPA, Kyverno)
- Enable continuous compliance monitoring
- Automate security checks in pipelines
- Detect and remediate configuration drift

Outcome:
- Shift from reactive to proactive enforcement
- Always-on compliance posture

Phase 4: Strengthen Runtime and Resilience

Once the foundation is secure, focus on protecting systems in production.

Focus areas:
- Implement runtime threat detection and workload protection
- Enable network segmentation and encryption (Zero Trust)
- Define incident response playbooks
- Build resilience mechanisms (failover, DR, fault tolerance)

Outcome:
- Systems that are not only secure, but also resilient to failure and attack

Phase 5: Enable Observability and Continuous Improvement

Security and compliance must evolve with the system.

Focus areas:
- Centralize logs, metrics, and traces
- Correlate observability data for threat detection
- Establish feedback loops from operations to development
- Continuously refine policies and controls

Outcome:
- A closed-loop system where insights drive ongoing improvement
- Faster detection, response, and optimization

Example Technology Enablers

| Layer | Capability | Example Tools |
| Application Foundation | Infrastructure as Code & Packaging | Terraform, Helm |
| Application Foundation | Source Control & Artifact Management | Git, Artifact Registry |
| Application Foundation | CI/CD & Pipeline Automation | Jenkins, GitHub Actions, Tekton, ArgoCD |
| Application Foundation | Supply Chain & Security Scanning | Snyk, Trivy, Dependabot |
| Application Foundation | Secrets Management | HashiCorp Vault, Kubernetes Secrets, IBM Cloud Secrets Manager |
| Identity & Trust | Identity & Access Management (IAM) | IAM platforms, Azure AD, IBM Cloud IAM |
| Identity & Trust | Workload Identity & Zero Trust | SPIFFE/SPIRE, Keycloak |
| Identity & Trust | Authentication & Authorization | OAuth/OIDC providers, Keycloak |
| Runtime & Delivery | Container & Workload Security | Falco, Prisma Cloud, Aqua |
| Runtime & Delivery | Deployment & Continuous Delivery | Jenkins, ArgoCD, Tekton |
| Runtime & Delivery | Network Security & Service Mesh | Istio, Linkerd, Service Mesh |
| Runtime & Delivery | Configuration & Posture Management | CSPM tools (Wiz, Prisma, AWS Config) |
| Observability & Governance | Metrics, Logs & Tracing | Prometheus, Grafana, OpenTelemetry, Instana |
| Observability & Governance | Policy Enforcement (Policy-as-Code) | OPA, Kyverno |
| Observability & Governance | Security & Compliance Monitoring | Splunk, ELK, Security & Compliance platforms |
| Operational Resilience | High Availability & Scaling | Kubernetes HPA |
| Operational Resilience | Disaster Recovery & Backup | Velero, IBM Cloud Backup and Recovery |
| Operational Resilience | Chaos Engineering & Testing | Chaos Monkey, Litmus |
| Operational Resilience | Incident Management | PagerDuty, Opsgenie |

Conclusion

Imagine two organizations adopting cloud-native technologies. One continuously responds to security vulnerabilities, operational problems, and compliance needs as they become apparent. The other incorporates security, resilience, and governance into its architecture from inception. Over time, the difference becomes clear. One struggles to keep up with change, while the other moves with confidence because security and compliance are no longer separate activities but inherent capabilities. The proposed 12-factor framework is ultimately about enabling this shift, moving from reactive controls toward secure-by-design and compliant-by-default cloud applications.

How to Build a Brand Monitoring Dashboard With SerpApi and Python

By Tomas Murua

Knowing what people say about your product usually means checking Google News, scrolling through YouTube, and digging into different social media threads. That's three tabs, three interfaces, and no way to compare what you find.

This tutorial builds a single dashboard that pulls brand mentions from all three sources using Python and SerpApi. By the end, you'll have a Streamlit app with three tabs: one for news articles, one for YouTube videos, and one for social media and forum discussions. We'll use "serpapi" as the search query, but you can swap in any brand or product name.

[Figure: Brand monitoring dashboard showing metrics row with total mentions, news articles, YouTube videos, and perspectives counts]

Set Up Your Environment

Requirements:

- Python 3.8+
- SerpApi API Key (the free plan includes 250+ searches/month)
- Dependencies (serpapi, pandas, streamlit, altair)

The serpapi package is the official Python SDK. It handles request signing, retries, and response parsing.

The complete code, including a Jupyter notebook version, is available in the SerpApi tutorials repository.

The Pipeline

The app follows the same three-step pattern from the GitHub Issues dashboard: fetch raw data, transform it, and display the analysis.

[Figure: Pipeline diagram showing three stages: fetch, transform, and display]

The difference this time is three separate engines running in parallel. Each returns a different response structure, so the transform step normalizes everything into DataFrames before the dashboard consumes it.

Fetch the Data

A single SerpApi client instance works for all three engines:

```python
import os

import serpapi

SERPAPI_KEY = os.environ.get("SERPAPI_KEY", "")
client = serpapi.Client(api_key=SERPAPI_KEY)
```

Google News

The Google News API returns articles through the news_results key. Each result includes title, link, source (a dict with name and icon), date, and snippet.

```python
def fetch_news(client, brand):
    """Fetch news articles mentioning the brand via Google News."""
    results = client.search({
        "engine": "google_news",
        "q": brand,
        "gl": "us",
        "hl": "en",
    })
    return results.get("news_results", [])
```

For more use cases with this engine, refer to the news monitoring.

YouTube

The YouTube Search API uses search_query instead of q, and the sp parameter controls time filters. The values EgIIAw%3D%3D (this week) and EgIIBA%3D%3D (this month) are YouTube's internal encoding for upload date filters. You can grab these from YouTube's URL bar after applying a filter manually.

We run both filters and deduplicate by link, since the month results include everything from the week:

```python
YT_FILTER_WEEK = "EgIIAw%3D%3D"
YT_FILTER_MONTH = "EgIIBA%3D%3D"


def fetch_youtube(client, brand):
    """Fetch YouTube videos, combining week and month filters."""
    seen = set()
    videos = []
    for sp_filter in (YT_FILTER_WEEK, YT_FILTER_MONTH):
        results = client.search({
            "engine": "youtube",
            "search_query": brand,
            "sp": sp_filter,
        })
        for video in results.get("video_results", []):
            link = video.get("link", "")
            # Skip videos already collected under the other filter
            if link and link not in seen:
                seen.add(link)
                videos.append(video)
    return videos
```

For more examples using the YouTube API, refer to this link.

Google Perspectives

The Google Perspectives API surfaces user-generated content from LinkedIn, Reddit, Quora, and blogs. It uses the standard Google engine, and the results appear under the perspectives key:

[Figure: SerpApi search with the Google perspective results]

```python
def fetch_perspectives(client, brand):
    """Fetch user-generated content (Reddit, LinkedIn, Quora)."""
    results = client.search({
        "engine": "google",
        "q": brand,
        "google_domain": "google.com",
    })
    return results.get("perspectives", [])
```

Fetch in Parallel

Three sequential API calls take roughly three seconds. Running them in parallel with Python's ThreadPoolExecutor brings that down to about one second.
Each call runs in its own thread while the others wait for their response:

```python
from concurrent.futures import ThreadPoolExecutor

import streamlit as st


@st.cache_data(ttl=300)
def fetch_all_mentions(brand):
    """Fetch all brand mentions from three engines in parallel."""
    client = serpapi.Client(api_key=SERPAPI_KEY)
    with ThreadPoolExecutor(max_workers=3) as pool:
        news_future = pool.submit(fetch_news, client, brand)
        yt_future = pool.submit(fetch_youtube, client, brand)
        persp_future = pool.submit(fetch_perspectives, client, brand)
        return news_future.result(), yt_future.result(), persp_future.result()
```

SerpApi also offers a server-side async parameter for large-scale batch processing, where you submit searches and retrieve results later. For our three concurrent calls, client-side threading is simpler and equally effective.

The @st.cache_data(ttl=300) decorator caches results for 5 minutes. Without it, every Streamlit interaction would re-trigger the API calls. This works alongside SerpApi's own 1-hour result cache, which serves identical queries from the cache at no extra search cost unless you explicitly pass no_cache=true. Together, these two layers minimize redundant API calls during development and testing.

For more optimization techniques when working with SerpApi at scale, refer to this blog.

Transform the Data

All three engines return dates as relative strings ("3 hours ago", "2 days ago"). We need a shared parser to convert them into datetime objects for sorting.

Parse Relative Dates

Two details are worth noting. The regex is compiled once and reused, since this function runs for every result in all three engines. And the fallback returns datetime.now(timezone.utc) instead of None, so results without a parseable date sort to the top rather than breaking pandas operations.

```python
import re
from datetime import datetime, timedelta, timezone

RELATIVE_DATE_RE = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago",
    re.IGNORECASE,
)

# Months and years are approximated as 30 and 365 days.
UNIT_TO_TIMEDELTA = {
    "second": lambda n: timedelta(seconds=n),
    "minute": lambda n: timedelta(minutes=n),
    "hour": lambda n: timedelta(hours=n),
    "day": lambda n: timedelta(days=n),
    "week": lambda n: timedelta(weeks=n),
    "month": lambda n: timedelta(days=n * 30),
    "year": lambda n: timedelta(days=n * 365),
}


def parse_relative_date(text):
    """Convert '3 hours ago' into a datetime object."""
    if not text:
        return datetime.now(timezone.utc)
    match = RELATIVE_DATE_RE.search(str(text))
    if not match:
        return datetime.now(timezone.utc)
    amount = int(match.group(1))
    unit = match.group(2).lower()
    delta = UNIT_TO_TIMEDELTA.get(unit, lambda n: timedelta())(amount)
    return datetime.now(timezone.utc) - delta
```

Build DataFrames

Each engine gets its own transformer. Here's the news version:

```python
def transform_news(results):
    """Convert raw Google News results into structured records."""
    records = []
    for item in results:
        source = item.get("source") or {}
        source_name = source.get("name", "Unknown") if isinstance(source, dict) else str(source)
        records.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "source": source_name,
            "date": parse_relative_date(item.get("date", "")),
            "snippet": item.get("snippet", ""),
        })
    return records
```

The source field can be a dict or a plain string depending on the result, so the isinstance check handles both.

YouTube and Perspectives follow the same pattern, with two differences worth highlighting.
YouTube views come back as strings like "1,234 views", so we strip non-numeric characters before converting:

```python
views = item.get("views") or 0
if isinstance(views, str):
    views = int(re.sub(r"[^\d]", "", views) or 0)
```

Build the Dashboard

The Streamlit interface starts with a form for the brand query and a row of summary metrics across all three sources:

```python
st.set_page_config(page_title="Brand Monitoring Dashboard", layout="wide")
st.title("Brand Monitoring Dashboard")

with st.form("brand_form"):
    brand = st.text_input("Brand or keyword to monitor", value="serpapi")
    submitted = st.form_submit_button("Search")
```

[Figure: Brand or keyword selector to monitor]

After fetching, the dashboard shows four metrics at the top for a quick overview, then splits into three tabs:

```python
col1, col2, col3, col4 = st.columns(4)
col1.metric("Total Mentions", total_mentions)
col2.metric("News Articles", len(news_records))
col3.metric("YouTube Videos", len(yt_records))
col4.metric("Perspectives", len(persp_records))
```

[Figure: Dashboard metrics row displaying total mentions across three sources]

News Tab

The News tab pairs an Altair bar chart of top sources with a sortable table. Altair ships with Streamlit, so there's nothing extra to install. We use it instead of st.bar_chart because it gives control over orientation, tooltips, and styling.

```python
source_df = news_df["source"].value_counts().head(10).reset_index()
source_df.columns = ["source", "count"]

source_chart = alt.Chart(source_df).mark_bar(
    cornerRadiusTopRight=4, cornerRadiusBottomRight=4
).encode(
    x=alt.X("count:Q", title="Articles"),
    y=alt.Y("source:N", sort="-x", title=""),
    color=alt.value("#4A90D9"),
    tooltip=["source:N", "count:Q"],
).properties(height=350)

st.altair_chart(source_chart, use_container_width=True)
```

[Figure: News tab with horizontal bar chart of top sources and sortable article table]

The table uses st.column_config.LinkColumn so each article title links directly to its source.

YouTube Tab

The YouTube tab shows views by channel and a sorted video table. The chart sums views per channel to surface which creators draw the most views when talking about the brand.

```python
channel_df = yt_df.groupby("channel")["views"].sum().reset_index()
channel_df = channel_df.sort_values("views", ascending=False).head(10)

channel_chart = alt.Chart(channel_df).mark_bar(
    cornerRadiusTopRight=4, cornerRadiusBottomRight=4
).encode(
    x=alt.X("views:Q", title="Views", axis=alt.Axis(format="~s")),
    y=alt.Y("channel:N", sort="-x", title=""),
    color=alt.value("#4A90D9"),
    tooltip=["channel:N", alt.Tooltip("views:Q", format=",")],
).properties(height=350)
```

[Figure: YouTube tab showing views by channel chart and video table]

Perspectives Tab

The Perspectives tab splits the layout between a discussion table on the left and a donut chart of mentions by platform on the right. The donut chart makes it easy to see where conversations happen, whether it's LinkedIn, Reddit, X, etc.

```python
platform_chart = alt.Chart(platform_df).mark_arc(
    innerRadius=60, outerRadius=120
).encode(
    theta=alt.Theta("count:Q"),
    color=alt.Color("source:N", legend=alt.Legend(title="Platform")),
    tooltip=["source:N", "count:Q"],
).properties(height=350)
```

[Figure: Perspectives tab with discussions table on the left and donut chart of mentions by platform on the right]

When to Use This Approach

Ideal for:

- Tracking brand mentions across news, video, and social in one view
- Monitoring product launches, PR campaigns, or competitor names
- Building internal dashboards for marketing or DevRel teams

Not recommended for:

- Real-time alerting. The API returns a snapshot, not a stream. For notifications, schedule the script on an interval and compare results.
- Historical analysis. Each engine returns recent results, not a complete archive.

If you want to explore the API response before writing code, the SerpApi Playground lets you test any engine interactively. And if you only need news coverage, the Google News API alone handles most brand monitoring use cases.

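The interval-and-compare approach mentioned above for notifications can be sketched roughly as follows. This is a minimal sketch rather than part of the tutorial's code: `new_mentions` and the in-memory `seen` set are hypothetical names, and it assumes each record carries the `link` key produced by the transformers earlier. The two lists stand in for two scheduled runs of the pipeline.

```python
from typing import Dict, Iterable, List, Set


def new_mentions(records: Iterable[Dict], seen_links: Set[str]) -> List[Dict]:
    """Return records whose link has not been seen yet, and mark them as seen."""
    fresh = []
    for record in records:
        link = record.get("link", "")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append(record)
    return fresh


# Simulated results from two consecutive runs (hypothetical data).
seen: Set[str] = set()
first_run = [{"title": "A", "link": "https://example.com/a"}]
second_run = first_run + [{"title": "B", "link": "https://example.com/b"}]

new_mentions(first_run, seen)            # baseline run: everything is new
alerts = new_mentions(second_run, seen)  # later run: only unseen links remain

for mention in alerts:
    print(f"New mention: {mention['title']} ({mention['link']})")
```

In a real scheduled job the set of seen links would need to survive between runs, for example in a file or a small database, since an in-memory set is lost when the script exits.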
Where to Go from Here

This dashboard gives you a live snapshot. The natural next step is turning it into a historical record. Store each fetch in a database (SQLite, PostgreSQL, or even a CSV), and you can compare mention volume week over week, track which sources cover your brand consistently, and spot trends that a single snapshot can't show.

With historical data in place, you can layer on more analysis. Identify content gaps by looking at which topics competitors get coverage on but you don't. Track which YouTube channels mention your product and how their view counts trend over time. Flag new platforms or authors that start discussing your brand.

The data is yours to work with however fits your needs. The three engines give you the raw material; what you build on top depends on the questions you're trying to answer.

Conclusion

The full application is about 350 lines in a single Python file. Three API calls, three DataFrames, three tabs. The query input at the top lets you switch brands without changing the code.

What started as a way to check where "serpapi" shows up on the web became a tool that surfaces patterns you would miss when checking manually. The Perspectives tab pulls in LinkedIn posts, Reddit threads, and Quora answers that don't appear in regular news or video searches, and combining them in one view gives you the full picture.

Check out the full SerpApi article collection here.

GraphRAG in Practice Using Spring AI, Neo4j, and Goodreads Data

By Akmal Chaudhri

AWS Glue ETL Design Principles for Production PySpark Pipelines

By Janani Annur Thiruvengadam

API Facade vs. Orchestration vs. Eventing, Now With AI in the Loop

By Jubin Abhishek Soni

From Gherkin to Source Code Without Losing the Business Language

Picture this: you are a software developer building an education platform, and you receive from the product owner some requirements written in business language (Gherkin). You need to implement these scenarios in Python.

You will probably start by creating models and service modules. You will create some classes to represent the entities described in the scenarios, like Student, Course, and Subject. You will add conditionals and loops in the entity classes to control the business logic and restrict paths in the code:

```python
# Enroll a student in a course
if course.status == "active" and student.course is None:
    student.course = course
else:
    # Simplified: the same error covers every failed precondition
    raise BusinessError("Student already in a course")
```

You will also create a class to represent the persistence layer (database), with methods like list_students, get_course_by_name, and create_student to add, delete, update, and return data from the database. You will probably create facades to group the classes in a logical sequence and add more ifs, elses, and loops to control the code flow. At the end of the sprint, you have a scenario implemented and tested.

There is nothing wrong with this style of implementation. It is a common process. However, something loses importance along the way: the business scenario itself.

In this article, I'll showcase a behavior-driven development approach that converts business language directly into executable code. The intention is to keep the implementation closer to the business language and promote the code to the source of truth.

Gherkin Scenarios for the Education Platform

Back to our (not so) fictional story.
Here are some scenarios an education platform could have:

```gherkin
Feature: Student GPA and approval

  Scenario: Student is approved when GPA is 7 or higher and all subjects are passed
    Given a student named "John" is enrolled in the "Computer Science" course
    And the course has the subjects "Math", "Physics", and "Programming"
    And the student has the following grades:
      | Subject     | Grade |
      | Math        | 7     |
      | Physics     | 8     |
      | Programming | 9     |
    When the system calculates the student's GPA
    Then the GPA should be 8
    And the student status should be "Approved"

Feature: Student enrollment in course subjects

  Scenario: Student cannot enroll in a subject from another course
    Given a student named "Carlos" is enrolled in the "Medicine" course
    And the subject "Algorithms" belongs to the "Computer Science" course
    When the student tries to enroll in the subject "Algorithms"
    Then the enrollment should be rejected
    And the system should show the message "Students can only enroll in subjects from their own course"

Feature: Student enrollment in a course

  Scenario: Student enrolls in an active course
    Given a course named "Architecture" is active
    When a student named "Julia" tries to enroll in the "Architecture" course
    Then the enrollment should be accepted
    And the system should show the message "Student enrolled in course"

Feature: Course cancellation

  Scenario: Students cannot enroll in a canceled course
    Given a course named "Architecture" has been canceled by the general coordinator
    When a student named "Julia" tries to enroll in the "Architecture" course
    Then the enrollment should be rejected
    And the system should show the message "Canceled courses cannot accept new enrollments"
```

They are readable and easy to understand, and they make inconsistencies easy to spot.

Now for a possible implementation along the lines of the story above, simplified for the sake of this article. Let us look at a more traditional implementation.
Python

    # Entities
    class Course:
        def __init__(self, course_id, name):
            self.course_id = course_id
            self.name = name
            self.is_canceled = False


    class Student:
        def __init__(self, student_id, name):
            self.student_id = student_id
            self.name = name
            self.course = None


    # Application
    class UniversityService:
        def __init__(self):
            self.courses = {}
            self.students = {}

        def create_course(self, course_id, name):
            self.courses[course_id] = Course(course_id, name)

        def create_student(self, student_id, name):
            self.students[student_id] = Student(student_id, name)

        def cancel_course(self, course_id):
            course = self.courses.get(course_id)
            if course is None:
                raise ValueError("Course not found")
            course.is_canceled = True

        def enroll_student_in_course(self, student_id, course_id):
            student = self.students.get(student_id)
            course = self.courses.get(course_id)
            # Preconditions and business rules are checked inline
            if student is None:
                raise ValueError("Student not found")
            if course is None:
                raise ValueError("Course not found")
            if course.is_canceled:
                raise ValueError("Canceled courses cannot accept new enrollments")
            student.course = course


    # Scenario: Students cannot enroll in a canceled course
    service = UniversityService()
    service.create_course("C1", "Architecture")
    service.create_student("S1", "Julia")
    service.cancel_course("C1")
    try:
        service.enroll_student_in_course("S1", "C1")
        print("Unexpected result: student enrolled in a canceled course")
    except ValueError as e:
        print(e)

It was done in a traditional style. Notice the technical references like service, and the preconditions and business logic spread across many ifs in the code. We forgot to represent the system behavior in a simple and explicit way. The scenario was split into many pieces, and it may be hard to put them back together when we need to understand the code in the future. As more features are integrated into the code, more if/else statements will be introduced to control the business logic and new flows.
In summary, the scenario cannot be read as it was presented by the business team. It is hard to validate that the system does what it should without proper unit tests and careful code review. We can try to test the integration with Python Behave to bring the explicit behavior back into the game, but it is hard to do so without running into technical details like services. The system works, but it is hard to prove that it behaves as expected just by reading the code. At this point, the development team and the business team no longer speak the same language. There is a translation from business language to production code (technical stuff).

Behavior-Driven Development

Now, let us use the Guará framework to represent the scenarios directly in the code. The code now tells the story. For example, the scenario from the feature "Student enrollment in a course" can be written like this:

Python

    from guara.application import Application
    from guara import it

    eduapp = Application()

    # course_id and student_id are assumed to be provided by the caller
    (
        eduapp.given(IsActiveCourse, course_id=course_id)
        .and_(IsNotStudentInACourse, student_id=student_id)
        .when(
            EnrollStudentInCourse,
            student_id=student_id,
            course_id=course_id,
        )
        .then(it.IsEqualTo, "Student enrolled in course")
    )

The preconditions IsActiveCourse and IsNotStudentInACourse are now explicit and sit at a higher level of the code, not buried inside methods as if conditionals. The precondition and action classes have single responsibilities.
Python

    from guara.transaction import AbstractTransaction

    # database, CourseCanceledException, and StudentException are assumed
    # to be defined elsewhere in the application.


    class IsActiveCourse(AbstractTransaction):
        def do(self, course_id):
            print(f"Checking the status of course {course_id}")
            status = database.courses.get_status(course_id=course_id)
            if status == "Active":
                return True
            raise CourseCanceledException("Course canceled")


    class IsNotStudentInACourse(AbstractTransaction):
        def do(self, student_id):
            print(f"Checking if student {student_id} is in a course")
            course = database.student.get_course(student_id=student_id)
            if course:
                raise StudentException("Student already in a course")


    class EnrollStudentInCourse(AbstractTransaction):
        def do(self, student_id, course_id):
            print(f"Enrolling student {student_id} in course {course_id}")
            database.enroll_course(course_id, student_id)
            return "Student enrolled in course"

In the end, it is easier to compare the code against the scenario steps and assert they are present in the code.

Python

    import argparse

    from guara.transaction import Application
    from guara import it

    # HasCourse, HasStudent, and IsNotStudentEnrolledInCourse are assumed to be
    # transactions defined like the ones above.
    eduapp = Application()


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--action", required=True)
        parser.add_argument("--student-id")
        parser.add_argument("--course-id")
        args = parser.parse_args()

        if args.action == "enroll_course":
            try:
                (
                    eduapp.given(HasCourse, course_id=args.course_id)
                    .and_(IsActiveCourse, course_id=args.course_id)
                    .and_(HasStudent, student_id=args.student_id)
                    .and_(IsNotStudentEnrolledInCourse, student_id=args.student_id)
                    .when(
                        EnrollStudentInCourse,
                        student_id=args.student_id,
                        course_id=args.course_id,
                    )
                    .asserts(it.IsTrue)
                )
            except Exception as e:
                print(str(e))
                # Roll back whatever was already executed
                eduapp.undo()


    if __name__ == "__main__":
        main()

    # Calling the CLI
    # python edu.py --action enroll_course --course-id 10 --student-id 1324

Benefits

- The production code is now the source of truth
- It can be compared directly to the business scenarios
- The responsibilities are encapsulated in dedicated classes
- Operations can be undone easily, since the framework is based on the Command Pattern (GoF)
- It is easy to add more behavior to the code without changing other classes
- The classes are reusable
- It hides the technical stuff. It still exists, but now the actions are first-class citizens

Points of attention

- It is not a one-size-fits-all style. It is necessary to evaluate whether the system under development will benefit from this code style
- It makes more sense when the scenarios are defined in Gherkin; otherwise, the requirements will need to be translated to code, as in the traditional implementation

Conclusion

The important difference is that the source code still reads almost like the original Gherkin scenario. Instead of hiding business rules inside technical layers, we keep them visible and explicit in the code.
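To see the mechanics behind this style without installing anything, the following is a minimal, dependency-free sketch of a given/when/then chain built on the Command pattern. It is not Guará's implementation or API: MiniApp, Transaction, the in-memory COURSES and ENROLLMENTS dictionaries, and the enroll helper are all hypothetical names invented for illustration.

```python
class Transaction:
    """Base class for a single step (a command). All names here are hypothetical."""

    def do(self, **kwargs):
        raise NotImplementedError

    def undo(self):
        """Optional rollback; steps that change nothing inherit this no-op."""


class MiniApp:
    """Tiny fluent runner: given/and_/when execute steps, then checks the result."""

    def __init__(self):
        self._executed = []  # steps run so far, used by undo()
        self.result = None   # value returned by the last step

    def _run(self, step_cls, **kwargs):
        step = step_cls()
        self.result = step.do(**kwargs)
        self._executed.append(step)
        return self

    # The three verbs share one implementation; only the name differs.
    given = and_ = when = _run

    def then(self, expected):
        assert self.result == expected, f"{self.result!r} != {expected!r}"
        return self

    def undo(self):
        # Roll back in reverse order of execution.
        while self._executed:
            self._executed.pop().undo()


# Hypothetical in-memory stand-ins for the persistence layer.
COURSES = {"C1": "active"}  # course_id -> status
ENROLLMENTS = {}            # student_id -> course_id


class IsActiveCourse(Transaction):
    def do(self, course_id):
        if COURSES.get(course_id) != "active":
            raise ValueError("Canceled courses cannot accept new enrollments")
        return True


class IsNotStudentInACourse(Transaction):
    def do(self, student_id):
        if student_id in ENROLLMENTS:
            raise ValueError("Student already in a course")
        return True


class EnrollStudentInCourse(Transaction):
    def do(self, student_id, course_id):
        self._student_id = student_id
        ENROLLMENTS[student_id] = course_id
        return "Student enrolled in course"

    def undo(self):
        ENROLLMENTS.pop(self._student_id, None)


def enroll(student_id, course_id):
    """Run the scenario and return the message the system should show."""
    app = MiniApp()
    try:
        (
            app.given(IsActiveCourse, course_id=course_id)
            .and_(IsNotStudentInACourse, student_id=student_id)
            .when(EnrollStudentInCourse, student_id=student_id, course_id=course_id)
            .then("Student enrolled in course")
        )
        return app.result
    except ValueError as error:
        app.undo()
        return str(error)
```

Calling enroll("S1", "C1") walks the chain in the same order as the Gherkin steps. A failing precondition raises before the action runs, and undo() rolls back whatever had already executed.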

By Douglas Cardoso

Machine Identity Debt: Why Human Identity Is No Longer Cloud Security's Primary Boundary

Cloud-native systems now create far more machine identities than human ones. Security strategies built around workforce identity are no longer sufficient. Here's what engineering leaders should build instead. The Breach That Didn't Need a Password On August 8, 2025, a threat actor now tracked by Google's Threat Intelligence Group as UNC6395 began quietly moving through the Salesforce instances of hundreds of companies. No phishing email landed in an inbox that day. No password was cracked. No multi-factor prompt was bypassed with a fatigue attack. The attacker simply had something better than a password: a valid OAuth token, stolen months earlier from Salesloft's GitHub account, that let it impersonate the Drift chatbot integration and act with all the trust that integration had been granted. Over the following ten days, the group ran automated Salesforce Object Query Language searches against more than 700 organizations — Cloudflare, Zscaler, Palo Alto Networks, and PagerDuty among them — harvesting account records, support case text, and, crucially, the AWS keys and Snowflake tokens that customers had pasted into support tickets months earlier. Google's investigation later found the same stolen tokens had reached into Google Workspace mailboxes too. Cory Michal, CSO at AppOmni, put his finger on what made the campaign notable: it wasn't a single lucky break but a methodical operation against hundreds of tenants using nothing but credentials the tenants themselves had issued to a vendor they trusted. That's the detail worth sitting with. Every access control that companies had built around human identity — MFA, conditional access, session monitoring, SSO — was irrelevant to this attack, because no human ever logged in. The identity that mattered was a machine's, and almost nobody was watching it the way they watch people. This wasn't an isolated case. 
In the same twelve months, a compromised API key issued to a DOGE staffer gave a stranger standing access to more than 50 large language models at xAI, and it stayed active for days after the exposure was discovered, according to reporting from KrebsOnSecurity. A supply-chain attack against the widely used tj-actions/changed-files GitHub Action, relied on by over 23,000 repositories, scraped AWS keys, GitHub tokens, npm credentials, and private RSA keys directly out of CI/CD workflow logs. GitGuardian's 2026 State of Secrets Sprawl report counted 28.65 million new hardcoded secrets pushed to public GitHub repositories in 2025 alone (a 34% jump year over year) and found that AI-assisted commits leak secrets at roughly twice the baseline rate of human-written ones. None of these incidents required a zero-day. They required an organization to have created a machine identity, granted it access, and then stopped paying attention to it. That is now the default failure mode of cloud security, and it's a failure mode that identity programs built for humans were never designed to catch.

Section 1: Identity Has Already Changed Underneath Us

For most of the last two decades, "identity and access management" meant managing people: employees, contractors, customers. A person logged in, proved who they were, and was granted access based on their role. The infrastructure existed to serve human judgment. That model quietly stopped matching reality. In a modern cloud environment, the majority of authentication events aren't between a person and a system; they're between systems. A pod in a Kubernetes cluster calls another pod. A CI/CD pipeline authenticates to a cloud provider to deploy an artifact. A SaaS integration holds an OAuth token that lets it act on a company's behalf indefinitely.
Each of these is an identity in every meaningful sense — it can be granted permissions, it can be revoked, it can be stolen — but almost none of them are managed with the rigor applied to a human employee's badge. The mechanisms behind this shift are now familiar to anyone running production infrastructure: Kubernetes service accounts that authenticate workloads to the API server, workload identity federation that lets a pod assume a cloud IAM role without a stored credential, SPIFFE and SPIRE issuing cryptographically verifiable identities to workloads at runtime, OAuth client-credential grants powering service-to-service calls, and service meshes like Istio wrapping every internal request in mutual TLS. Layer on top of that the identities created by CI/CD systems, and it becomes clear that a mid-sized cloud environment can easily contain ten or twenty machine identities for every human one. Security researchers at IDMWorks, reviewing the identity breaches of the last three years for their 2026 NHI Reality Report, described the pattern bluntly: these attacks succeeded through poor governance, not sophisticated malware. There was no payload to detect — just valid credentials doing exactly what valid credentials are allowed to do. That's a much harder thing to catch than a virus, because there's nothing anomalous about the code path. The only thing that's wrong is which entity is walking it. Section 2: Why the Existing Security Model Fails Here Identity and access management built for people assumes a handful of things that simply don't hold for machines. It assumes credentials are issued to a known, accountable owner. It assumes a login event is rare enough to be worth alerting on. It assumes a compromised credential will eventually show up in unusual behavior — an impossible-travel alert, an after-hours login, a new device. None of that transfers cleanly to a service account. 
IDMWorks' research is direct about the resulting blind spot: a service account that authenticates ten thousand times a day isn't behaving anomalously — that's just Tuesday. Detecting misuse requires knowing what a credential is supposed to be doing well enough to notice a deviation, and almost no organization has that baseline built for its non-human identities the way it does for its people. The ownership problem compounds this. Aembit's running catalog of non-human identity breaches documents a 2025 flaw in a major identity provider that let anyone holding a valid API key enumerate every OIDC application in a tenant and pull its client secrets — a bug that, if exploited, would have let an attacker impersonate entire applications and move laterally across an organization's stack. It was responsibly disclosed and patched, but it illustrates how identity providers themselves can become breach multipliers the moment a machine credential leaks. Then there's lifespan. GitGuardian's research, cited in Snyk's 2026 analysis of the secrets sprawl problem, found that private repositories are six times more likely to contain hardcoded secrets than public ones — largely because private repos get cloned, forked, and handed to contractors without anyone revisiting what's inside them. And because git's data model is append-only, a secret committed and later deleted in a follow-up commit should still be treated as exposed; it lives on in history whether or not it's still visible in the latest diff. The legal exposure is no longer theoretical, either. In United States v. Sullivan, Uber's former Chief Security Officer was criminally convicted of obstruction of justice for concealing a 2016 breach that began with hardcoded AWS credentials sitting in a GitHub repository — credentials that let attackers pull data on 57 million riders and drivers. 
The Ninth Circuit's 2025 ruling upheld that conviction, establishing that executives can face personal criminal liability for how they respond to a credential-based breach, not just for the breach itself. That should recalibrate how seriously engineering leadership treats "just another leaked API key."

An Honest Name for the Problem: Machine Identity Debt

Engineering teams already have a vocabulary for the gap between "shipped quickly" and "built correctly": they call it technical debt. There's no equivalent term for the identical pattern happening in identity, so let me propose one: machine identity debt. Technical debt accumulates in code: shortcuts taken under deadline pressure that someone eventually has to pay down. Identity debt accumulates in trust: every API key issued and never revisited, every OAuth grant approved by someone who's since left the company, every IAM role created with "just give it admin, we'll fix it later" and never fixed. None of it shows up in a sprint retro. None of it fails a build. It just sits there, compounding, until an attacker finds it and collects the interest all at once. That is close to a literal description of what happened to the 700-plus organizations caught in the Salesloft Drift breach, where OAuth grants approved months or years earlier turned out to still carry far more reach than anyone had tracked. A rough way to think about what's accumulating:

Plain Text

    Machine Identity Debt ≈
        long-lived credentials with no expiration policy
      + service accounts no longer tied to an active workload
      + OAuth grants no one has reviewed since approval
      + secrets discovered in tickets, chat, and docs rather than a vault
      + IAM roles scoped broader than the task requires
      + any machine identity with no accountable human owner

This isn't a precise formula you can drop into a dashboard query today. Treat it as a checklist for a conversation, not a KPI.
But naming each line item is useful, because each one is independently measurable, and most organizations have never measured any of them. When enough of this debt accumulates that nobody can produce an accurate answer to "what machine identities exist, who owns them, and what can they reach" — that's not an IAM maturity gap anymore. It's identity bankruptcy: the point where inventory, ownership, and trust have diverged so far from reality that incremental cleanup stops being realistic and the organization needs a forced reconciliation, usually triggered by an incident rather than a planning cycle. The mechanism that gets organizations there is worth naming too. Every new SaaS integration, every GitHub Action, every Terraform module, every AI agent granted API access mints a new unit of trust — a new thing the organization implicitly promises to govern. Nobody budgets for governing it; the integration just gets approved because it unblocks a project. Multiply that across a growing stack and you get something like trust inflation: the total quantity of trust an organization has extended growing faster than its ability to actually track or revoke any single unit of it. Eventually a credential's nominal access — what the ticket said it was for — and its real access — everything it can actually still reach — drift far enough apart that the gap itself becomes the attack surface. None of these terms are industry standard — I'm proposing them here because the pattern needed a name and didn't have one. Judge them by whether they make the problem easier to talk about, not by whether you've heard them before. Section 3: A New Boundary — Adaptive Machine Trust Architecture If the perimeter used to be defined by "who logged in," it now has to be defined by a different question: can this specific workload be trusted, right now, to do the specific thing it's asking to do? That's a shift from identity as a static credential to identity as a continuously re-evaluated claim. 
A workable framework for this — call it Adaptive Machine Trust Architecture, or AMTA — rests on a small number of principles that reinforce each other: Continuous verification. A workload's identity is checked at the moment of each request, not once at startup. Trust isn't a badge you're handed at the door; it's re-earned per transaction. Cryptographic workload identity. Instead of a static API key sitting in an environment variable, a workload is issued a short-lived, cryptographically verifiable identity document — the SPIFFE Verifiable Identity Document (SVID) model is the clearest existing implementation of this idea — that ties the identity to what the workload is, not to a secret it happens to be holding. Just-in-time authorization. Access is granted for the duration of a task and expires automatically, rather than being provisioned once during a rushed deployment and left in place indefinitely, which is precisely the pattern IDMWorks identified as the root cause of most CI/CD credential compromises. Policy-driven trust decisions. Authorization decisions are externalized to a policy engine that can evaluate context — the requesting workload's identity, its recent behavior, the sensitivity of the resource — rather than being baked into application code as a hardcoded allow-list. Identity lifecycle management. Every machine identity has a documented owner, a defined purpose, and an expiration path. The absence of exactly this — what IDMWorks calls "no ownership model" — is the single most commonly cited root cause across the non-human identity breaches of the last three years. Continuous attestation. The system periodically re-proves that a workload is still what it claims to be — still running the expected code, in the expected environment — rather than trusting a credential indefinitely once it's issued. None of these principles is exotic on its own. 
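To illustrate how small the individual pieces are, here is a minimal sketch of three of the principles above: just-in-time authorization, continuous verification, and policy-driven decisions. It deliberately does not model cryptographic workload identity or attestation, which need a real system such as SPIRE. The POLICY table, the Grant type, the workload and resource names, and both functions are hypothetical and exist only for this example; time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass

# Hypothetical policy table: (workload, action, resource) triples that are allowed.
POLICY = {
    ("billing-api", "read", "invoices-db"),
    ("deploy-bot", "write", "artifact-store"),
}


@dataclass(frozen=True)
class Grant:
    workload: str      # identity the grant was issued to
    action: str
    resource: str
    expires_at: float  # absolute expiry time, in seconds


def issue_grant(workload, action, resource, now, ttl_seconds=300):
    """Just-in-time authorization: a grant is scoped to one task and expires on its own."""
    if (workload, action, resource) not in POLICY:
        raise PermissionError(f"policy does not allow {workload} to {action} {resource}")
    return Grant(workload, action, resource, now + ttl_seconds)


def authorize(grant, workload, action, resource, now):
    """Continuous verification: every request is re-checked, not just the first one."""
    if now >= grant.expires_at:
        return False  # expired grants are never honored
    if (grant.workload, grant.action, grant.resource) != (workload, action, resource):
        return False  # a grant covers exactly one workload/action/resource
    # Re-evaluate policy at request time so a revoked rule takes effect immediately.
    return (workload, action, resource) in POLICY
```

A caller would request a grant right before a task with issue_grant and pass it to authorize on every request. Removing a triple from POLICY invalidates outstanding grants on their next use, which is the behavior a static API key cannot offer.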
What's new is treating them as a single coherent architecture for machine trust, instead of a scattered collection of best practices that get implemented inconsistently across teams. Section 4: What Implementation Actually Looks Like The tooling to build this exists today, and it's more mature than most security teams realize. SPIFFE and its reference implementation, SPIRE, provide the identity layer: workloads receive short-lived X.509 or JWT SVIDs based on attested properties of the environment they're running in — the specific pod, the specific node, the specific Kubernetes namespace — rather than a secret baked into a config file. A workload requesting an SVID doesn't present a password; it presents proof of what it is, and SPIRE's server verifies that against a registration policy before issuing anything. In a service mesh like Istio, this identity layer can be paired with mutual TLS enforced at the sidecar proxy, so every service-to-service call is authenticated and encrypted without the application code needing to know anything about certificates. Authorization decisions can be externalized to Open Policy Agent, letting teams write access policy as code — reviewable, versioned, testable — instead of scattering if user.role == 'admin' checks through a codebase. For software supply chain integrity — relevant given that the tj-actions/changed-files compromise spread through a CI/CD pipeline — Sigstore's Cosign and Fulcio provide a way to sign build artifacts and verify their provenance using short-lived certificates tied to an OIDC identity, rather than a long-lived signing key that itself becomes another secret to protect. None of this is a rip-and-replace project. 
Teams typically start by identifying their highest-value machine credentials — the ones with production database access, the ones with broad cloud IAM permissions — and migrating those first to short-lived, attested identities, while instrumenting logging so that every machine identity's access can actually be reviewed rather than assumed. What This Looks Like When Someone Actually Ships It The architecture described above isn't hypothetical. Pinterest has publicly documented using SPIFFE alongside its internal secrets-management system, Knox, specifically to solve identity in a multi-tenant environment where workloads from different teams share infrastructure and can't be trusted by network location alone. Square presented its adoption of SPIFFE and SPIRE at a SPIFFE Community Day, describing how it used the framework to secure communication across a hybrid infrastructure — cloud and on-premises systems that previously had no consistent way to authenticate to each other. Uber's security team gave a KubeCon talk walking through why it built an internal workload identity platform on these same principles, and ByteDance has separately documented replacing a homegrown certificate system with SPIRE to get PKI-based authentication working at the scale TikTok's infrastructure requires. The common thread across all four is the same one this piece has argued from the incident side: none of them adopted workload identity because a compliance checkbox required it. They adopted it because operating at their scale made network-location-based trust and long-lived shared secrets genuinely unworkable — the same pressure that's now reaching far smaller organizations as their own machine identity counts climb. 
The trade-off they all had to work through in public is worth naming honestly: SPIRE introduces real operational overhead, including a server and agent fleet to run, node and workload attestation to configure correctly for each hosting environment, and a learning curve for teams used to thinking about secrets rather than attested identity. None of the public talks describe it as a drop-in replacement. They describe it as an infrastructure investment that pays off once the number of services and the rate of change outgrow what static credentials can manage safely.

Section 5: The Metrics That Actually Indicate Progress

Security leaders asking for budget need numbers, not architecture diagrams. The ones worth tracking:

- Mean credential lifetime – how long, on average, does a machine credential remain valid before rotation or expiration? GitGuardian's finding that some leaked keys remained live for months is really a mean-lifetime failure.
- Percentage of workloads using attested workload identity versus static, long-lived secrets – this is the single clearest proxy for how exposed an environment is to the failure pattern behind the xAI and tj-actions incidents.
- Secret rotation frequency, measured against an actual policy rather than an aspirational one.
- Unauthorized service-to-service request rate – a signal that requires the behavioral baselining IDMWorks flagged as largely absent today.
- Credential exposure rate in code, tickets, and chat – Snyk's research found leaks occurring in Slack messages, Jira tickets, and Confluence pages at meaningful rates, not just in source code, so this metric has to look beyond the repository.
- Policy compliance rate for third-party OAuth integrations – the exact control gap that let the Salesloft Drift tokens retain broad, long-lived access to Salesforce, Google Workspace, and AWS simultaneously.
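Two of the metrics above, mean credential lifetime and the share of workloads on attested identity, reduce to simple arithmetic once an inventory exists. The sketch below assumes a hypothetical in-memory inventory; the MachineCredential fields, the sample entries, and the function names are illustrative only and are not drawn from any of the reports cited in this piece. In practice the data would come from a cloud provider's IAM APIs or a secrets manager.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class MachineCredential:
    name: str
    issued_at: datetime
    expires_at: Optional[datetime]  # None means no expiration policy
    attested: bool                  # True for attested identity, False for a static secret
    owner: Optional[str]            # accountable human or team, if any


def mean_lifetime_days(creds, as_of):
    """Average validity window in days.

    Credentials with no expiry are counted up to `as_of`, which understates
    their true lifetime, so treat the result as a lower bound.
    """
    spans = [((c.expires_at or as_of) - c.issued_at) / timedelta(days=1) for c in creds]
    return mean(spans)


def attested_share(creds):
    """Fraction of credentials backed by attested workload identity."""
    return sum(c.attested for c in creds) / len(creds)


def unowned(creds):
    """Names of credentials with no accountable owner."""
    return [c.name for c in creds if c.owner is None]


# Hypothetical sample inventory.
AS_OF = datetime(2026, 1, 1)
INVENTORY = [
    MachineCredential("ci-deploy-key", datetime(2025, 1, 1), None, False, None),
    MachineCredential("payments-svid", datetime(2025, 12, 31), datetime(2026, 1, 1), True, "payments-team"),
    MachineCredential("crm-oauth-grant", datetime(2025, 7, 5), datetime(2026, 1, 1), False, "sales-ops"),
]

if __name__ == "__main__":
    print("mean lifetime (days):", mean_lifetime_days(INVENTORY, AS_OF))
    print("attested share:", attested_share(INVENTORY))
    print("unowned:", unowned(INVENTORY))
```

The remaining metrics, such as unauthorized request rate and exposure in tickets and chat, depend on log and scanning data rather than an inventory, so they are left out of this sketch.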
The Reports Keep Saying the Same Thing Independently It's worth pausing on how many separate organizations, using separate datasets, landed on the same conclusion in the same twelve-month window. Verizon's 2025 Data Breach Investigations Report — built from 22,052 incidents across 139 countries, the kind of dataset no single vendor could assemble on its own — found 441,780 exposed secrets sitting in public code repositories, with a median remediation time of 94 days once discovered. Nearly half of those were high-privilege Google Cloud API keys tied to automated infrastructure, not human logins. GitGuardian's own 2026 research, working from a different pipeline entirely, arrived at the same order of magnitude: 28.65 million new hardcoded secrets added to public GitHub in 2025 alone. IDMWorks, analyzing three years of non-human identity incidents rather than scanning code, described the same underlying failure in different language: no ownership model, no rotation cadence, detection tooling built for human login patterns that generates nothing but noise against machine behavior. Snyk's research adds the vector most of these reports don't emphasize enough — over a quarter of credential incidents originate entirely outside source code, in Slack messages, Jira tickets, and Confluence pages. Different data sources, different methodologies, different commercial incentives — and all of them converge on the same sentence: non-human identities are growing faster than the governance built to manage them. That's not a marketing claim from any one vendor. It's what independent datasets keep saying when you line them up next to each other. What the Trajectory Actually Implies I won't pretend to know the machine-to-human identity ratio a cloud-native enterprise will have in 2030 — nobody has the longitudinal data to state that number with confidence, and treating a guess as a fact would undercut everything else in this piece. 
What can be said with more confidence is the direction and the reason. Every driver behind today's machine identity growth (CI/CD automation, service mesh adoption, multi-cloud workload identity, and now AI agents authenticating to APIs on an organization's behalf) is accelerating, not leveling off. AI agents in particular are a new category of machine identity, not just more volume in an existing one: an agent can be granted a credential, use it in ways its creator never explicitly authorized, and, in the case of the Common Crawl training-data exposure, potentially reproduce a credential it was never supposed to have seen in the first place. If the ratio of machine to human identities is already in the double digits at a typical mid-sized cloud shop today, as the SPIFFE/SPIRE and workload-identity adoption patterns suggest, then adding an autonomous-agent layer on top doesn't nudge that ratio; it compounds it. The honest prediction isn't a specific number for 2030. It's that any organization treating machine identity governance as a 2026 problem to revisit later is already behind a curve that isn't slowing down.

A Rough Map of How Organizations Get Here

Most organizations don't leap from careful to reckless. They drift through recognizable stages, usually without anyone deciding to:

Plain Text

    Centralized human IAM
      ↓
    Cloud IAM roles multiply per service
      ↓
    Service accounts proliferate, ownership blurs
      ↓
    Workload identity adopted for some, not all, systems
      ↓
    AI agents added as a new identity class
      ↓
    Identity sprawl outpaces any team's ability to inventory it
      ↓
    Machine Identity Debt crosses into Identity Bankruptcy
      ↓
    Incident forces the reconciliation that governance should have

Most organizations reading this are somewhere between stage two and stage five.
Very few have consciously decided which stage they're in — which is itself the point: nobody plans to reach identity bankruptcy; they just never stop to check how much debt they've taken on since the last audit. Section 6: Where This Goes Next A few developments will make machine trust an even sharper problem over the next few years rather than a solved one. Confidential computing — running workloads inside hardware-enforced trusted execution environments — is moving from research curiosity to something cloud providers offer as a standard instance type, which will let attestation extend down to the hardware layer rather than stopping at the software identity. AI agents that authenticate to APIs and take autonomous action on an organization's behalf are a new and rapidly growing category of machine identity, and the data-poisoning risk is already visible: Truffle Security's scan of Common Crawl's December 2024 archive, covering roughly 400 terabytes of public web data, found close to 12,000 live, working credentials embedded in text that's now part of the training data feeding future models. An AI system trained on that data can, in principle, reproduce or act on a credential it was never supposed to have. Post-quantum cryptography considerations will eventually reach workload identity systems, since the SVIDs and certificates underpinning frameworks like SPIFFE rely on cryptographic assumptions that are being reassessed industry-wide. And identity graphs — mapping which machine identities can reach which resources, and through which chains of trust — are becoming the tool that lets a security team answer the question that mattered most in the Salesloft Drift breach: not "was this OAuth token valid," but "what could this token reach, and did anyone actually decide it should be able to?" The Question Worth Asking The organizations that got hit in 2025 weren't running unpatched software or ignoring known vulnerabilities. 
Cloudflare, Palo Alto Networks, and Zscaler — security vendors with mature programs — were among the hundreds caught in the Salesloft Drift breach. The tokens that got them were valid. The access was, technically, authorized. That's what makes machine identity the harder problem: it doesn't fail loudly. The practical shift for engineering leadership is to stop asking "who is the user?" as the primary security question and start asking "can this workload be trusted right now, for this specific action?" That means building ownership records for every service account before an incident forces the question, migrating high-value credentials to short-lived attested identities before a leaked key becomes a header on KrebsOnSecurity, and treating third-party OAuth grants with the same scrutiny given to a new employee's laptop. None of this is speculative. Every incident cited here happened in the past eighteen months, to organizations with real security budgets. The architecture to prevent the next one already exists. What's missing, in most companies, is the decision to build it before the postmortem forces the issue. The pattern underneath every breach in this piece is the same one: a credential nobody was actively watching, doing exactly what it was built to do, for whoever happened to be holding it. Passwords get the attention because a stolen password is a story people understand — a human made a mistake, or got tricked. A stolen service-account token is a harder story to tell, because the mistake happened months earlier, in a decision nobody remembers making, and the debt just sat there accruing until someone else cashed it in. Paying that debt down before it's due is a less dramatic project than responding to a breach. It's also the only version of this problem that ends with a postmortem you never have to write. 
Sources

- Google Cloud / Google Threat Intelligence Group, "Widespread Data Theft Targets Salesforce Instances via Salesloft Drift," August 26, 2025
- The Hacker News, "Salesloft OAuth Breach via Drift AI Chat Agent Exposes Salesforce Customer Data," August 28, 2025
- Anomali, "Reviewing the Salesforce–Salesloft Drift OAuth Supply Chain Breach," December 2025
- Guardz, "The Salesloft Drift Breach and the Impact on Google Workspace," September 2025
- Defakto, "xAI API Key Leak by DOGE Staffer Reveals Cracks in API Security," December 2025
- Snyk, "Why 28 million credentials leaked on GitHub in 2025, and what to do about it," March 2026
- Aembit, "Real-Life Examples of Non-Human Identity Security Breaches," updated regularly
- IDMWorks, "When Service Accounts Attack: How Identities are Weaponized," May 2026
- PointGuard AI, "AI Training Data Secret Leak 2025 | 12,000 API Keys Exposed," January 2026
- CybelAngel, "API Threat Report 2025: Key Findings for Security Teams," March 2026
- United States v. Sullivan, 9th Cir. 2025 (referenced via Snyk's legal-consequences analysis, above)
- Verizon, "2025 Data Breach Investigations Report" (18th edition; 22,052 incidents, 139 countries)
- GitGuardian, "The Secrets Sprawl is Worse Than You Think: Key Takeaways from the 2025 Verizon DBIR," April 2025
- SPIFFE Project, "Case Studies" (Pinterest, Square, Uber, ByteDance talks)

By Igboanugo David Ugochukwu

CORE

Database Normalization, ACID Properties, and SCDs: A Comprehensive Guide

Database Normalization: Balancing Structure and Performance

Normalization is a systematic approach to organizing database structures to minimize redundancy and improve data integrity. While theoretical normalization extends to six normal forms, most real-world database implementations target the third normal form (3NF) as the optimal balance between structural integrity and performance.

Benefits and Drawbacks of Normalization

| Advantages | Disadvantages |
| --- | --- |
| Minimizes data redundancy | May require complex joins |
| Prevents update anomalies | Can impact query performance |
| Enhances data consistency | May increase development complexity |
| Reduces storage requirements | Requires more tables to represent relationships |
| Simplifies data maintenance | May require more complex indexing strategies |

First Normal Form (1NF)

Definition: A table is in 1NF when all columns contain atomic (indivisible) values, and there are no repeating groups.

Key principle: Each column must contain only one value for each row.

Example: Consider a university database tracking student contact information:

Before 1NF (Non-normalized):

| StudentID | Name | Email |
| --- | --- | --- |
| S001 | John Smith | [email protected], [email protected] |
| S002 | Maria Garcia | [email protected] |

The above table violates 1NF because the Email column contains multiple values for student S001.

After 1NF:

Students table:

| StudentID | Name |
| --- | --- |
| S001 | John Smith |
| S002 | Maria Garcia |

Student emails table:

| StudentID | Email |
| --- | --- |
| S001 | [email protected] |
| S001 | [email protected] |
| S002 | [email protected] |

This approach resolves the multi-valued attribute issue by creating a separate table for email addresses, ensuring each cell contains exactly one value.

Second Normal Form (2NF)

Definition: A table is in 2NF when it is in 1NF, and all non-key attributes are fully functionally dependent on the primary key.

Key principle: No partial dependencies on a composite primary key.

Example: Consider a database tracking university course enrollments:

In 1NF but not 2NF:

| StudentID | CourseID | CourseName | InstructorID | InstructorName | EnrollmentDate |
| --- | --- | --- | --- | --- | --- |
| S001 | C101 | Database Design | I201 | Dr. Anderson | 2023-01-15 |
| S001 | C102 | Data Structures | I202 | Dr. Zhang | 2023-01-16 |
| S002 | C101 | Database Design | I201 | Dr. Anderson | 2023-01-14 |

The primary key is the composite (StudentID, CourseID), but CourseName and InstructorID depend only on CourseID, not the full primary key. InstructorName, in turn, depends only on InstructorID.

After 2NF:

Enrollments table:

| StudentID | CourseID | EnrollmentDate |
| --- | --- | --- |
| S001 | C101 | 2023-01-15 |
| S001 | C102 | 2023-01-16 |
| S002 | C101 | 2023-01-14 |

Courses table:

| CourseID | CourseName | InstructorID |
| --- | --- | --- |
| C101 | Database Design | I201 |
| C102 | Data Structures | I202 |

Instructors table:

| InstructorID | InstructorName |
| --- | --- |
| I201 | Dr. Anderson |
| I202 | Dr. Zhang |

This design eliminates partial dependencies by creating separate tables for courses and instructors, ensuring all attributes in each table depend on the entire primary key.

Third Normal Form (3NF)

Definition: A table is in 3NF when it is in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

Key principle: No transitive dependencies.

Example: Continuing with our university database:

In 2NF but not 3NF:

Courses table:

| CourseID | CourseName | Department | DepartmentHead |
| --- | --- | --- | --- |
| C101 | Database Design | Computer Science | Dr. Johnson |
| C102 | Data Structures | Computer Science | Dr. Johnson |
| C103 | Organizational Behavior | Business | Dr. Williams |

Here, DepartmentHead depends on Department, not directly on the primary key CourseID.

After 3NF:

Courses table:

| CourseID | CourseName | DepartmentID |
| --- | --- | --- |
| C101 | Database Design | D001 |
| C102 | Data Structures | D001 |
| C103 | Organizational Behavior | D002 |

Departments table:

| DepartmentID | Department | DepartmentHead |
| --- | --- | --- |
| D001 | Computer Science | Dr. Johnson |
| D002 | Business | Dr. Williams |

This restructuring eliminates transitive dependencies by creating a separate Departments table.
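To make the 3NF result concrete, here is a minimal sketch that builds the Courses and Departments tables from the example in an in-memory SQLite database using Python's standard sqlite3 module. The use of SQLite and Python, the helper names build_3nf_schema and course_heads, and the replacement head 'Dr. Lee' are assumptions made purely for illustration; none of them come from the original example.

```python
import sqlite3


def build_3nf_schema(conn: sqlite3.Connection) -> None:
    """Create and populate the 3NF tables from the university example."""
    conn.executescript("""
        CREATE TABLE Departments (
            DepartmentID   TEXT PRIMARY KEY,
            Department     TEXT NOT NULL,
            DepartmentHead TEXT NOT NULL
        );
        CREATE TABLE Courses (
            CourseID     TEXT PRIMARY KEY,
            CourseName   TEXT NOT NULL,
            DepartmentID TEXT NOT NULL REFERENCES Departments(DepartmentID)
        );
        INSERT INTO Departments VALUES
            ('D001', 'Computer Science', 'Dr. Johnson'),
            ('D002', 'Business', 'Dr. Williams');
        INSERT INTO Courses VALUES
            ('C101', 'Database Design', 'D001'),
            ('C102', 'Data Structures', 'D001'),
            ('C103', 'Organizational Behavior', 'D002');
    """)


def course_heads(conn: sqlite3.Connection) -> list:
    """Recover each course's department head with a join."""
    return conn.execute("""
        SELECT c.CourseID, d.DepartmentHead
        FROM Courses AS c
        JOIN Departments AS d ON d.DepartmentID = c.DepartmentID
        ORDER BY c.CourseID
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_3nf_schema(conn)

# Hypothetical change: the Computer Science head is replaced.
# One row is updated, and every course in that department reflects it.
conn.execute(
    "UPDATE Departments SET DepartmentHead = ? WHERE DepartmentID = ?",
    ("Dr. Lee", "D001"),
)
print(course_heads(conn))
```

Because DepartmentHead is stored once per department, the single UPDATE is reflected in every Computer Science course. In the pre-3NF Courses table, the same change would have to touch every matching course row, which is the update anomaly that 3NF removes.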
The APT Mnemonic: Remembering Normal Forms

A simple way to remember the first three normal forms is using the mnemonic APT:

- A: Atomic values (1NF)
- P: Partial dependencies eliminated (2NF)
- T: Transitive dependencies eliminated (3NF)

De-Normalization: When Performance Matters More

De-normalization intentionally introduces redundancy to improve query performance, particularly beneficial for read-heavy analytical workloads.

Key Use Cases

- Data warehousing
- Business intelligence systems
- Reporting applications
- OLAP (Online Analytical Processing)

Example: Consider a de-normalized sales reporting table:

| OrderID | CustomerName | Region | ProductID | ProductName | Category | Quantity | UnitPrice | TotalAmount | OrderDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| O1001 | Acme Corp | West | P101 | Server | Hardware | 2 | 3000.00 | 6000.00 | 2023-03-15 |
| O1002 | TechSoft | East | P102 | Database License | Software | 10 | 500.00 | 5000.00 | 2023-03-16 |

This single table stores redundant information (like ProductName, Category, etc.) but enables faster reporting queries by eliminating joins.

ACID Properties: Ensuring Transaction Reliability

ACID properties are fundamental guarantees provided by database management systems to ensure reliability during transaction processing.

Atomicity

Definition: A transaction must be treated as an indivisible unit. Either all operations succeed, or none take effect.
Example: Consider a banking application transferring funds between accounts:

SQL

    -- SQL Server: abort and roll back the whole transaction if any statement fails
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;

    -- Deduct $500 from savings account
    UPDATE Accounts
    SET Balance = Balance - 500
    WHERE AccountID = 'SAV-1001' AND AccountType = 'Savings';

    -- Add $500 to checking account
    UPDATE Accounts
    SET Balance = Balance + 500
    WHERE AccountID = 'CHK-2001' AND AccountType = 'Checking';

    -- Record the transfer in transactions history
    INSERT INTO Transactions (
        TransactionID, FromAccount, ToAccount, Amount, TransactionDate
    ) VALUES (
        NEWID(), 'SAV-1001', 'CHK-2001', 500, GETDATE()
    );

    COMMIT;

If any step fails (e.g., due to a constraint violation or system error), the entire transaction is rolled back, ensuring the balance remains consistent across accounts.

Consistency

Definition: Transactions must transform the database from one valid state to another, maintaining all predefined integrity constraints.

Example: In an inventory management system:

SQL

    -- SQL Server: abort and roll back the whole transaction if any statement fails
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;

    -- Customer orders 5 units of product 'P1001'
    INSERT INTO Orders (
        OrderID, CustomerID, ProductID, Quantity
    ) VALUES (
        'ORD-5001', 'CUST-101', 'P1001', 5
    );

    -- Update inventory (assume current stock is 3 units)
    UPDATE Inventory
    SET QuantityInStock = QuantityInStock - 5
    WHERE ProductID = 'P1001';

    COMMIT;

If there's a constraint that prevents negative inventory, this transaction will fail because it would result in -2 units in stock. The database remains in a consistent state by preventing the invalid transaction.

Isolation

Definition: Concurrent transactions must not interfere with each other, with each transaction acting as if it were the only operation being performed on the database.

Example: Two transactions attempting to update the same customer record:

Transaction 1:

SQL

    BEGIN TRANSACTION;

    UPDATE Customers
    SET CreditLimit = CreditLimit + 1000
    WHERE CustomerID = 'CUST-101';

    -- Other operations...

    COMMIT;

Transaction 2:

SQL

    BEGIN TRANSACTION;

    UPDATE Customers
    SET Status = 'Premium'
    WHERE CustomerID = 'CUST-101';

    -- Other operations...

    COMMIT;

With proper isolation, the final state of the customer record will include both changes, regardless of execution order, preventing lost updates or inconsistent reads.

Durability

Definition: Once a transaction is committed, its effects must persist even in the event of system failures.

Example: After a completed payment transaction:

SQL

    BEGIN TRANSACTION;

    -- Process payment
    UPDATE Orders
    SET PaymentStatus = 'Paid'
    WHERE OrderID = 'ORD-5001';

    -- Record payment in financial system
    INSERT INTO Payments (
        PaymentID, OrderID, Amount, PaymentDate
    ) VALUES (
        'PAY-9001', 'ORD-5001', 1250.00, GETDATE()
    );

    COMMIT;

After commit, the payment record must persist even if there's a power outage or system crash. This is typically achieved through write-ahead logging, transaction logs, and database recovery mechanisms.

Slowly Changing Dimensions (SCDs): Managing Historical Data

SCDs are techniques used in data warehousing to manage dimension attributes that change over time, enabling historical analysis and reporting.

SCD Type 0: Retain Original

Definition: No changes are made to historical data once loaded.

Example: A ProductID dimension where the assigned identifier never changes.

Plain Text

    ProductID: 1001
    SKU: "WIDGET-A"
    Category: "Hardware"

Even if the product's category changes in the source system, the data warehouse retains the original classification for consistency in historical reporting.

SCD Type 1: Overwrite

Definition: The current value replaces the previous value without maintaining history.

Example: A customer's contact information that needs to be current but doesn't require historical tracking.
Before Update:

Plain Text

    CustomerID: C1001
    Name: John Smith
    Email: [email protected]
    Address: 123 Main St

After Update:

Plain Text

    CustomerID: C1001
    Name: John Smith
    Email: [email protected]
    Address: 456 Oak Ave

The previous email and address are completely overwritten, leaving no record of the historical values.

SCD Type 2: Add New Row

Definition: Maintains full history by adding new records with effective date ranges.

Example: Employee position changes within an organization.

| EmployeeID | VersionID | Name | Department | Position | EffectiveStartDate | EffectiveEndDate | IsCurrent |
| --- | --- | --- | --- | --- | --- | --- | --- |
| E101 | 1 | Sarah Johnson | Marketing | Specialist | 2022-01-15 | 2023-03-31 | False |
| E101 | 2 | Sarah Johnson | Marketing | Manager | 2023-04-01 | NULL | True |

This approach allows querying the employee's position at any point in time, supporting historical analysis.

SCD Type 3: Add New Attribute

Definition: Maintains limited history by adding columns for previous values.

Example: A product dimension tracking category changes:

| ProductID | ProductName | CurrentCategory | PreviousCategory | CategoryChangeDate |
| --- | --- | --- | --- | --- |
| P1001 | Widget X | Electronics | Hardware | 2023-02-15 |
| P1002 | Gadget Y | Accessories | NULL | NULL |

This method preserves only the most recent change but provides a simple way to track when the change occurred.

Choosing the Right Approach

When to Normalize

- For transactional systems (OLTP)
- When data integrity is paramount
- When storage efficiency matters
- When changes are frequent

When to De-Normalize

- For analytical systems (OLAP)
- When query performance is critical
- When reads significantly outnumber writes
- For reporting databases

When to Use Different SCD Types

- Type 0: For dimensions that never change
- Type 1: For dimensions where only current values matter
- Type 2: When complete historical tracking is required
- Type 3: When limited historical tracking is sufficient

Conclusion

Database design requires balancing competing concerns: structural integrity, performance, historical tracking, and transactional reliability.
By understanding normalization principles, ACID properties, and SCD techniques, database professionals can make informed decisions that serve their specific application requirements while maintaining data quality and system performance. These concepts form the backbone of successful database architecture, allowing systems to efficiently manage the ever-increasing volume and complexity of modern data landscapes.
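As a closing illustration of the SCD Type 2 pattern described earlier, the sketch below replays the employee example in an in-memory SQLite database using Python's sqlite3 module. The table name EmployeeDim, the helpers change_position and position_on, and the choice of SQLite are assumptions for illustration only. The two writes (closing the old row, adding the new one) run in a single transaction, which also ties the pattern back to atomicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmployeeDim (
        EmployeeID         TEXT    NOT NULL,
        VersionID          INTEGER NOT NULL,
        Name               TEXT    NOT NULL,
        Department         TEXT    NOT NULL,
        Position           TEXT    NOT NULL,
        EffectiveStartDate TEXT    NOT NULL,  -- ISO dates compare correctly as text
        EffectiveEndDate   TEXT,              -- NULL while the row is current
        IsCurrent          INTEGER NOT NULL,
        PRIMARY KEY (EmployeeID, VersionID)
    )
""")
conn.execute(
    "INSERT INTO EmployeeDim VALUES (?, ?, ?, ?, ?, ?, NULL, 1)",
    ("E101", 1, "Sarah Johnson", "Marketing", "Specialist", "2022-01-15"),
)
conn.commit()


def change_position(conn, employee_id, new_position, start_date):
    """SCD Type 2: close the current row, then add a new current row."""
    with conn:  # both statements commit together or roll back together
        current = conn.execute(
            "SELECT VersionID, Name, Department FROM EmployeeDim "
            "WHERE EmployeeID = ? AND IsCurrent = 1",
            (employee_id,),
        ).fetchone()
        if current is None:
            raise ValueError(f"no current row for {employee_id}")
        version_id, name, department = current
        # The old version ends the day before the new one starts.
        conn.execute(
            "UPDATE EmployeeDim "
            "SET EffectiveEndDate = date(?, '-1 day'), IsCurrent = 0 "
            "WHERE EmployeeID = ? AND VersionID = ?",
            (start_date, employee_id, version_id),
        )
        conn.execute(
            "INSERT INTO EmployeeDim VALUES (?, ?, ?, ?, ?, ?, NULL, 1)",
            (employee_id, version_id + 1, name, department,
             new_position, start_date),
        )


def position_on(conn, employee_id, as_of):
    """Return the position held on a given ISO date, or None."""
    row = conn.execute(
        "SELECT Position FROM EmployeeDim "
        "WHERE EmployeeID = ? AND EffectiveStartDate <= ? "
        "AND (EffectiveEndDate IS NULL OR EffectiveEndDate >= ?)",
        (employee_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


change_position(conn, "E101", "Manager", "2023-04-01")
print(position_on(conn, "E101", "2022-06-01"))
print(position_on(conn, "E101", "2023-06-01"))
```

The point-in-time lookup in position_on is what Type 2 buys over Type 1: the earlier date still resolves to the old position after the change has been applied.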

By arvind toorpu

CORE

Top 10 Best Places to Prepare for Your Next Data Engineer Interview

Landing a data engineering role means clearing a gauntlet that no other software discipline has to face all at once: airtight SQL, production-grade Python, data modeling instincts, distributed-compute fluency (Spark, warehouses, ETL), and system design that has to survive real data volume. Generic coding prep barely scratches the surface, and "just grind LeetCode" advice falls apart the moment an interviewer asks you to model a slowly changing dimension or reason about a skewed join. So we did the work. We evaluated the resources data engineers actually use, judged on five things that matter: relevance to the DE interview loop, depth of practice, realism of the questions, feedback quality, and price. Below is the ranked list. A quick note on methodology: this ranking favors resources that target the data engineering loop specifically, not generic algorithm grinding. That bias is intentional, and it is why the order may surprise you. 1. DataDriven.io Most "interview prep" platforms were built for generic SWE roles and bolt on a SQL section as an afterthought. This one was built from the ground up for the data engineering loop. The catchphrase you will hear repeated in DE communities is that DataDriven.io is LeetCode for data engineers, and it fits: instead of inverting binary trees, you are writing window functions against realistic schemas, designing star schemas, debugging an ETL transform, and reasoning about partitioning, all in an in-browser SQL and Python sandbox that runs your query against real data and tells you exactly where it broke. 
It is also the rare place where the whole product is built for the job rather than adjacent to it, which is why datadriven.io is great for data engineer interview prep specifically: SQL practice that ramps to multi-CTE analytics, a deep set of Python practice problems, plus data modeling, dimensional modeling, PySpark, and system-design tracks, with execution-based feedback and a difficulty curve that reaches the staff-level questions that actually separate offers from rejections. Verdict: The most targeted, realistic data engineering interview practice available today. Earns the top spot. 2. "Cracking the Coding Interview" (the book, by Gayle Laakmann McDowell) A deserved classic, and intentionally a book rather than a website. CTCI is still the best single artifact for understanding how technical interviews are actually structured: how the conversation flows, how to think out loud so the interviewer can follow your reasoning, how to recover when you get stuck, and how to handle the behavioral and negotiation segments that strong candidates routinely fumble. Most people lose offers not because they could not solve the problem but because they could not show their work, and this book is the canonical fix for that. Where it falls short for our purposes is scope. It will not teach you windowed SQL, slowly changing dimensions, or how to design a lakehouse, and its algorithm focus skews toward generalist software roles rather than the data engineering loop. The data structures and big-O chapters are still worth a pass because algorithm screens do show up, but treat them as a refresher, not your main event. Read CTCI once early in your prep to fix your interview mechanics, internalize the communication patterns, then spend the rest of your time on hands-on, domain-specific platforms. Verdict: Essential reading for interview mechanics; not a substitute for domain practice. 3. 
"Designing Data-Intensive Applications" (the book, by Martin Kleppmann) If CTCI teaches you how to interview, "DDIA" teaches you what a data engineer is actually supposed to know. Replication, partitioning, consistency models, batch versus stream processing, storage engine internals, the failure modes of distributed systems: this is the conceptual backbone of nearly every data engineering system design round. When an interviewer asks why you would choose a log-structured merge tree over a B-tree, or how you would keep two datastores in sync without losing events, the answers live in these pages. It is dense, and it is emphatically not an interview drill book. You will not find practice questions, and you cannot cram it the night before. What it gives you instead is judgment: the candidate who has internalized DDIA answers "how would you design this pipeline" with the calm of someone who has already thought through the tradeoffs, names the failure cases before being prompted, and explains why a choice holds up under real data volume. Read it slowly over weeks, ideally early in your prep, and pair it with a hands-on platform so the concepts attach to actual queries and schemas rather than floating as theory. Verdict: The definitive conceptual reference. Read it slowly, alongside real practice. 4. LeetCode The default destination, and it earns its spot for one practical reason: the Database problem set is sizable, the algorithm catalog is enormous, and the platform's brand means a large share of companies still pull their initial coding screen straight from it. If your target company is known to run a generic algorithm round before the data-specific rounds, you need exposure here, and the sheer volume of problems plus community discussion means you will rarely be surprised by a pattern you have never seen. The catch for data engineers is that LeetCode was built for the algorithm interview, not the DE loop. 
Its SQL section is genuinely solid but secondary; the questions are puzzle-shaped rather than drawn from real schemas, and you will not find data modeling, ETL design, dimensional modeling, or Spark anywhere on the platform. There is also a real failure mode here: candidates over-invest in LeetCode because it is comfortable and gamified, then walk into a DE loop under-practiced on the things that actually decide it. Use it deliberately to clear the algorithm gate and to keep your raw coding sharp, then move the bulk of your hours to resources that target data engineering directly. Verdict: Necessary for the algorithm screen; thin for the data-engineering-specific rounds. 5. HackerRank HackerRank is where a surprising number of companies host their take-home and timed online assessments, so practicing in its environment carries a payoff most resources cannot offer: you get comfortable with the exact editor, the exact test-case runner, and the exact time-pressure UI you may actually be scored in. For an assessment you cannot retake, that familiarity is worth real points, because fighting an unfamiliar interface while the clock runs is a self-inflicted way to lose. Its SQL and problem-solving tracks are beginner-friendly, well-structured, and free to work through. The ceiling, though, is lower than you want for a senior DE loop. The problems lean academic and self-contained rather than job-realistic, the SQL rarely reaches the messy multi-table analytics that real interviews probe, and there is nothing on modeling, pipelines, or system design. The smart way to use HackerRank is as format rehearsal: run a few timed sets so the assessment environment feels routine, then build your actual depth somewhere that mirrors the work. Do not let a green checkmark on an easy problem set convince you that you are loop-ready. Verdict: Great for getting comfortable with the testing environment; limited depth. 6. 
SQLZoo A long-running, completely free interactive SQL tutorial that runs entirely in the browser with no signup, no setup, and no paywall. It walks you from SELECT basics through joins, grouping, subqueries, and window functions, with short hands-on exercises after each concept so you are writing real queries from the first lesson rather than just reading about them. For anyone whose SQL has gone rusty, or who learned it informally and has gaps they cannot quite name, it is the most painless way to rebuild muscle memory before stepping up to interview-grade problems. It is a teaching tool, not an interview platform, and you should treat it as exactly that. The problems stay introductory, the datasets are small and tidy, and there is nothing on data modeling, ETL, pipelines, or system design — the parts of the loop that actually separate data engineers from analysts. Its value is as a fast diagnostic and warm-up: work through the sections that feel shaky, confirm your fundamentals are solid, then graduate to harder, execution-based practice against realistic schemas. Linger here too long, and you will plateau well below where a real interview will push you. Verdict: A friendly free SQL primer; foundational rather than interview-level. 7. "Python for Data Analysis" (by Wes McKinney) Written by the creator of pandas, this is the reference for the kind of data-wrangling Python that shows up constantly in DE take-homes and pairing rounds: reshaping, grouping and aggregating, merging on imperfect keys, handling missing values, parsing dates, and cleaning the kind of messy tabular data that never looks like a tidy LeetCode input. Many data engineering interviews quietly assume this fluency, then hand you a notebook and a dirty CSV and watch how you move; if your Python is sharp on algorithms but clumsy on real data manipulation, this book is exactly the gap-closer. 
It is a library-and-technique book, not interview prep, and it will not touch SQL, data modeling, distributed compute, or system design. There are also no interview questions to grind, which is fine, because its job is to make the tools second nature so that during a timed exercise you are reasoning about the problem instead of fumbling for the right pandas idiom. Read the chapters on data loading, cleaning, and group operations, keep it nearby as a reference, then go apply the techniques in hands-on practice against problems that actually resemble the job. Verdict: The definitive practical Python reference for data work; not a drill book. 8. "Fundamentals of Data Engineering" (the book, by Joe Reis & Matt Housley) Another deliberate book pick, and the best single survey of the modern data engineering lifecycle: generation, ingestion, storage, transformation, and serving, plus the cross-cutting concerns like orchestration, data quality, and governance that interviewers increasingly probe. Where DDIA goes deep on systems internals, this book goes broad on how the pieces fit together into a working data platform, which is precisely the framing you want for the "walk me through how you'd build X" and "what would you consider before choosing this approach" portions of a loop. It is a framework-and-vocabulary book, not a practice book, and that is both its strength and its limit. It will give you the mental model and the shared language to discuss tradeoffs like a practitioner, which makes you sound, accurately, like someone who understands the field. But it contains no exercises, so reading it alone will not build the hands-on skill an interviewer also tests. Use it to organize everything you know into a coherent lifecycle, fill the conceptual gaps, then go write the queries and design the schemas somewhere that gives you real feedback. Verdict: The best lifecycle overview in print; conceptual, not hands-on. 9. 
Mode SQL Tutorial A free, well-regarded interactive SQL tutorial built by an analytics company, which shows in its framing: it teaches SQL the way analysts and engineers actually use it, oriented around answering real questions from data rather than solving abstract puzzles. It runs in the browser, takes you from the basics through intermediate analytics queries including aggregation and the early window-function territory, and the explanations are unusually clear about why a query is shaped the way it is. For someone shoring up SQL foundations before diving into harder problems, it is one of the cleanest no-cost on-ramps available. Like SQLZoo, it is a tutorial rather than an interview-prep platform, so it stops well short of the difficulty a real DE loop will throw at you, and it covers none of the modeling, pipeline, or system-design ground. It is best read as a companion to a hands-on platform: use Mode to internalize the analytical mindset and clean up your SQL fundamentals, then take that foundation into execution-based practice where the problems are harder, the schemas messier, and the feedback tells you exactly where your query went wrong. Verdict: A clean free SQL on-ramp; foundational rather than interview-level. 10. Pramp/Interviewing.io (mock interviews) Rounding out the list: peer and expert mock interviews. All the solo practice in the world cannot reproduce the specific pressure of explaining your reasoning out loud to a real human while a clock runs and someone is judging you, and that pressure is exactly where otherwise-prepared candidates fall apart. A handful of mock loops surface the weaknesses you cannot see in yourself: the long silences, the jumping to code before clarifying the question, the inability to narrate a tradeoff. Pramp pairs you with peers for free, while Interviewing.io connects you with experienced interviewers, often anonymously, for higher-fidelity feedback. The honest limitation is supply and specificity. 
Data-engineering-focused interviewers are scarcer than generalist software ones, so depending on availability, you may land in an algorithm or general system-design mock that only partially mirrors a true DE loop. That is still worth doing, because the communication skills, the structure, the clarifying questions, the calm narration, transfer directly regardless of the exact problem. Schedule one or two once your technical prep is underway, treat the feedback as data, and fix the delivery habits well before the interview that counts. Verdict: Best for rehearsing delivery and nerves; DE-specific matches can be hit-or-miss. How to Actually Use This List You do not need all ten. A focused plan beats a scattered one: Build the foundation. Skim CTCI for interview mechanics and start DDIA for concepts.Do the reps where it counts. Spend the bulk of your time on hands-on, DE-shaped practice that maps directly onto what you will be asked (see #1).Patch specific gaps. Use LeetCode for the algorithm screen, SQLZoo or the Mode tutorial to shore up SQL, and a mock interview or two to rehearse out loud. The candidates who get offers are not the ones who consumed the most content. They are the ones who practiced the actual job. Pick the resources that put you closest to it, start today, and write more queries than you read. Good luck with your loop.

By Rahul Han

Replacing Direct Storage URLs With a Media Proxy at Scale

The first enterprise client to use our automated reporting feature filed an escalation. Every image in their weekly email was a broken red X, every single one. No thumbnails rendered, so the report was effectively unusable. There was no bug. The code did exactly what I'd designed it to do months earlier, which was the problem. I had stored raw third-party storage URLs in our database and passed them straight to the frontend. Image tags, video tags, email templates — all of them pointed at someone else's domain. It worked in browsers. Then it hit a corporate email client with a strict image-domain allowlist. Our third-party hosts looked like tracking domains, so the client blocked every image. That was the first failure mode. The second showed up when an external media provider purged a batch of old assets from its CDN and thousands of images in our app broke overnight. The third came from a security review that flagged our public bucket URLs. We were running a multi-tenant product where customer data isolation was a hard requirement, and a public bucket clearly failed that test. The scale mattered. The serving layer I owned had grown past a few hundred million external media files, north of a hundred terabytes, with hundreds of thousands of new files landing daily. At that volume, there's no such thing as an edge case. A 0.01% breakage rate sounds like rounding error until you do the math and realize it's tens of thousands of broken images sitting in front of paying customers. So I rebuilt it. Why Storing Raw Media URLs Fails Here is why the store-the-URL, serve-the-URL pattern breaks down in production. You Don't Control Access Point your frontend at a third-party URL or a raw bucket path, and you've given up your access control layer. 
Your options are either a public bucket, which gives up access control entirely, or short-lived signed URLs (ref) that expire on a schedule that now becomes part of your application contract and breaks cached pages, bookmarks, and emailed links. Email Clients Silently Destroy Your Images This failure mode is easy to miss if you test in browsers and not in corporate mail clients. Corporate email clients keep allowlists of trusted image domains. Embed report images hosted on <storage.thirdparty.com> and the client blocks them (ref) silently. No HTTP error, no bounce, nothing in your logs. The recipient sees blank squares, and you see a perfectly healthy dashboard. I burned two days on this. Two days convinced me it was a rendering bug because the images loaded fine in every browser I tried. The mail client was the one saying no. Once I traced it to domain allowlists, the requirement changed: the client had to request media from our domain, not a storage host. Dead Links are Someone Else's Decision Storing a third-party URL as your canonical reference means trusting an external host to keep that exact path alive forever. They won't. Ad platforms rotate assets on their own schedule. A CDN vendor will deprecate an endpoint mid-migration and storage providers sunset whole APIs on their own timeline, taking your stored paths down with them. Each of those events 404s your stored URL with zero warning and zero fallback. You can't monitor hundreds of millions of external URLs for availability. At production scale, the stable option is to ingest the bytes and stop treating third-party URLs as durable identifiers. The Fix: Make It a Routing Problem Here's the reframe that unlocked the design. We had been treating media serving as a storage problem. It was really a routing problem. We put an API gateway at the edge, a custom proxy service behind it, and kept storage details out of every client. The client never sees a bucket name, a storage path, or a signed URL. 
The client knows one thing: a UUID-backed endpoint on our own domain.

https://api.ourdomain.com/media?action=display&id=<uuid>

A UUID request resolves to streamed bytes. The client never sees internal storage topology. The proxy service authenticates the request, resolves the UUID against a distributed relational database to get the internal blob ID, and hands the blob ID and headers to the file service, which fetches and streams the bytes. That file service reads customer blobs with delegated privilege tokens that are scoped and short-lived.

Now the email client sees our domain, not some storage host it doesn't trust. Access control lives in one place. And when storage changes later, we swap it behind the proxy instead of rewriting frontend URLs or breaking customer integrations.

The Metadata Layer

Routing needs a fast lookup table. Every asset lives as a row in a distributed relational database, picked for strong consistency and low-latency reads across regions. These lookups sit in the hot path of every page render and every report, so this choice matters.

SQL

    CREATE TABLE MediaAsset (
        media_uuid        VARCHAR(36)  NOT NULL,
        tenant_id         VARCHAR(50)  NOT NULL,
        source_system     VARCHAR(50),
        media_blob_id     VARCHAR(255) NOT NULL,
        thumbnail_blob_id VARCHAR(255),  -- preview JPEG for videos
        media_format      VARCHAR(20),   -- e.g., 'IMAGE' or 'VIDEO'
        created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (media_uuid)
    );

The proxy queries by UUID, pulls media_blob_id and media_format, and hands the blob reference to the file service with the scoped token. The UUID is the only identifier the outside world touches. Bucket, region, and replication tier all stay private.

One deliberate call: The primary key is a high-entropy UUID, not a time-sortable one. In a single-node database with a clustered B-tree index, this would be a mistake; high-entropy keys fragment the index, and you'd reach for something time-sortable like UUIDv7 or ULID instead.
In a range-sharded distributed database, the logic flips. Time-ordered keys pile every insert onto the tail shard and create a write hotspot. A high-entropy UUID spreads inserts across the keyspace, which is what lets parallel pipeline workers write without piling onto one shard. The section below shows how the UUIDs are computed. The Ingestion Pipeline: How the Bytes Get There So far we’ve only talked about the serving path. The other half of the system is the ingestion pipeline that pulls media in, stores the bytes, and generates the thumbnail data the next section depends on. When an upstream system surfaces a new asset, say an ad platform publishes an image or a partner API pushes a video reference, a task lands in a queue. Distributed workers pull from it. Per asset, a worker: Fetches the raw bytes from the external URL while it's still live.Detects the format (image vs. video, MIME type, resolution) from file headers.Writes the bytes to internal object storage and gets back a 'blob_id'.Inserts the metadata row: UUID, blob ID, detected format, and thumbnail_blob_id = null if thumbnail generation has not happened yet. Then an async enrichment pass, only when needed: 5. Generates derived assets. For video, that means extracting a preview frame into a separate JPEG blob, the thumbnail_blob_id. Pro tip: don't grab the literal first frame; it's a black fade-in on most videos. Sample from around the one-second mark instead. 6. Updates the metadata row with the thumbnail_blob_id. The UUID is derived deterministically by hashing three inputs together: the tenant, the source system, and the external asset ID. That gives us two properties at once. First, the same upstream asset always maps to the same UUID no matter how many times its task message gets delivered, which is what makes idempotency work. Second, the hash output is high-entropy enough to spread inserts across the keyspace instead of piling them onto one shard. 
Folding the tenant into the hash means two customers ingesting the same upstream asset get different UUIDs and separate rows, which is what allows for tenant-scoped lookup in the routing logic. In practice, the same external asset shows up in multiple task messages all the time: retries after transient failures, duplicate events from upstream. The worker computes the UUID, checks whether the row already exists, and skips if it does. This keeps the blob store free of duplicate copies without distributed locking. Parallel workers can insert metadata rows concurrently with no central sequence to fight over. The pipeline writes, the proxy reads, and the UUID is the contract between them. I got one thing wrong in the first version: fetch-and-store and thumbnail generation ran in the same worker, so one slow video frame extraction would block the queue behind it and delay plain image ingestion. Splitting that into two stages, a fast store pass and an async enrichment pass, cut image ingest latency by more than half. That split opens a window where an asset is servable, but its thumbnail_blob_id is still null because enrichment has not caught up yet. The proxy handles that state by returning a neutral placeholder image instead of a 404. Browsers render a placeholder, and the real thumbnail shows up on the next request after enrichment lands. What the Proxy Enabled in the Serving Layer Once the proxy was in place, several frontend workarounds turned into normal server-side code paths. The underlying problem was the serving path. We just hadn't isolated it before. Video Thumbnails Our app shows dense grids of assets, lots of them videos. Loading full video files to paint a preview grid is wasteful, and the client-side workarounds (lazy loading, placeholders, canvas frame extraction in the browser) all fell over at our data volume. On a 50-video grid, preview rendering was visibly slow. With the proxy in the middle, one 'action' parameter was all we needed.
The frontend asks for action=thumbnail, the proxy service skips the video entirely and returns that pre-generated preview JPEG the ingest pipeline already stored. It becomes the <video> poster attribute. Grid paints instantly, video loads on play. Before this, frame extraction happened in the user path, over and over for the same video. Now it happens once at ingest and every viewer reuses the result. With repeated views at this scale, moving frame extraction off the request path was not a micro-optimization. Bulk Downloads Users often needed dozens of assets at once, sometimes hundreds for offline review or handoff. Before the proxy, the frontend had to fan out into many concurrent downloads. Browsers started warning, connections saturated, and individual files timed out without a useful error. After the proxy was in place, bulk download became a normal code path instead of a frontend juggling act. The client sends multiple UUIDs on the download action, the proxy resolves them, and the file service streams the blobs into a single zip archive. It writes each blob into the archive as it arrives rather than buffering the whole thing in memory, so memory stays flat whether the user asked for five files or five hundred. Two costs come with streaming, and you should know you're paying them. First, there is no Content-Length header up front (MDN on the Content-Length header), so the browser can't show a reliable download progress bar. Second, since the 200 status goes out when the stream opens, a blob fetch failing mid-stream can't change the response code. The connection just drops, and the client sees a truncated zip. That was an intentional trade-off: worse progress signaling in exchange for predictable memory usage. It makes sense when large downloads are common, and memory pressure matters more than perfect browser UX. One endpoint, three behaviors, picked by the action parameter. 
Simplified routing logic: Python def handle_media_request(request): action = request.query_params.get('action') uuids = request.query_params.get('id').split(',') uuids = list(dict.fromkeys(uuids)) # dedupe, preserve request order # ONE batched lookup, scoped to the caller's tenant. The WHERE clause # filters on tenant_id, so a UUID owned by another tenant simply isn't # in the result set. This is the authorization check. rows = db.get_media_metadata_batch(uuids, tenant_id=request.user.tenant_id) # Any requested UUID we didn't get back is either unknown or cross-tenant. if len(rows) != len(uuids): return NotFound() by_uuid = {row.media_uuid: row for row in rows} # Token scoped to exactly the blobs we're about to serve, not the whole # user. Short-lived, travels with every file service call below. blob_scope = [r.media_blob_id for r in rows] + \ [r.thumbnail_blob_id for r in rows if r.thumbnail_blob_id] auth_token = generate_delegated_access_token(request.user, blob_scope) if action == 'display' and len(uuids) == 1: meta = by_uuid[uuids[0]] return FileResponse( blob_id=meta.media_blob_id, auth_token=auth_token, headers={'Content-Type': get_mime_type(meta.media_format)} ) elif action == 'thumbnail' and len(uuids) == 1: meta = by_uuid[uuids[0]] if meta.thumbnail_blob_id is None: # Enrichment pass hasn't landed yet, serve a placeholder return FileResponse( blob_id=PLACEHOLDER_BLOB_ID, auth_token=auth_token, headers={'Content-Type': 'image/jpeg'} ) return FileResponse( blob_id=meta.thumbnail_blob_id, auth_token=auth_token, headers={'Content-Type': 'image/jpeg'} ) elif action == 'download': if len(uuids) == 1: meta = by_uuid[uuids[0]] return FileResponse( blob_id=meta.media_blob_id, auth_token=auth_token, headers={ 'Content-Disposition': f'attachment; filename={uuids[0]}' } ) else: # File service streams these blobs into one zip on the fly blob_ids = [by_uuid[u].media_blob_id for u in uuids] return ZipArchiveResponse( blob_ids=blob_ids, auth_token=auth_token, 
headers={'Content-Type': 'application/zip'} ) return BadRequest() # unknown action, or display/thumbnail with >1 id

Production Notes

The proxy has been running in production for over a year. The broken email images, dead links, and access control gaps are resolved. A few things I'd flag for anyone building something similar:

Cache the UUID-to-blob mapping. The media_uuid --> media_blob_id mapping is immutable, and a 60-second in-memory TTL gave us a 95%+ hit rate. Why only 60 seconds for immutable data? Because immutable isn't permanent. Assets get deleted for takedowns and customer offboarding, and the TTL bounds how long a deleted asset can still be served from cache. I skipped caching initially and paid for it with tail latency spikes once traffic grew.

Separate your ingest stages. Fetching bytes and generating thumbnails in the same worker means one slow video frame extraction blocks the whole queue. Do the fast store step first, then generate thumbnails asynchronously. If enrichment is still catching up, serve a placeholder instead of failing the request.

Size your connection pools before you need to. The proxy holds connections to both the database and object storage, and under sustained load, the default pool sizes ran out. Requests queued, timeouts triggered retries, and the retries added more pressure to the same bottleneck. It is a classic incident loop that never shows up in design docs.

If you expect media volume to grow, put the proxy in before direct URLs spread through every client. The extra hop cost us a few milliseconds, but it gave us one place to enforce auth, serve thumbnails, package downloads, and hide storage migrations. Retrofitting that after a customer incident is far more expensive.
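The caching note above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the names `TTLCache`, `resolve_blob_id`, and `load_from_db` are hypothetical, and the cache key includes the tenant on the assumption that the tenant-scoped lookup is the authorization check, so a cache keyed by UUID alone would bypass it.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, str]  # (tenant_id, media_uuid)


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[CacheKey, Tuple[float, str]] = {}

    def get(self, key: CacheKey) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so a deleted asset stops being served.
            del self._entries[key]
            return None
        return value

    def put(self, key: CacheKey, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def resolve_blob_id(tenant_id: str, media_uuid: str, cache: TTLCache,
                    load_from_db: Callable[[str, str], Optional[str]]
                    ) -> Optional[str]:
    """Return the blob ID for a tenant's UUID, checking the cache first."""
    key = (tenant_id, media_uuid)
    cached = cache.get(key)
    if cached is not None:
        return cached
    blob_id = load_from_db(tenant_id, media_uuid)  # hypothetical DB lookup
    if blob_id is not None:
        cache.put(key, blob_id)  # misses are not cached in this sketch
    return blob_id
```

Not caching misses is a deliberate simplification here: it keeps an unknown or cross-tenant UUID from occupying memory, at the cost of a database round trip for every such request.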

By Deepak Gupta

Azure Databricks vs Microsoft Fabric: An Honest Guide to When to Use What

If you're building a data platform on Azure in 2026, you're going to be asked this question: Azure Databricks or Microsoft Fabric? Both run on Delta Lake, both integrate with ADLS Gen2, both have Spark, and both promise to be your unified data platform. The overlap is real, and the marketing doesn't help. This post is an honest breakdown of where each genuinely excels, where they overlap, and how to decide without getting lost in feature comparison tables.

Architecture Comparison

Decision Flow

Detailed Capability Comparison

| Capability | Azure Databricks | Microsoft Fabric | Winner |
| --- | --- | --- | --- |
| Spark engine | Full Spark, Photon, tunable | Spark via Notebooks, less tunable | Databricks |
| Delta Lake | Native, full control | Via OneLake (Delta Parquet) | Tie |
| MLflow / MLOps | Native, full MLflow stack | Basic experiment tracking | Databricks |
| Model serving | Databricks Model Serving | Azure ML integration | Databricks |
| Power BI integration | DirectQuery via SQL Warehouse | Direct Lake (zero-copy, faster) | Fabric |
| SQL analytics | Serverless SQL Warehouse + Photon | SQL Analytics Endpoint | Tie |
| Data pipelines | Delta Live Tables, Workflows | Data Factory pipelines (mature) | Tie |
| Real-time intelligence | Spark Streaming + Kafka | Eventstream + KQL Database | Fabric |
| Setup complexity | Medium-high | Low (SaaS) | Fabric |
| Fine-grained governance | Unity Catalog (mature) | Purview integration (growing) | Databricks |
| Cost model | DBU + VM | Fabric capacity units | Comparable |
| Open format portability | High (standard Delta/Parquet) | Medium (OneLake but some lock-in) | Databricks |

Step 1 — Reading Data from Fabric OneLake in Azure Databricks

The good news: Fabric and Databricks can share data via OneLake, which speaks Delta format. You don't have to pick one and abandon the other.
Python # Azure Databricks reading from Microsoft Fabric OneLake # OneLake exposes an ABFS-compatible endpoint # Authenticate using the workspace's Managed Identity or Service Principal tenant_id = dbutils.secrets.get("kv-scope", "sp-tenant-id") client_id = dbutils.secrets.get("kv-scope", "sp-client-id") client_secret = dbutils.secrets.get("kv-scope", "sp-client-secret") # OneLake uses the same ABFS protocol as ADLS Gen2 fabric_workspace_id = "your-fabric-workspace-guid" lakehouse_name = "your-lakehouse-name" onelake_host = "onelake.dfs.fabric.microsoft.com" spark.conf.set(f"fs.azure.account.auth.type.{onelake_host}", "OAuth") spark.conf.set(f"fs.azure.account.oauth.provider.type.{onelake_host}", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set(f"fs.azure.account.oauth2.client.id.{onelake_host}", client_id) spark.conf.set(f"fs.azure.account.oauth2.client.secret.{onelake_host}", client_secret) spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{onelake_host}", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token") # Read a Delta table from Fabric Lakehouse fabric_path = f"abfss://{fabric_workspace_id}@{onelake_host}/{lakehouse_name}.Lakehouse/Tables/sales_gold" fabric_df = spark.read.format("delta").load(fabric_path) print(f"Rows from Fabric Lakehouse: {fabric_df.count()}") fabric_df.show(5) Step 2 — Writing Databricks Results Back to OneLake Run heavy ML feature engineering in Databricks, write results back to OneLake so Fabric Power BI can consume them via Direct Lake — zero-copy, sub-second dashboard refresh. 
Python from pyspark.sql.functions import current_timestamp, lit # Run your Databricks feature engineering / ML inference here result_df = spark.table("production.gold.churn_predictions") \ .withColumn("_computed_at", current_timestamp()) \ .withColumn("_source", lit("databricks-inference-job")) # Write back to Fabric OneLake as Delta output_path = f"abfss://{fabric_workspace_id}@{onelake_host}/{lakehouse_name}.Lakehouse/Tables/churn_predictions" result_df.write \ .format("delta") \ .mode("overwrite") \ .option("overwriteSchema", "true") \ .save(output_path) print(f"Written {result_df.count()} rows to Fabric OneLake.") print("Power BI Direct Lake will pick up changes automatically.") Step 3 — When to Use Fabric Notebooks vs Databricks Notebooks Not everything needs Databricks. Fabric Notebooks are good enough for lighter data prep that feeds Power BI reports. Python # This kind of transformation is fine in Fabric Notebooks # Use Fabric when: output goes directly to Power BI, team is analytics-focused, # no MLflow tracking needed, data volume < 100GB # Fabric Notebook (PySpark — same syntax as Databricks) from pyspark.sql.functions import col, sum as _sum, date_trunc df = spark.read.format("delta").load("Tables/sales_silver") summary = df \ .withColumn("month", date_trunc("month", col("sale_ts"))) \ .groupBy("month", "region", "product_category") \ .agg(_sum("revenue").alias("monthly_revenue")) \ .orderBy("month", "region") # Write to Lakehouse table — Power BI picks it up via Direct Lake summary.write.format("delta").mode("overwrite").saveAsTable("monthly_revenue_summary") # Use Databricks when: MLflow tracking needed, complex ML pipeline, # Unity Catalog governance required, data volume > 1TB, streaming workloads When to Use Which: Decision Framework Python # Use this as a mental checklist when deciding DATABRICKS_STRENGTHS = [ "Complex ML pipelines with MLflow experiment tracking", "Production model serving with A/B testing", "Fine-grained governance via Unity 
Catalog (row/column security)", "Spark Structured Streaming with Kafka / Event Hub", "Very large scale ETL (multi-TB, complex joins)", "Open-source tool integrations (dbt, Great Expectations, etc.)", "Multi-cloud or portability requirements", ] FABRIC_STRENGTHS = [ "Power BI as the primary consumption layer (Direct Lake = fastest)", "Analytics-focused teams without deep Spark expertise", "Microsoft 365 integration (Teams, SharePoint data sources)", "Real-time dashboards via Eventstream + KQL", "Fabric Data Factory for straightforward ELT pipelines", "Lower operational overhead — fully SaaS managed", "Already licensed via Microsoft 365 E5 / Fabric capacity", ] BOTH_TOGETHER = [ "Heavy ML/MLOps in Databricks, results published to OneLake for Power BI", "Fabric Data Factory for ingestion, Databricks for complex transformation", "Unity Catalog governing Databricks tables, Fabric consuming via shortcuts", ] Things to Watch in Production OneLake shortcuts are the integration bridge. Fabric Lakehouses support shortcuts that point to external Delta tables in ADLS Gen2 — the same storage Databricks writes to. This means Databricks writes once and Fabric reads without data movement. Set up shortcuts rather than copying data between platforms. Unity Catalog doesn't govern Fabric. Your row-level security and column masks in Unity Catalog do not apply when Fabric reads the same underlying Delta files directly. If governance is critical, either run everything through Databricks or replicate governance rules in Fabric's permission model. Fabric capacity units and Databricks DBUs are both usage-based but measure differently. Don't try to compare them directly. Run the same workload in both and compare wall-clock time and cost on your actual data sizes. Fabric ML is improving fast but isn't MLflow. As of early 2026, Fabric ML experiment tracking is functional but doesn't have the depth of MLflow's model registry, artifact storage, or model serving. 
If MLOps maturity matters, stay on Databricks for ML. Wrapping Up The honest answer is: most mature Azure data platforms in 2026 use both. Azure Databricks for ML, complex transformations, governance, and streaming. Microsoft Fabric for Power BI-first analytics, simpler pipelines, and teams that don't need the full Databricks stack. OneLake shortcuts and the shared Delta format make them composable rather than competitive. Pick based on your primary consumer: if it's Power BI dashboards, start with Fabric. If it's ML models and data products, start with Databricks. When you need both, they integrate cleanly. References Microsoft Fabric DocumentationOneLake — The OneDrive for DataFabric Lakehouse vs Azure DatabricksDirect Lake in Power BIOneLake ShortcutsAzure Databricks and Microsoft Fabric IntegrationUnity Catalog vs Fabric Data GovernanceFabric Eventstream — Real-Time Intelligence

By Jubin Abhishek Soni

CORE

A Step-by-Step Guide to Implementing Columnar Tables in SQL Server

Columnstore indexes first appeared in SQL Server 2012 as read-only nonclustered indexes, SQL Server 2014 added the updatable clustered columnstore index, and SQL Server 2016 extended the feature further. They are a separate feature from In-Memory OLTP. Columnstore is designed for data warehousing and analytical workloads, where large amounts of data need to be scanned, aggregated, or analyzed efficiently. It stores data column by column rather than in the traditional row-wise layout, which offers significant performance benefits for read-heavy operations such as reporting and analytics. Key Benefits of Columnar Storage Faster read performance: Optimized for analytics where only a few columns are needed in a query. Compression: Since column data is homogeneous, it achieves high compression rates, saving storage space. Improved query performance: Aggregating or scanning specific columns is much faster in a columnar format, especially with large datasets. Setting Up Columnar Tables in SQL Server SQL Server implements columnar storage through the Columnstore Index. The Columnstore Index is a special kind of index used in large data tables where the data is stored in columns rather than rows. The clustered columnstore index (CCI) is the preferred method when creating columnar tables. Step 1: Create a Sample Table Let's start by creating a table with a large number of rows, which we will populate with random data to demonstrate the difference between row-store and column-store formats. SQL -- Creating a traditional Rowstore Table CREATE TABLE SalesData_RowStore ( SalesOrderID INT, ProductID INT, Quantity INT, SalesAmount DECIMAL(18, 2), OrderDate DATE ); Step 2: Insert Data Into Rowstore Table For the sake of performance demonstration, we will generate a large set of random data.
MS SQL -- Generate a large set of random data for Rowstore Table DECLARE @Counter INT = 0; WHILE @Counter < 1000000 BEGIN INSERT INTO SalesData_RowStore (SalesOrderID, ProductID, Quantity, SalesAmount, OrderDate) VALUES (FLOOR(RAND() * 1000) + 1, FLOOR(RAND() * 100) + 1, FLOOR(RAND() * 100) + 1, FLOOR(RAND() * 500) + 1, DATEADD(DAY, FLOOR(RAND() * 365) + 1, GETDATE())); SET @Counter = @Counter + 1; END Implementing Columnstore Index (Columnar Table) Step 1: Create a Columnstore Table Now, let's create a table with a clustered columnstore index (CCI). This index allows SQL Server to store the data in a columnar format. MS SQL -- Creating a Columnstore Table with Clustered Columnstore Index CREATE TABLE SalesData_ColumnStore ( SalesOrderID INT, ProductID INT, Quantity INT, SalesAmount DECIMAL(18, 2), OrderDate DATE ); MS SQL CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesData ON SalesData_ColumnStore; Step 2: Insert the Same Data Into the Columnstore Table You can insert the same large dataset into the columnar table in the same way. SQL -- Insert data into Columnstore Table DECLARE @Counter INT = 0; WHILE @Counter < 1000000 BEGIN INSERT INTO SalesData_ColumnStore (SalesOrderID, ProductID, Quantity, SalesAmount, OrderDate) VALUES (FLOOR(RAND() * 1000) + 1, FLOOR(RAND() * 100) + 1, FLOOR(RAND() * 100) + 1, FLOOR(RAND() * 500) + 1, DATEADD(DAY, FLOOR(RAND() * 365) + 1, GETDATE())); SET @Counter = @Counter + 1; END Query Performance Without Columnar Index Let's execute a typical query that aggregates data by ProductID and OrderDate. This will involve scanning through a large amount of data in the rowstore table. MS SQL -- Query on Rowstore Table SELECT ProductID, SUM(SalesAmount) AS TotalSales FROM SalesData_RowStore WHERE OrderDate > '2023-01-01' GROUP BY ProductID; Expected Outcome The query will scan all the rows in the table. Rowstore tables are not optimized for this type of query, and the performance might degrade with large datasets due to the need to read each row. 
Query Performance With Columnar Index Let's run the same query on the columnar table using a Clustered Columnstore Index. MS SQL -- Query on Columnstore Table SELECT ProductID, SUM(SalesAmount) AS TotalSales FROM SalesData_ColumnStore WHERE OrderDate > '2023-01-01' GROUP BY ProductID; Expected Outcome The columnar index stores the data by columns, so SQL Server can read only the columns the query references (here ProductID, SalesAmount, and OrderDate) and skip SalesOrderID and Quantity. Columnstore indexes are highly optimized for these types of queries, resulting in much faster query execution time. One caveat for this demo: rows inserted one at a time land in a delta rowgroup, which SQL Server compresses into columnar segments only once it fills at 1,048,576 rows. After 1,000,000 single-row inserts the data may still sit in that open rowgroup, so consider running ALTER INDEX CCI_SalesData ON SalesData_ColumnStore REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON) before measuring. Comparing the Performance of Both Scenarios To compare the performance of the two scenarios, we will execute both queries and check their I/O statistics and query duration. Step 1: Query Execution Plan Without Columnstore You can use the following statements to capture I/O and timing statistics for the rowstore table. MS SQL -- Displaying I/O and timing statistics for Rowstore Table SET STATISTICS IO ON; SET STATISTICS TIME ON; SELECT ProductID, SUM(SalesAmount) AS TotalSales FROM SalesData_RowStore WHERE OrderDate > '2023-01-01' GROUP BY ProductID; SET STATISTICS IO OFF; SET STATISTICS TIME OFF; This will provide information on:
Logical reads: The number of data pages read from the buffer cache (physical reads are the pages that had to come from disk)
CPU time: How much CPU time was consumed
Elapsed time: The total time taken to execute the query
Step 2: Query Execution Plan With Columnstore Now, execute the same for the columnstore table. MS SQL -- Displaying I/O and timing statistics for Columnstore Table SET STATISTICS IO ON; SET STATISTICS TIME ON; SELECT ProductID, SUM(SalesAmount) AS TotalSales FROM SalesData_ColumnStore WHERE OrderDate > '2023-01-01' GROUP BY ProductID; SET STATISTICS IO OFF; SET STATISTICS TIME OFF; In the statistics output for the columnstore table, SQL Server will typically report fewer reads and significantly lower CPU time, as it only scans the necessary columns.
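To make the "reads only the necessary columns" point concrete, here is a toy model in Python. It is only an illustration of the two layouts, not a description of how SQL Server's storage engine works; the function names and the "values touched" counter are assumptions made for this sketch. It runs the same aggregation as the query above against a row-wise list and a column-wise dictionary and counts how many stored values each approach has to look at.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# Column order mirrors the SalesData tables in this article.
COLUMNS = ["SalesOrderID", "ProductID", "Quantity", "SalesAmount", "OrderDate"]
Row = Tuple[int, int, int, Decimal, str]  # OrderDate kept as an ISO date string


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Pivot row-wise data into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def total_sales_rowstore(rows: List[Row], cutoff: str):
    """Aggregate by ProductID; every row is read in full."""
    totals: Dict[int, Decimal] = {}
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # all five values come along with the row
        _, product_id, _, amount, order_date = row
        if order_date > cutoff:  # ISO dates compare correctly as strings
            totals[product_id] = totals.get(product_id, Decimal(0)) + amount
    return totals, values_touched


def total_sales_columnstore(columns: Dict[str, list], cutoff: str):
    """Aggregate by ProductID; only the three referenced columns are read."""
    needed = ["ProductID", "SalesAmount", "OrderDate"]
    values_touched = sum(len(columns[name]) for name in needed)
    totals: Dict[int, Decimal] = {}
    for product_id, amount, order_date in zip(*(columns[n] for n in needed)):
        if order_date > cutoff:
            totals[product_id] = totals.get(product_id, Decimal(0)) + amount
    return totals, values_touched
```

Both functions return the same totals, but the column-wise version touches three values per row instead of five. On a real table the gap widens with the number of columns the query does not reference.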
Performance Improvements in Columnar Tables Scenario 1: Data Compression Columnar storage achieves higher compression rates because data is stored in homogeneous chunks, which makes it more efficient in terms of storage. Compression reduces disk I/O during query execution. Scenario 2: Selective Column Scanning When querying only a few columns, columnar storage avoids scanning the entire row. In contrast, rowstore requires scanning all columns in every row, even if only a subset is required for the query. Conclusion In this example, we demonstrated how implementing columnstore indexes in SQL Server can significantly improve query performance, especially for analytics and aggregation queries on large datasets. The comparison showed that columnar storage excels in reducing query times by optimizing disk I/O, leveraging data compression, and selectively reading only the necessary columns. As a result, columnstore indexing is a great choice for data warehousing or any scenario where read performance for large datasets is critical.

By arvind toorpu

CORE

HTTP QUERY in Java: The Missing Method for Complex REST API Searches

HTTP methods in REST API design are more than technical details; they communicate intent between clients and servers. A GET request instructs the server to retrieve a resource. A POST request typically indicates that data should be processed, often creating a new resource. PUT indicates replacement or update, while DELETE signals removal. These methods are well-established and fundamental to the Web. Despite this, API design has long faced a notable gap. Challenges arise when a client needs to retrieve data using queries too complex for a URL. Filters such as destination, price range, availability, category, user preferences, pagination, sorting, and business rules can be added as query parameters, but this often results in lengthy, hard-to-read URLs that are difficult to maintain and may not be suitable for sensitive or structured data. For years, the common workaround was to use POST for search operations: HTTP POST /trips/search Content-Type: application/json However, this approach does not align with HTTP semantics. Searching is typically a safe operation that does not alter server state and is often idempotent, producing the same result if the underlying data remains unchanged. POST does not convey this intent, which complicates caching and retries and makes the API documentation less precise. The new HTTP QUERY method addresses this specific need. The QUERY method provides a dedicated way to send structured request content while indicating that the operation is safe and idempotent. It functions similarly to GET but allows the client to include a request body, as with POST. According to RFC 10008, a QUERY request asks the target resource to process the enclosed content safely and idempotently, then return the result. This is important because modern APIs often require more than simple resource retrieval.
Use cases such as search screens, dashboards, reporting APIs, recommendation engines, GraphQL-like endpoints, analytics filters, and domain-specific query languages demand more expressive input than URLs can provide, yet do not represent state-changing operations. QUERY also treats HTTP as an application protocol, not merely as a transport tunnel. The word “method” in HTTP is important: it defines the request's semantic purpose. QUERY continues that tradition by giving read-oriented complex operations their own explicit place in the protocol. This article will examine the purpose of QUERY, its differences from GET and POST, appropriate use cases, and its potential to enhance modern Java REST API design. Building the Sample Project With the importance of the HTTP QUERY method established, let’s transition from concept to implementation. This sample uses a travel agency domain. The goal is to expose a search operation that allows clients to filter travel offers by city, travel type, and price range. In such cases, a traditional GET URL can become cumbersome, while using POST is not semantically appropriate. The project uses Helidon 4.5.0, generated from the official Helidon Starter. Helidon MP provides a MicroProfile/Jakarta-oriented programming model suitable for this example. After generating the project, configure the dependencies. This sample uses Eclipse JNoSQL 1.2.0-M1 to demonstrate Jakarta NoSQL and Jakarta Data integration.
XML <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <project.report.sourceEncoding>UTF-8</project.report.sourceEncoding> <maven.compiler.release>21</maven.compiler.release> <jnosql.version>1.2.0-M1</jnosql.version> </properties> <dependencies> <dependency> <groupId>org.eclipse.jnosql.databases</groupId> <artifactId>jnosql-oracle-nosql</artifactId> <version>${jnosql.version}</version> </dependency> <dependency> <groupId>org.eclipse.jnosql.metamodel</groupId> <artifactId>mapping-metamodel-processor</artifactId> <version>${jnosql.version}</version> <scope>provided</scope> </dependency> </dependencies> Two key dependencies are included. The first dependency enables Eclipse JNoSQL integration with Oracle NoSQL, which supports key-value and document-oriented models. This example uses the document model, as the travel entity maps naturally to a document structure. The second dependency enables the metamodel processor, which generates the _Travel class. This allows us to use type-safe Jakarta Data restrictions, such as _Travel.city.equalTo(city), for fluent and dynamic queries. Creating the Domain Entity With dependencies configured, we can define the domain model. The Travel entity represents each travel offer, including an identifier, destination city, travel type, and price. Jakarta NoSQL supports entities as regular classes or Java records. For this small, immutable sample, a Java record is appropriate. Java import jakarta.nosql.Column; import jakarta.nosql.Entity; import jakarta.nosql.Id; import java.math.BigDecimal; import java.util.UUID; @Entity public record Travel( @Id UUID id, @Column String city, @Column TravelType type, @Column BigDecimal price) { } These annotations are similar to those in Jakarta Persistence. While the vocabulary differs, the intent remains familiar. Java import jakarta.nosql.Column; import jakarta.nosql.Entity; import jakarta.nosql.Id; @Entity marks the record as persistent. 
@Id specifies the primary identifier. @Column maps fields to the NoSQL database. Now we define the travel category: Java public enum TravelType { BUSINESS, LEISURE } This approach provides a compact and expressive domain model for the remainder of the article. Creating the Repository With Jakarta Data The next step is to create the bridge between Java and the database. We use Jakarta Data, which offers a repository-oriented programming model compatible with various persistence technologies. In this sample, the repository uses Eclipse JNoSQL and Oracle NoSQL, while the programming model remains domain-focused. Java import jakarta.data.repository.BasicRepository; import jakarta.data.repository.Find; import jakarta.data.repository.Repository; import jakarta.data.restrict.Restriction; import java.util.List; import java.util.UUID; @Repository public interface TravelRepository extends BasicRepository<Travel, UUID> { @Find List<Travel> query(Restriction<Travel> restriction); default boolean isEmpty() { return countBy() == 0; } long countBy(); } The repository extends BasicRepository<Travel, UUID>, which gives us common persistence operations such as saving entities and retrieving data. The key method for this article is: Java @Find List<Travel> query(Restriction<Travel> restriction); A Restriction<Travel> represents a dynamic query condition, which aligns well with the HTTP QUERY scenario. Clients can send various combinations of filters, such as city, price, travel type, or all of them. Instead of creating separate repository methods for each combination, we build queries dynamically. The isEmpty() method is a convenience for loading initial data at application startup: Java default boolean isEmpty() { return countBy() == 0; } long countBy(); Creating the Filter Request DTO Next, create the object that represents the request. Here, the HTTP QUERY method is useful. Instead of encoding each filter in the URL, we send a structured request body. 
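Before looking at the server-side types, it may help to see what such a request looks like on the wire. The sketch below is a language-neutral illustration in Python using only the standard library, not part of the Java sample: the `/travels` path, the in-memory `OFFERS` list, and the tiny local server standing in for the Helidon endpoint are all assumptions made for the example. The filter field names (city, type, minPrice, maxPrice) follow the request described in this article.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory offers standing in for the database.
OFFERS = [
    {"city": "Paris", "type": "LEISURE", "price": 2000.00},
    {"city": "Rome", "type": "LEISURE", "price": 2500.00},
    {"city": "Tokyo", "type": "BUSINESS", "price": 3000.00},
]


def matches_filter(offer: dict, criteria: dict) -> bool:
    """Apply only the filters that are present in the request body."""
    if "city" in criteria and offer["city"] != criteria["city"]:
        return False
    if "type" in criteria and offer["type"] != criteria["type"]:
        return False
    if "minPrice" in criteria and offer["price"] < criteria["minPrice"]:
        return False
    if "maxPrice" in criteria and offer["price"] > criteria["maxPrice"]:
        return False
    return True


class TravelHandler(BaseHTTPRequestHandler):
    def do_QUERY(self):  # http.server dispatches on "do_" + method name
        length = int(self.headers.get("Content-Length", 0))
        criteria = json.loads(self.rfile.read(length) or b"{}")
        payload = json.dumps(
            [o for o in OFFERS if matches_filter(o, criteria)]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass


def send_query(host: str, port: int, path: str, criteria: dict):
    """Send a QUERY request whose body carries the structured filter."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("QUERY", path, body=json.dumps(criteria).encode("utf-8"),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()


def run_demo():
    """Start the toy server on a free port, send one QUERY, and shut down."""
    server = HTTPServer(("127.0.0.1", 0), TravelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return send_query("127.0.0.1", server.server_address[1], "/travels",
                          {"type": "LEISURE", "maxPrice": 2200})
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` sends a body containing only `type` and `maxPrice`, which mirrors the idea that any combination of filters may be present. Older clients, proxies, or gateways may not pass an unfamiliar method through, so it is worth checking each hop before relying on QUERY.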
The Java representation is as follows: Java import expert.os.demos.travel.infrastructure.FieldVisibilityStrategy; import jakarta.json.bind.annotation.JsonbVisibility; import java.math.BigDecimal; import java.util.Optional; @JsonbVisibility(value = FieldVisibilityStrategy.class) public class TravelFilterRequest { private String city; private TravelType type; private BigDecimal minPrice; private BigDecimal maxPrice; public Optional<String> city() { return Optional.ofNullable(city); } public Optional<TravelType> type() { return Optional.ofNullable(type); } public Optional<BigDecimal> minPrice() { return Optional.ofNullable(minPrice); } public Optional<BigDecimal> maxPrice() { return Optional.ofNullable(maxPrice); } @Override public String toString() { return "TravelRequest{" + "city='" + city + '\'' + ", type=" + type + ", minPrice=" + minPrice + ", maxPrice=" + maxPrice + '}'; } } A key design choice is that each accessor returns an Optional. This approach makes the filtering logic explicit. Fields may or may not be present in the request, so the service applies each restriction only when a value exists, avoiding null checks. The @JsonbVisibility annotation customizes how JSON-B accesses fields. Here, the DTO keeps fields private and exposes domain-friendly accessor methods such as city(), type(), minPrice(), and maxPrice(). Creating the Travel Service Now, we can create the service layer. In this sample, the service has two responsibilities: First, it loads initial travel data if the repository is empty. Second, it converts incoming filter requests into Jakarta Data restrictions. 
```java
import jakarta.annotation.PostConstruct;
import jakarta.data.restrict.Restrict;
import jakarta.data.restrict.Restriction;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.jnosql.mapping.Database;
import org.eclipse.jnosql.mapping.DatabaseType;

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.logging.Logger;

@ApplicationScoped
public class TravelService {

    private static final Logger LOGGER = Logger.getLogger(TravelService.class.getName());

    private final TravelRepository travelRepository;

    @Inject
    public TravelService(@Database(DatabaseType.DOCUMENT) TravelRepository travelRepository) {
        this.travelRepository = travelRepository;
    }

    // No-argument constructor required by CDI for proxying
    TravelService() {
        this.travelRepository = null;
    }

    @PostConstruct
    void load() {
        if (travelRepository.isEmpty()) {
            LOGGER.info("[TRAVEL SERVICE] Loading initial travel data...");
            travelRepository.save(new Travel(
                    UUID.randomUUID(), "New York", TravelType.BUSINESS, new BigDecimal("1500.00")));
            travelRepository.save(new Travel(
                    UUID.randomUUID(), "Paris", TravelType.LEISURE, new BigDecimal("2000.00")));
            travelRepository.save(new Travel(
                    UUID.randomUUID(), "Tokyo", TravelType.BUSINESS, new BigDecimal("3000.00")));
            travelRepository.save(new Travel(
                    UUID.randomUUID(), "Sydney", TravelType.LEISURE, new BigDecimal("1800.00")));
            travelRepository.save(new Travel(
                    UUID.randomUUID(), "Rome", TravelType.LEISURE, new BigDecimal("2500.00")));
        } else {
            LOGGER.info("[TRAVEL SERVICE] Travel data already loaded.");
        }
    }

    public List<Travel> search(TravelFilterRequest filter) {
        LOGGER.info("[TRAVEL SERVICE] Searching for travels with filter: " + filter);

        if (filter == null) {
            return travelRepository.findAll().toList();
        }

        // Add one restriction per filter field that is actually present
        List<Restriction<Travel>> restrictions = new ArrayList<>();

        filter.city()
                .ifPresent(city -> restrictions.add(_Travel.city.equalTo(city)));
        filter.type()
                .ifPresent(type -> restrictions.add(_Travel.type.equalTo(type)));
        filter.minPrice()
                .ifPresent(minPrice -> restrictions.add(_Travel.price.greaterThanEqual(minPrice)));
        filter.maxPrice()
                .ifPresent(maxPrice -> restrictions.add(_Travel.price.lessThanEqual(maxPrice)));

        // Combine all collected restrictions with AND semantics
        return travelRepository.query(
                Restrict.all(restrictions.toArray(new Restriction[0])));
    }
}
```

Enabling the HTTP QUERY Method With JAX-RS

The final step is to enable the HTTP QUERY method in the Java API. JAX-RS simplifies this process. The specification allows custom HTTP method annotations using @HttpMethod. Since QUERY is now an HTTP method, we can create a concise annotation and apply it directly in the resource class.

```java
import jakarta.ws.rs.HttpMethod;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Target({ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@HttpMethod("QUERY")
public @interface QUERY {
}
```

This annotation functions similarly to the built-in JAX-RS annotations such as @GET, @POST, @PUT, and @DELETE. The key line is:

```java
@HttpMethod("QUERY")
```

This informs JAX-RS that methods annotated with @QUERY handle incoming HTTP requests using the QUERY method.

At this stage, the API expresses the operation more clearly. POST is no longer used as a workaround for search. This endpoint now receives a query payload and returns a result without altering server state.

Exposing the Travel Resource

We can now expose the resource.
```java
package expert.os.demos.travel;

import expert.os.demos.travel.infrastructure.QUERY;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

import java.util.List;
import java.util.logging.Logger;

@ApplicationScoped
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
@Path("/travels")
public class TravelResource {

    private static final Logger LOGGER = Logger.getLogger(TravelResource.class.getName());

    @Inject
    private TravelService travelService;

    @QUERY
    public List<Travel> search(TravelFilterRequest filter) {
        LOGGER.info("Searching travels with filter: " + filter);
        return travelService.search(filter);
    }
}
```

The resource is intentionally small. Its job is not to understand database details or build dynamic queries. It only receives the HTTP request, delegates the filter to the service, and returns the result.

```java
@QUERY
public List<Travel> search(TravelFilterRequest filter) {
    LOGGER.info("Searching travels with filter: " + filter);
    return travelService.search(filter);
}
```

This method is the article's focal point.

Executing the Request

With the service running, test the endpoint using a client that supports custom HTTP methods. For example, in Postman:

```http
QUERY http://localhost:8081/travels
Content-Type: application/json

{
  "type": "BUSINESS",
  "maxPrice": 2800
}
```

Alternatively, in Postman-style command form:

```shell
postman request QUERY 'http://localhost:8081/travels' \
  --header 'Content-Type: application/json' \
  --body '{"type": "BUSINESS", "maxPrice": 2800}'
```

Given the initial data loaded by the service, this request retrieves business trips with a price of 2800 or less.
The matching result should include New York:

```json
[
  {
    "id": "generated-uuid",
    "city": "New York",
    "type": "BUSINESS",
    "price": 1500.00
  }
]
```

Tokyo is also a business trip, but its price is 3000.00, so it does not match the maxPrice filter.

This example demonstrates the value of QUERY: the client can send a structured request body, the server preserves read-oriented semantics, and the backend maps the request directly into a dynamic Jakarta Data query.
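To sanity-check the filter semantics without a running Jakarta EE stack, the optional-filter logic can be sketched in plain Python. This is an illustrative stand-in, not part of the sample project: the `search` function mimics how the service combines only the filters that are present, the seed rows are copied from the service's `load()` method, and `send_query` is a hypothetical client helper that relies on the standard library's `http.client`, which accepts arbitrary method names such as QUERY.

```python
import json
from decimal import Decimal
from http.client import HTTPConnection

# Seed rows copied from TravelService.load(); generated ids are omitted.
TRAVELS = [
    {"city": "New York", "type": "BUSINESS", "price": Decimal("1500.00")},
    {"city": "Paris", "type": "LEISURE", "price": Decimal("2000.00")},
    {"city": "Tokyo", "type": "BUSINESS", "price": Decimal("3000.00")},
    {"city": "Sydney", "type": "LEISURE", "price": Decimal("1800.00")},
    {"city": "Rome", "type": "LEISURE", "price": Decimal("2500.00")},
]


def search(filter_body):
    """Apply only the filters present in the body, combined with AND."""
    predicates = []
    if filter_body.get("city") is not None:
        city = filter_body["city"]
        predicates.append(lambda t: t["city"] == city)
    if filter_body.get("type") is not None:
        travel_type = filter_body["type"]
        predicates.append(lambda t: t["type"] == travel_type)
    if filter_body.get("minPrice") is not None:
        min_price = Decimal(str(filter_body["minPrice"]))
        predicates.append(lambda t: t["price"] >= min_price)
    if filter_body.get("maxPrice") is not None:
        max_price = Decimal(str(filter_body["maxPrice"]))
        predicates.append(lambda t: t["price"] <= max_price)
    # An empty predicate list matches every row, like an unfiltered search
    return [t for t in TRAVELS if all(p(t) for p in predicates)]


def send_query(host, port, path, filter_body):
    """Hypothetical client helper: send the filter using the QUERY method."""
    body = json.dumps(filter_body).encode("utf-8")
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("QUERY", path, body=body,
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()


matches = search({"type": "BUSINESS", "maxPrice": 2800})
print([t["city"] for t in matches])  # ['New York']
```

Against the running service, a call such as `send_query("localhost", 8081, "/travels", {"type": "BUSINESS", "maxPrice": 2800})` should return the same single match, assuming the endpoint is deployed as described.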

By Otavio Santana

Parquet vs Lance: How Storage Layout Changes the Read Path

Apache Parquet became the default format for analytical data because it matched the read path of analytical engines. Queries scanned large parts of a dataset, often across a small set of columns, and Parquet was built to support that efficiently. Row groups, column pages, and compression all work well when the goal is to maximize scan throughput.

That model still fits a large part of analytics. But it starts to break down when queries read small subsets of data, especially when those reads are repeated. At that point, the cost is no longer dominated by scanning. It depends on how much data the reader must process before it can return the result.

That is where comparing Parquet with Lance becomes useful; the difference is not just in file format, but in the read path itself. The Lance paper frames this problem well by focusing on how structural encoding affects random access and scan performance.

Running the Examples Locally

All of the examples below can run on a laptop. Install the dependencies with:

```shell
pip install pandas pyarrow pylance numpy
```

The Python package is pylance, but it is imported as lance. The official Lance docs and Python SDK docs are useful if you want to explore the API surface further.

If you are using Homebrew Python on macOS and see an externally-managed-environment error, use a virtual environment instead:

```shell
python3 -m venv parquet-lance-demo
source parquet-lance-demo/bin/activate
pip install pandas pyarrow pylance numpy
```

Where the Difference Starts

Parquet and Lance are both columnar formats, but they are optimized around different kinds of access.

Parquet is built around scan-heavy workloads. Data is typically written once, stored in larger chunks, and read sequentially. That design improves compression and makes it easier for analytical engines to process large volumes of data efficiently.

Lance takes a different path.
It is designed for workloads where queries may repeatedly touch small parts of a dataset, where latency matters more, and where similarity search is part of the data access path rather than an external system layered on top.

This difference is easiest to understand with two concrete examples. The first is a selective filter. The second is vector similarity search.

Three Read Paths to Keep in Mind

The easiest way to compare Parquet and Lance is to start with the read path.

A scan-oriented read path is the classic analytical case. The query reads a meaningful portion of a dataset, usually across a subset of columns. Parquet performs well here because the reader can process row groups and column pages efficiently.

A selective read path behaves differently. The query may return only a few rows, but the reader still needs to identify where those rows live. If the format works mainly at chunk granularity, the system may read and decode more data than it returns.

A vector-native read path is different again. The query is not asking whether a row satisfies a predicate like id < 100. It is asking which rows are closest to a query vector. That requires an index-aware retrieval path, not only a scan path.

Use Case 1: Selective Reads

Start with something familiar: a basic filter.

```sql
WHERE id < 100
```

On a dataset with millions of rows, this returns almost nothing. The interesting part is not the result. The interesting part is how much work the system performs before it gets there.

To make that visible, I used the same benchmark on both formats. The script below generates a dataset, writes it to Parquet and Lance, and then runs the same filter several times.
```python
import os
import shutil
import time

import numpy as np
import pandas as pd
import pyarrow as pa
import lance

# Try 1M, 5M, 10M to observe scaling behavior
N = 5_000_000

# Small result set -> stresses selective access vs scan
FILTER_EXPR = "id < 100"

# Run multiple times; use best to reduce noise
RUNS = 5


def clean_outputs():
    """Remove files left over from a previous run."""
    if os.path.exists("data.parquet"):
        os.remove("data.parquet")
    if os.path.exists("data.lance"):
        shutil.rmtree("data.lance")


def time_it(name, fn, runs=RUNS):
    """Run fn several times, print all timings and the best one."""
    times = []
    result = None
    for _ in range(runs):
        start = time.time()
        result = fn()
        times.append(time.time() - start)
    print(f"{name} times:", times)
    print(f"{name} best:", min(times))
    return result


def main():
    clean_outputs()

    print(f"Generating dataset with {N} rows...")
    df = pd.DataFrame({
        "id": np.arange(N),
        "value": np.random.rand(N)
    })

    # Parquet write (scan-optimized)
    start = time.time()
    df.to_parquet("data.parquet")
    print("Parquet write time:", time.time() - start)

    # Lance write (Arrow-based)
    start = time.time()
    table = pa.Table.from_pandas(df)
    lance.write_dataset(table, "data.lance", mode="overwrite")
    print("Lance write time:", time.time() - start)

    dataset = lance.dataset("data.lance")

    # Selective filter: returns ~100 rows.
    # Note: the Parquet side loads the whole file into pandas and then
    # filters in memory, so no row groups are skipped.
    result_parquet = time_it(
        "Parquet filter",
        lambda: pd.read_parquet("data.parquet").query(FILTER_EXPR)
    )

    # The Lance side pushes the filter into the dataset read.
    result_lance = time_it(
        "Lance filter",
        lambda: dataset.to_table(filter=FILTER_EXPR).to_pandas()
    )

    print("Parquet rows:", len(result_parquet))
    print("Lance rows:", len(result_lance))


if __name__ == "__main__":
    main()
```

Here is a sample run from my laptop using 5 million rows:

```shell
(parquet-lance-demo) hitarth@hitarth % python parquet_vs_lance.py
Generating dataset with 5000000 rows...
Parquet write time: 0.14815711975097656
Lance write time: 0.09784913063049316
Parquet filter times: [0.09580063819885254, 0.04515504837036133, 0.03702282905578613, 0.04055309295654297, 0.03741908073425293]
Parquet filter best: 0.03702282905578613
Lance filter times: [0.0851907730102539, 0.012360095977783203, 0.009278059005737305, 0.008661031723022461, 0.007877826690673828]
Lance filter best: 0.007877826690673828
Parquet rows: 100
Lance rows: 100
```

The first Lance run was slower than the rest, but repeated runs stabilized quickly. The same pattern showed up across larger dataset sizes as well.

I also ran the benchmark at 1 million, 5 million, and 10 million rows. In all cases, the query returned 100 rows. These were the best-of-five timings:

| dataset size | parquet filter | lance filter |
|---|---|---|
| 1,000,000 rows | 0.033s | 0.010s |
| 5,000,000 rows | 0.037s | 0.008s |
| 10,000,000 rows | 0.073s | 0.016s |

The numbers matter less than the pattern. The result size stays constant, but Parquet’s time increases with dataset size while Lance remains relatively stable after the initial read. That points directly to a difference in the read path.

One caveat: the Parquet side of this benchmark reads the full file into pandas and filters in memory, so it measures a full read rather than a pruned one. Passing a filters argument to read_parquet lets the reader skip row groups using their statistics, although the remaining work is still done at row-group granularity.

On a laptop, this difference shows up as milliseconds. In a production data lake, the same pattern can become more expensive. Extra chunks do not only mean extra CPU. They can also mean additional object-store reads, decompression work, memory materialization, and network latency. A selective query over Parquet may return a tiny result set, but still pay part of the cost of scanning and decoding larger units of data.

That is the practical form of read amplification.

[Figure: read amplification in a selective Parquet query]

Why Parquet Takes This Path

Parquet is not just a file of columns. Internally, a file is organized into row groups; each row group contains one column chunk per column, and column chunks are divided into pages, as described in the Parquet concepts and file format documentation.
Parquet metadata also helps the reader skip some work before decoding begins, which is one of the format’s core strengths. See the metadata documentation.

I covered the Parquet scan path in more detail in my earlier DZone article, Understanding Parquet Scans, so I’ll keep this recap focused on what changes in the read path as the workload becomes more selective.

A data page is laid out roughly like this:

```text
[ repetition levels ] [ definition levels ] [ values ]
```

Those pages do not hold only values. They also hold structural information needed to reconstruct rows, especially when nested data is involved.

This is also the lens used in the Lance paper. It argues that structural encoding, especially repetition, validity, and page layout, has a direct impact on random-access cost, read amplification, and decode overhead.

When a reader evaluates a predicate over Parquet data, it first uses metadata to decide which row groups may be relevant. It can skip some work at that level, which is one of Parquet’s strengths. But once a row group has been selected, the reader still needs to read column pages, decode them, reconstruct the row structure, and only then evaluate the filter.

That path is efficient when a query is scanning a large fraction of the dataset. It is less efficient when the result is tiny. The smallest useful unit of work is still a chunk.

This is why selective queries can feel disproportionately expensive in Parquet. The format is doing exactly what it was designed to do. It is just optimized around chunk-level processing rather than lookup-oriented access.

What Changes in Lance

Lance changes that path earlier. Instead of treating most queries as scan-first, Lance uses dataset metadata and access structures to narrow the read before reconstruction begins. The official read and write guide is a good starting point for the dataset API used in the examples below.
The reader can identify relevant fragments, read only the necessary data, and return results without paying the same chunk-level decode cost across the rest of the dataset.

For vector workloads, the index becomes even more important because the query is not looking for an exact predicate match. It is looking for nearby vectors.

That is the practical meaning behind the benchmark. The query returns 100 rows in both cases, but the amount of data processed before those rows are produced is different.

This distinction becomes clearer as the dataset grows because Parquet’s work still tracks chunk boundaries while Lance’s work is tied more closely to the size of the result.

Use Case 2: Vector Similarity

Filtered reads are still part of traditional analytics. Vector search is a different kind of workload.

A vector is just a list of numbers that represents something like text, an image, or a user. Instead of filtering by exact values, a system compares vectors and returns the nearest matches. A traditional query looks for rows that satisfy a predicate. A vector query looks for rows that are similar.

In larger systems, vector search is usually implemented with an approximate nearest neighbor index, often called an ANN index. The index avoids comparing the query vector against every stored vector. Instead, it narrows the search to candidates that are likely to be close. This trades a small amount of exactness for much faster retrieval.

In many data lake architectures, this index lives outside the analytical dataset. Vectors may be stored in Parquet, then copied into a separate vector database or indexing service. That creates another pipeline to maintain and another consistency problem to manage.

This is why vector-native storage matters. The important change is not only that vectors can be stored.
It is that retrieval becomes part of the dataset access path.

That difference sounds abstract until you tie it to something familiar. This shows up in semantic search, recommendations, LLM retrieval, and image similarity. In each case, the system is not looking for an exact value. It is looking for nearby representations.

The shape of the query changes from this:

```text
WHERE id = 42 → exact match
```

to this:

```text
query → vector → nearest neighbors
```

That changes what the storage layer needs to support.

Here is a benchmark for the vector case:

```python
# Sketch consistent with the sample run below; the dataset path and
# variable names are illustrative assumptions.
import os
import shutil
import time

import numpy as np
import pyarrow as pa
import lance

N = 100_000   # number of vectors
DIM = 128     # vector dimensionality
K = 5         # neighbors to return
PATH = "vectors.lance"


def main():
    if os.path.exists(PATH):
        shutil.rmtree(PATH)

    print(f"Generating {N} vectors of dimension {DIM}...")
    vectors = np.random.rand(N, DIM).astype(np.float32)

    # Lance expects a fixed-size list column for vectors
    vector_column = pa.FixedSizeListArray.from_arrays(
        pa.array(vectors.reshape(-1)), DIM
    )
    table = pa.table({"id": pa.array(np.arange(N)), "vector": vector_column})

    start = time.time()
    lance.write_dataset(table, PATH, mode="overwrite")
    print("Lance write time:", time.time() - start)

    dataset = lance.dataset(PATH)
    query = np.random.rand(DIM).astype(np.float32)

    # No index is built here, so this runs as an exhaustive search
    start = time.time()
    result = dataset.to_table(
        nearest={"column": "vector", "q": query, "k": K}
    ).to_pandas()
    print("Vector search time:", time.time() - start)
    print(result)


if __name__ == "__main__":
    main()
```

Here is a sample run from my laptop:

```shell
(parquet-lance-demo) hitarth@hitarth parquet_lance % python vector.py
Generating 100000 vectors of dimension 128...
Lance write time: 0.07955217361450195
Vector search time: 0.13872408866882324
      id                                             vector  _distance
0  50506  [0.053919002, 0.36111426, 0.2145877, 0.9197419...  12.631166
1  41428  [0.17633885, 0.71251565, 0.072742924, 0.759959...  12.962575
2   3216  [0.06450701, 0.24716537, 0.41617322, 0.624773,...  13.008584
3  50216  [0.13460344, 0.9618073, 0.8334099, 0.56230646,...  13.097234
4  75019  [0.35094073, 0.11819457, 0.44928855, 0.0426102...  13.124901
```

This wrote 100,000 vectors of dimension 128 and returned the top 5 nearest vectors in about 139 ms. The exact number is less important than the query path itself: the search runs directly against the dataset and returns nearest neighbors with distances.

Parquet can store the same vector column, but it does not provide a native nearest-neighbor query path. In practice, the workflow usually looks like this:

```text
Parquet → extract vectors → build index → query
```

With Lance, the storage layer participates directly in the query:

```text
dataset → query directly
```

That is not just a performance difference. It is a capability difference. Lance’s official documentation also exposes SDK-level APIs for working with datasets and vector-oriented workflows through the SDK docs.

Lance expects vector columns as fixed-size arrays rather than generic variable-length Python lists. That lets the system reason about dimensionality during query execution. In other words, the structure of the stored data is part of making the query possible.

Tradeoffs and Operational Considerations

This comparison should not be read as "Lance replaces Parquet." The two formats are useful in different parts of a data platform.

Parquet remains the safer default for broad analytical workloads.
It has mature support across query engines, data lakes, catalogs, ingestion systems, and governance tooling. If the workload is mostly batch analytics, reporting, or large aggregations, Parquet’s scan-oriented design is still a very good fit.

Lance becomes interesting when the workload starts to depend on repeated selective access, lower-latency retrieval, or vector-native queries. In those cases, avoiding unnecessary decoding or avoiding a separate vector indexing pipeline can matter more than raw scan throughput.

A simplified comparison looks like this:

| area | parquet | lance |
|---|---|---|
| Best fit | Large analytical scans | Selective reads and vector retrieval |
| Read path | Row groups and pages | Fragment and index-aware access |
| Predicate filtering | Metadata and page-level pruning | Metadata/index-assisted narrowing |
| Vector search | Usually external system | Native query path |
| Ecosystem maturity | Very mature | Emerging |
| Engine compatibility | Broad support across engines | Narrower ecosystem |
| Updates | Usually rewrite and compact | Dataset-level mutation support |
| Operational default | Strong default for data lakes | Better fit for specialized access patterns |

The operational question is not which format is generally better. The better question is where the workload spends its time. If most queries scan large portions of data, Parquet is still the right default. If the workload repeatedly asks for a small number of rows or nearest neighbors, a lookup-oriented or vector-native format becomes more attractive.

What the Two Examples Show Together

The filtered read example and the vector example highlight two different consequences of the same design choice.

In the filtered read case, the difference appears as execution cost. Both systems can answer the query, but the amount of data processed before returning the result is different.

In the vector case, the difference appears as capability. One format stores the data, while the other format also provides a native query path over it.
Both cases come back to the same question: how much data must the system process before it can produce the answer? For scan-heavy analytics, Parquet remains a strong fit. That is the read path it was built for. But when workloads shift toward selective access, repeated reads, or similarity search, the main question changes. The difference is no longer just how fast data can be scanned. It is how much data the reader must process before returning the result. That is the broader storage trend. File formats are becoming more involved in the read path itself. They increasingly encode assumptions about access patterns, indexing, and retrieval. As analytical, ML, and search workloads move closer together, storage layout becomes part of query design.
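The read-amplification argument can be made concrete with a little arithmetic. The sketch below is a toy model, not a measurement: it assumes ids are stored in sorted order, that each row group carries min/max statistics for id, and that a row group holds 1,048,576 rows (an assumed size chosen for illustration). It ignores pages, compression, and I/O, and only counts rows a chunk-granular reader would decode to answer `id < 100`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    min_id: int
    max_id: int
    num_rows: int


def build_row_groups(total_rows, rows_per_group):
    """Split a sorted id column [0, total_rows) into row groups with stats."""
    groups = []
    for start in range(0, total_rows, rows_per_group):
        end = min(start + rows_per_group, total_rows)
        groups.append(RowGroup(min_id=start, max_id=end - 1, num_rows=end - start))
    return groups


def rows_decoded(groups, bound, prune=True):
    """Rows decoded to answer `id < bound` at row-group granularity."""
    if not prune:
        # No statistics used: every row group is read and decoded
        return sum(g.num_rows for g in groups)
    # With min/max pruning, only groups that may contain a match are decoded
    return sum(g.num_rows for g in groups if g.min_id < bound)


def read_amplification(total_rows, rows_per_group, bound, prune=True):
    """Return (rows decoded, rows returned, amplification factor)."""
    if bound < 1:
        raise ValueError("bound must be at least 1")
    groups = build_row_groups(total_rows, rows_per_group)
    decoded = rows_decoded(groups, bound, prune)
    returned = min(bound, total_rows)
    return decoded, returned, decoded / returned


for n in (1_000_000, 5_000_000, 10_000_000):
    decoded, returned, factor = read_amplification(n, 1_048_576, 100)
    print(f"{n:>10,} rows: decode {decoded:,} to return {returned} ({factor:,.0f}x)")
```

Calling `read_amplification(n, 1_048_576, 100, prune=False)` models a reader that skips nothing, where the decoded count equals the dataset size. Under these assumptions, pruning bounds the work by the row-group size rather than the result size, which is the gap a lookup-oriented layout tries to close.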

By Hitarth Trivedi

Azure Databricks for Scalable MLOps and Feature Engineering With Apache Spark, Delta Lake, and MLflow

Raw data doesn't win model competitions. Features do. And when your raw data is tens of billions of rows sitting across multiple sources, you can't afford to run pandas in a notebook and call it a day.

In this tutorial, I'll walk through building a production-grade feature engineering pipeline on Azure Databricks using:

- Apache Spark for distributed transformation at scale
- Delta Lake for reliable, versioned feature storage with ACID guarantees
- MLflow for tracking feature pipeline runs, parameters, and the models trained on top of them

The use case is a customer churn prediction system, but the patterns apply to any ML feature pipeline.

Architecture Overview

The pipeline follows the Medallion Architecture, a layered approach where data gets progressively cleaner and more feature-ready as it moves from Bronze to Silver to Gold. MLflow sits across all three layers, tracking every run.

Pipeline Flow

[Figure: pipeline flow diagram]

Layer Breakdown

| Layer | Delta Table | What happens here | Typical latency |
|---|---|---|---|
| Bronze | churn.bronze.events | Raw ingest, no transforms, append only | Minutes |
| Silver | churn.silver.customers | Deduplication, null handling, schema enforcement | Minutes |
| Gold | churn.gold.features | Aggregations, window functions, encoding | Minutes to hours |
| MLflow Run | N/A | Training, metric logging, artifact storage | Hours |
| Registry | N/A | Versioned model store, stage promotion | On demand |

Step 1. Bronze Layer: Raw Ingest

The Bronze layer is append-only. No transforms. No business logic. Just get the data in and preserve it exactly as it arrived so you can always replay from source.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Read raw events from ADLS Gen2 / Event Hub / source of choice.
# Replace <container> and <storage-account> with your own values.
raw_events = spark.read.format('json').load(
    'abfss://<container>@<storage-account>.dfs.core.windows.net/events/'
)

# Add ingestion metadata; never mutate source columns
bronze_df = raw_events.withColumn('_ingested_at', current_timestamp()) \
                      .withColumn('_source', lit('events_api'))

# Write to Bronze Delta table: append only, no overwrites
bronze_df.write \
    .format('delta') \
    .mode('append') \
    .option('mergeSchema', 'true') \
    .saveAsTable('churn.bronze.events')

print(f"Bronze rows written: {bronze_df.count()}")
```

Why append-only? If your downstream pipeline produces bad features, you want to replay from Bronze without re-ingesting from source. Overwriting Bronze breaks that ability.

Step 2. Silver Layer: Clean and Validate

Silver is where you enforce schema, handle nulls, deduplicate, and standardize. Think of it as your canonical, trusted dataset.
```python
from pyspark.sql.functions import col, lit, to_timestamp, when, trim, upper
from delta.tables import DeltaTable

bronze = spark.table('churn.bronze.events')

silver_df = bronze \
    .filter(col('customer_id').isNotNull()) \
    .filter(col('event_type').isNotNull()) \
    .dropDuplicates(['customer_id', 'event_id']) \
    .withColumn('event_ts', to_timestamp(col('event_timestamp'))) \
    .withColumn('event_type', upper(trim(col('event_type')))) \
    .withColumn('country_code',
        when(col('country').isNull(), lit('UNKNOWN'))
        .otherwise(upper(col('country')))) \
    .select(
        'customer_id', 'event_id', 'event_type',
        'event_ts', 'country_code', 'product_id',
        'session_id', '_ingested_at',
    )

# Upsert into Silver using Delta MERGE: idempotent on re-runs
if DeltaTable.isDeltaTable(spark, 'churn.silver.customers'):
    silver_table = DeltaTable.forName(spark, 'churn.silver.customers')
    silver_table.alias('tgt').merge(
        silver_df.alias('src'),
        'tgt.customer_id = src.customer_id AND tgt.event_id = src.event_id'
    ).whenNotMatchedInsertAll().execute()
else:
    silver_df.write.format('delta').saveAsTable('churn.silver.customers')

print(f"Silver table updated. Total rows: {spark.table('churn.silver.customers').count()}")
```

Step 3. Gold Layer: Feature Engineering

This is the heart of the pipeline. We compute aggregated and encoded features that the model will actually train on.

```python
from datetime import date

from pyspark.sql.functions import (
    col, count, countDistinct, sum as _sum, datediff,
    max as _max, min as _min, current_date, lit, when
)

silver = spark.table('churn.silver.customers')

# ------------------------------------------------------------------
# 1. Aggregate features per customer over 30 / 90 day windows
# ------------------------------------------------------------------
today = current_date()                # Spark column, used inside expressions
run_date = date.today().isoformat()   # plain string, used in replaceWhere

agg_features = silver \
    .withColumn('days_since_event', datediff(today, col('event_ts'))) \
    .groupBy('customer_id') \
    .agg(
        count('event_id').alias('total_events'),
        countDistinct('session_id').alias('total_sessions'),
        countDistinct('product_id').alias('distinct_products'),
        _sum(when(col('days_since_event') <= 30, 1).otherwise(0)).alias('events_last_30d'),
        _sum(when(col('days_since_event') <= 90, 1).otherwise(0)).alias('events_last_90d'),
        _max('event_ts').alias('last_event_ts'),
        _min('event_ts').alias('first_event_ts'),
    ) \
    .withColumn('days_since_last_event', datediff(today, col('last_event_ts'))) \
    .withColumn('customer_tenure_days', datediff(today, col('first_event_ts'))) \
    .withColumn('avg_events_per_day',
        col('total_events') / (col('customer_tenure_days') + 1))

# ------------------------------------------------------------------
# 2. Encode churn risk tier as ordinal feature
# ------------------------------------------------------------------
feature_df = agg_features \
    .withColumn('recency_tier',
        when(col('days_since_last_event') <= 7, lit(3))     # active
        .when(col('days_since_last_event') <= 30, lit(2))   # at risk
        .otherwise(lit(1))                                  # churned
    ) \
    .withColumn('engagement_score',
        (col('events_last_30d') * 0.6 + col('events_last_90d') * 0.4)
        / (col('customer_tenure_days') + 1)
    )

# ------------------------------------------------------------------
# 3. Write to Gold feature store: overwrite only today's feature_date slice
# ------------------------------------------------------------------
# The original interpolated the current_date() column object into the
# replaceWhere string, which does not produce a valid date literal.
feature_df \
    .withColumn('feature_date', lit(run_date).cast('date')) \
    .write \
    .format('delta') \
    .mode('overwrite') \
    .option('replaceWhere', f"feature_date = '{run_date}'") \
    .saveAsTable('churn.gold.features')

print(f"Gold features written: {feature_df.count()} customers")
```

Step 4. MLflow: Track the Training Run

With features in Gold, we hand off to MLflow to train, track, and register the model. Notice we log the Delta table version so we can always reproduce exactly which feature snapshot trained which model.

```python
import mlflow
import mlflow.sklearn
from delta.tables import DeltaTable
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import pandas as pd

mlflow.set_experiment('/churn-prediction/feature-pipeline')

# Read Gold features and capture the Delta version for reproducibility
gold_table = DeltaTable.forName(spark, 'churn.gold.features')
delta_version = gold_table.history(1).select('version').collect()[0][0]

features_pdf = spark.table('churn.gold.features').toPandas()

FEATURE_COLS = [
    'total_events', 'total_sessions', 'distinct_products',
    'events_last_30d', 'events_last_90d', 'days_since_last_event',
    'customer_tenure_days', 'avg_events_per_day',
    'recency_tier', 'engagement_score',
]
# The Gold code above does not create this column; it assumes a 'churned'
# label has been joined onto the feature table.
TARGET = 'churned'

X = features_pdf[FEATURE_COLS]
y = features_pdf[TARGET]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name=f'gbm-features-v{delta_version}') as run:

    params = {'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.05}
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Log everything
    mlflow.log_params(params)
    mlflow.log_metric('roc_auc', roc_auc_score(y_test, y_prob))
    mlflow.log_metric('f1_score', f1_score(y_test, y_pred))
    mlflow.log_param('delta_feature_version', delta_version)
    mlflow.log_param('feature_columns', FEATURE_COLS)
    mlflow.log_param('training_rows', len(X_train))

    # Log model with a signature inferred from matching inputs and outputs
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model,
        artifact_path='churn-gbm',
        signature=signature,
        registered_model_name='churn-prediction-gbm',
    )

    print(f"Run ID: {run.info.run_id}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Feature Delta version logged: {delta_version}")
```

Bonus: Delta Lake Time Travel for Feature Reproducibility

One of the best things about Delta Lake is time travel. If a model behaves unexpectedly in production, you can reload the exact feature snapshot it was trained on.

```python
# Reload the exact feature version that trained a specific model run
import mlflow

run = mlflow.get_run('your-run-id-here')
feature_version = int(run.data.params['delta_feature_version'])

# Rehydrate that exact feature snapshot
historical_features = spark.read \
    .format('delta') \
    .option('versionAsOf', feature_version) \
    .table('churn.gold.features')

print(f"Loaded feature snapshot from Delta version {feature_version}")
print(f"Row count: {historical_features.count()}")

# You can now retrain on the exact same data to reproduce the result
```

Service Comparison

| Tool | Role in pipeline | Why not the alternative |
|---|---|---|
| Apache Spark | Distributed feature computation | Pandas (single node, OOM at scale), Dask (less native Databricks integration) |
| Delta Lake | Feature storage with versioning | Parquet (no ACID, no time travel), Hive tables (no merge support) |
| MLflow Tracking | Experiment and param logging | Manual logging (not reproducible), W&B (extra cost, less native on Databricks) |
| MLflow Registry | Model versioning and promotion | Custom model store (more ops overhead) |
| Medallion Architecture | Pipeline layer separation | Flat pipelines (hard to debug, no replay capability) |
| Delta MERGE | Idempotent Silver upserts | Overwrite (destroys history), append (creates duplicates) |

Things to Watch in Production

- Shuffle partitions matter. Spark defaults to 200 shuffle partitions, which is fine for small data but will bottleneck at scale. Set spark.conf.set("spark.sql.shuffle.partitions", "auto") on Databricks Runtime 10+ or tune it manually to 2-3x your core count.
- Z-ordering on Gold features. If you're querying Gold by customer_id frequently, add OPTIMIZE churn.gold.features ZORDER BY (customer_id) after the write. This co-locates related data and can cut query times substantially on large tables.
- Log Delta version in every MLflow run. This is non-negotiable for reproducibility. Without it you can't prove which feature snapshot trained which model, which becomes a compliance problem in regulated industries.
- Cluster autoscaling for feature jobs. Feature engineering jobs tend to have spiky resource needs (big during aggregation, small during writes). Enable autoscaling on your Databricks cluster and set a min/max node count rather than a fixed size.

Wrapping Up

The combination of Spark, Delta Lake, and MLflow on Databricks gives you a feature engineering pipeline that is reproducible (Delta time travel + MLflow param logging), scalable (Spark handles billions of rows), and auditable (every run is tracked, every feature version is stored).

The Medallion Architecture keeps the pipeline modular: you can rerun just the Gold layer if you change a feature definition without touching Bronze or Silver, and MLflow ties model performance back to the exact feature version that produced it.
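To make the Gold-layer formulas easier to check by hand, here is a small plain-Python sketch that applies the same arithmetic to a single customer's events. It is an illustration only: the function name `customer_features` and the sample dates are made up for this example, and it works on calendar dates rather than Spark timestamps.

```python
from datetime import date


def customer_features(event_dates, today):
    """Mirror the Gold-layer formulas for one customer's event dates."""
    if not event_dates:
        raise ValueError("at least one event is required")

    days_since = [(today - d).days for d in event_dates]

    events_last_30d = sum(1 for d in days_since if d <= 30)
    events_last_90d = sum(1 for d in days_since if d <= 90)
    days_since_last_event = min(days_since)   # most recent event
    customer_tenure_days = max(days_since)    # first event

    # Ordinal recency tier: 3 = active, 2 = at risk, 1 = churned
    if days_since_last_event <= 7:
        recency_tier = 3
    elif days_since_last_event <= 30:
        recency_tier = 2
    else:
        recency_tier = 1

    # The +1 in the denominator avoids division by zero for new customers
    denominator = customer_tenure_days + 1
    return {
        "total_events": len(event_dates),
        "events_last_30d": events_last_30d,
        "events_last_90d": events_last_90d,
        "days_since_last_event": days_since_last_event,
        "customer_tenure_days": customer_tenure_days,
        "avg_events_per_day": len(event_dates) / denominator,
        "recency_tier": recency_tier,
        "engagement_score": (events_last_30d * 0.6 + events_last_90d * 0.4) / denominator,
    }


# Hypothetical customer with four events, evaluated on 2024-06-30
sample = customer_features(
    [date(2024, 6, 28), date(2024, 6, 10), date(2024, 4, 15), date(2024, 1, 2)],
    today=date(2024, 6, 30),
)
for name, value in sample.items():
    print(f"{name}: {value}")
```

For this sample, two events fall in the last 30 days and three in the last 90, the tenure is 180 days, and the engagement score works out to (2 × 0.6 + 3 × 0.4) / 181, roughly 0.013. Note how the tenure term in the denominator pulls the score down for long-lived customers even when recent activity is steady.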

By Jubin Abhishek Soni
