Cloud Warehouses Are All the Same, Until Layout Makes the Difference

From Chaos to Canvas: Repainting the Lakehouse with Multidimensional Indexing

In The Nimble Lakehouse, we explored why so many modern data stacks stall out after launch. Flexibility won the adoption war—but maintaining performance at scale is a different battle. The hard part isn’t building a Lakehouse. It’s keeping it fast.

Dig into the details of each vendor, though, and you find the cloud data warehouse market has collapsed into sameness.

Across vendors, the pricing, scaling model, and query syntax may differ. But under the hood, they make the same core tradeoff: simple writes, expensive reads. Fast ingest at the cost of reactive, compute-hungry optimization.

Once you’ve seen enough pipelines, layout drift, and sort-then-cluster jobs, the pattern becomes obvious: modern warehouses optimize around data, not with it. Nothing enforces global ordering or clustering, because maintaining order at write time slows ingestion.

At low data volumes, this choice is invisible, but over months file counts soar and data files fragment. Queries that took seconds stretch to minutes, and queries that took minutes stretch to hours. Contracts forecast to cover multiple years become dangerously short, which is not a fun conversation to have with the CFO.

Optimization Became a Job, Not a Feature

To keep a Lakehouse running smoothly, and within budget, data files have to be well organized. Tables and files carry range metadata so that query engines can be selective about which files they read and process. However, these "file pruning" features are only as good as the data layout in the object store.
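To make the mechanics concrete, here is a minimal Python sketch of min/max pruning. The FileStats record and prune helper are invented for illustration rather than any engine's actual API; the point is simply that a file can be skipped only when its value range falls entirely outside the predicate.

    from dataclasses import dataclass

    @dataclass
    class FileStats:
        """Range metadata a query engine might hold for one data file."""
        path: str
        min_ts: int  # minimum timestamp value stored in the file
        max_ts: int  # maximum timestamp value stored in the file

    def prune(files: list[FileStats], lo: int, hi: int) -> list[FileStats]:
        """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
        return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]

    files = [
        FileStats("part-000.parquet", 0, 99),
        FileStats("part-001.parquet", 100, 199),
        FileStats("part-002.parquet", 200, 299),
    ]

    # A query for ts BETWEEN 120 AND 180 only needs to read one file.
    print([f.path for f in prune(files, 120, 180)])  # ['part-001.parquet']

If rows for the same range are scattered across every file because nothing maintained layout at write time, the min/max ranges all overlap and nothing can be pruned, which is exactly the drift described above.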

The Lakehouse ecosystem has provided two solutions to data layout: partitioning and sort-based clustering. Used independently or together, they have been the available toolset, but they quickly reveal limitations at scale:

  • Partitioning suffers from cardinality and skew. Partitioning on high-cardinality columns like timestamp or device_id leads to too many small files. Partitioning on low-cardinality columns like region provides limited selectivity. Skewed values, where data follows the 80/20 rule, create imbalance between files.
  • Sort-based clustering struggles with evolving query patterns. You can sort by one set of columns, but if the workload changes, or new dimensions become important, the layout no longer matches the access pattern (see the sketch after this list). Clustering is also expensive to maintain in real time, requiring frequent compaction and re-sorting jobs.
  • Write amplification breaks scale. Reorganizing data with each update creates unnecessary churn. Teams spend more time rewriting the table than querying it.
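The sort-order limitation is easy to demonstrate. In this hedged Python sketch (the table shape, column names, and file sizes are invented for illustration), files written in timestamp order prune perfectly on a timestamp predicate but not at all on device_id, because every file spans nearly the full device_id range:

    import random

    random.seed(7)

    # Rows sorted by ts (the clustering column); device_id is uncorrelated.
    rows = sorted(
        [(ts, random.randrange(1_000)) for ts in range(10_000)],
        key=lambda r: r[0],
    )

    # Split into 10 files of 1,000 rows and record per-file min/max stats.
    files = [rows[i : i + 1_000] for i in range(0, len(rows), 1_000)]
    stats = [
        {
            "ts": (min(r[0] for r in f), max(r[0] for r in f)),
            "device_id": (min(r[1] for r in f), max(r[1] for r in f)),
        }
        for f in files
    ]

    def files_read(col: str, lo: int, hi: int) -> int:
        """Count files whose [min, max] range for col overlaps [lo, hi]."""
        return sum(1 for s in stats if s[col][1] >= lo and s[col][0] <= hi)

    print(files_read("ts", 2_000, 2_999))     # 1 of 10: tight, disjoint ranges
    print(files_read("device_id", 100, 199))  # 10 of 10: every range overlaps

Re-sorting by device_id would only move the problem to the ts predicate; a single sort order can serve one leading dimension well, at the cost of the others.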

Figure 1: Partitioning and Sort-Based Clustering Optimization Techniques

To make do with these methods, customers have two choices:

1. Manage customer expectations by rationing or controlling access to the Lakehouse. When we do this, we limit the potential of the Lakehouse to improve the business and put a cap on our own business value.

2. Manually operate tables by separating use-cases between tables, creating tables for real-time ingestion or tables fit-to-purpose for specific end-users. Data engineering teams can spend countless hours on this and still struggle to constrain costs.

Without Indexes, It’s All Just Brute Force

Modern engines are brilliant. But even the smartest query planner can’t compensate for missing layout intelligence. Indexing used to be a given: B-trees, LSMs, bitmap indexes. They encoded locality, accelerated scans, and kept query plans lean. But in the rush for write throughput, the Lakehouse ecosystem has quietly discarded them.

That tradeoff has come due, with expensive repercussions:

  • Joins spill into full-table shuffles
  • Filters devolve into expensive scans
  • Approximate queries lose their edge due to wide reads
  • Optimization jobs perform expensive sorting and linearization

Without indexing, compute becomes the only knob you can turn—and it happens to be the biggest line-item on the bill.

Qbeast and the Return of the Index

Qbeast reintroduces indexing, but not in the same way legacy systems did. It brings layout back to the table—intelligently, incrementally, and transparently.

At its core is a multi-dimensional spatial index that organizes data based on value density across key columns. Instead of rewriting the whole table, Qbeast incrementally places new records into the right cube—preserving layout as it grows.
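As a rough intuition for what "placing a record into the right cube" means (this is an illustrative Python sketch of recursive multidimensional splitting, not Qbeast's actual algorithm), consider coordinates normalized to [0, 1) in each indexed dimension:

    def cube_id(point: tuple[float, ...], depth: int) -> tuple[int, ...]:
        """
        Assign a point (each coordinate normalized to [0, 1)) to a cube by
        recursive halving. At every level each dimension splits in two, so
        a cube has 2**len(point) children; the id is the path of child
        indices from the root down to `depth`.
        """
        path = []
        mins = [0.0] * len(point)
        size = 1.0
        for _ in range(depth):
            child = 0
            size /= 2
            for d, x in enumerate(point):
                if x >= mins[d] + size:  # point falls in the upper half
                    child |= 1 << d
                    mins[d] += size
            path.append(child)
        return tuple(path)

    # Nearby records in (insertion_date, contract_date, vehicle_class)
    # space land in the same cube; a distant record lands elsewhere,
    # without any rewrite of already-placed data.
    print(cube_id((0.10, 0.20, 0.30), depth=3))  # (0, 4, 2)
    print(cube_id((0.12, 0.22, 0.31), depth=3))  # (0, 4, 2)
    print(cube_id((0.90, 0.85, 0.10), depth=3))  # (3, 3, 1)

In a density-aware index the depth would adapt locally, splitting dense cubes further while leaving sparse regions shallow, and the cube ids double as the pruning key at read time; the fixed depth here is only for brevity.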

✅ The result?

  • 99%+ of irrelevant files skipped at read time
  • No lockstep with compaction schedules
  • No special query syntax or engine extensions

Queries can run flexibly, and engineering teams can support high-performance ingestion without having to design each table around specific consumers, use-cases, or queries.

In one customer example, we encountered a real-time table. Sort-based clustering wasn't considered for this table because updates arrived faster than optimization could keep up. To keep compaction efficient, the table was indexed exclusively on insertion date. However, one common query filtered on a contractual "as-of" date, which does not correlate with insertion order. The result was that each such query scanned the full table, nearly 24B records, taking ~30 minutes on our test cluster.

With Qbeast, we indexed on insertion date to maintain write efficiency, but also indexed on contract date and vehicle class. File skipping was then able to reduce scanning from 24B to 600M records: a reduction of 97.5%!

From Dashboards to Domains: Why Indexing Is Infrastructure

BI dashboards were just the beginning. Today’s data platforms are asked to do much more: they serve agentic backends through MCP servers, power search indexes, and support LLM pipelines. But most tables are still organized for yesterday’s workloads. Query latency, I/O cost, and model responsiveness all hinge on how well your data layout supports multidimensional access. And that requires structure.

  • Domain filtering is foundational. Whether trimming context for a chatbot or narrowing scope for a time-bounded query, structured filters like org_id, timestamp, and region should come first. Without an index, engines sift through hundreds of gigabytes to answer a question that touches megabytes.

Qbeast changes that by embedding layout-aware intelligence into every insert, so even high-velocity, multi-tenant data remains navigable without pipelines or post-hoc optimization.

As workloads shift from dashboards to LLM-powered applications, layout matters more than ever. Retrieval-Augmented Generation (RAG) pipelines often combine structured filters with vector similarity. Without pre-filtering, vector search must consider massive candidate sets—driving up token usage and latency.

Qbeast changes that:

  • Structured filters (e.g., region, timestamp, customer_id) prune the candidate set early
  • Downstream latency and GPU time collapse
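As an illustration (the schema and sizes are invented, and plain NumPy stands in for whatever vector store a pipeline actually uses), pre-filtering on structured columns shrinks the candidate set by orders of magnitude before the expensive similarity step:

    import numpy as np

    rng = np.random.default_rng(0)

    # 200k chunks: an embedding plus structured metadata per chunk.
    n, dim = 200_000, 64
    embeddings = rng.standard_normal((n, dim)).astype(np.float32)
    region = rng.integers(0, 50, n)      # structured column: region id
    timestamp = rng.integers(0, 365, n)  # structured column: day of year

    query = rng.standard_normal(dim).astype(np.float32)

    # Structured pre-filter first: one region, a two-week window.
    mask = (region == 7) & (timestamp >= 100) & (timestamp < 114)
    candidates = np.flatnonzero(mask)
    print(f"candidates after pre-filter: {candidates.size} of {n:,}")

    # Vector similarity only over the surviving candidates.
    scores = embeddings[candidates] @ query
    top = candidates[np.argsort(scores)[-5:][::-1]]
    print("top-5 chunk ids:", top)

With a layout-aware index, that pre-filter also translates into files never fetched from object storage, rather than rows discarded after a full scan.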

Because Qbeast accelerates from outside the query path, GPU data tools like NVIDIA RAPIDS and cuDF can integrate easily without bespoke integration engineering. Advancements in this space move so quickly that the only way to ensure your Lakehouse tables keep pace with innovation is to support the existing open standards.

This isn't just a query optimization—it's an architectural shift for real-time AI at scale.

Layout Intelligence: The Last Strategic Layer

Taking the long-term view, we can see this as the latest progression in the disaggregated analytics stack. Hadoop commoditized storage hardware and parallel processing. Columnar formats like Parquet standardized scan metadata and efficiency. Apache Spark introduced sophisticated in-memory computation, and applying these technologies in the public cloud brought elasticity.

Each wave has removed a bottleneck:

  • Storage is cost-efficient & abundant.
  • Compute is elastic & on-demand.
  • Engines are interchangeable.

But layout remains the lever most orgs haven’t pulled.

Qbeast makes layout intelligent, adaptive, and invisible. It’s not a feature—it’s a foundation. And it reclaims the original Lakehouse promise:

  • A single table for all access patterns
  • Performance without pipelines
  • Simplicity that survives scale