The most advanced
open-table optimization
for data lakes

Built for Seamless Integration

Store your data where you want, analyze it how you like — our platform works out of the box with leading object storage solutions and is fully compatible with your favorite BI tools and data processing engines.
Works in any Data Lake
Compatible with all BI and Transformation Tools

How we organize our data

Qbeast adds a spatial index to Delta Lake with minimal additional metadata describing the global distribution of values in the table.
$ tail -1 _delta_log/00000000000000000000.json | jq
{
  "add": {
    "path": "e24973a3-b8ba-4bez-8b2f-4ac60a55458c.parquet",
    "modificationTime": 1634732079000,
    "dataChange": true,
    "tags": {
      "rowCount": 177,
      "indexedColumns": "ss_cdemo_sk,ss_hdemo_sk",
      "cube": "Qw",
      "space": {
        "timestamp": 1634732047219,
        "transformations": [
          { "min": -960393.5, "max": 2881188.5, "scale": 6.94540843E-5 },
          ...
        ]
      },
      "minWeight": -2147483648, "maxWeight": 2147483647
    },
  },
...
}
Qbeast applies the same approach to Apache Hudi: the index configuration and revision metadata are stored alongside the standard Hudi table properties.
qbeast:/tmp/qbeast-table$ grep ^qbeast.configuration= .hoodie/hoodie.properties | jq
qbeast.configuration={
  "hoodie.datasource.write.precombine.field": "date",
  "cubeSize": "3000000",
  "hoodie.metadata.index.column.stats.enable": "true",
  "samplingenabled": "true",
  "qbeast.revision.1": {
    "revisionID": 1,
    "timestamp": 1751555947465,
    "tableID": "qbeast-table/",
    "desiredCubeSize": 3000000,
    "columnTransformers": [
      {
        "className": "QuantilesTransformer",
        "columnName": "date"
      }
    ],
    "transformations": [
      {
        "className": "QuantilesTransformation",
        "quantiles": [
          "2024-05-14T13:59:59.000Z",
          "2024-07-17T14:10:16.592Z",
          ...
        ]
      }
    ]
  }
}
Qbeast optimization works with existing Delta Lake table readers: optimized tables remain fully compatible with any Delta Lake client application.
tree ./
./
├── _delta_log/
│   └── 00000000000000000000.json
├── _qbeast/
│   └── insights/
│       └── ...
├── 03ac7b97-99a3-4d48-a258-23b91ee53e15.parquet
├── 3c642315-e0f9-4f94-a2ce-41b28fc0964f.parquet
├── 55876ada-cdde-4214-b92b-001b83c0f2ae.parquet
├── ae0142b8-c720-453a-aab4-de05a262ff7f.parquet
├── bbc47b42-86f9-4c72-9e2c-b7ccfd978e12.parquet
├── e24973a3-b8ba-4be2-8b2f-4ac60a55458c.parquet
├── e25dfb75-96d2-4887-86b3-d3c463fc2c29.parquet
├── e921d672-dc3a-4f49-b592-e6005b2ea89.parquet
├── eb140b57-1e8f-4e1a-85b3-9df71069bc80.parquet
├── f7ef973-a796-43ee-a0fd-889f8819034.parquet
└── fea16f6f-c33a-4bde-aa11-863845e83b90.parquet

2 directories, 13 files

Data Skipping

Using an index avoids reading the entire dataset, reducing the amount of data transferred and speeding up the query. Qbeast lets you index your data on as many columns as you need and filter at the file level to answer the search.
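The idea behind file-level skipping can be shown with a toy sketch (plain Python with made-up file statistics, not Qbeast's actual implementation): each data file carries min/max stats for an indexed column, and a range predicate prunes every file whose stats cannot overlap the query range.

```python
# Toy illustration of index-based data skipping (not Qbeast internals):
# each file carries min/max stats for an indexed column, and a range
# predicate prunes files whose stats cannot contain matching rows.

files = [
    {"path": "a.parquet", "min_id": 0,    "max_id": 999},
    {"path": "b.parquet", "min_id": 1000, "max_id": 1999},
    {"path": "c.parquet", "min_id": 2000, "max_id": 2999},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min_id, max_id] range overlaps [lo, hi)."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] < hi]

# A query for id < 1000 touches a single file instead of all three.
print(files_to_scan(files, 0, 1000))  # ['a.parquet']
```

The more selective the predicate on an indexed column, the larger the fraction of files that never need to be opened.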

Approximate Queries

Qbeast enables approximate queries: approximate answers at a fraction of the cost of executing the full query. With Qbeast-Spark, you can access a statistically representative sample of the dataset and return the result of the query within a margin of error.
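The statistics behind this can be sketched in plain Python (a toy example on synthetic data, not Qbeast internals): aggregating over a 10% sample yields an estimate close to the exact answer, and the sample's standard error quantifies the margin.

```python
# Toy illustration of approximate aggregation over a sample (not Qbeast
# internals): estimate a mean from a 10% sample and attach a standard
# error instead of scanning every row.

import random
import statistics

random.seed(42)
ages = [random.randint(18, 80) for _ in range(100_000)]

sample = random.sample(ages, k=len(ages) // 10)   # 10% sample
estimate = statistics.mean(sample)
stderr = statistics.stdev(sample) / (len(sample) ** 0.5)

exact = statistics.mean(ages)
print(f"estimate = {estimate:.2f} +/- {stderr:.2f}, exact = {exact:.2f}")
```

The sample estimate lands within a few standard errors of the exact mean while touching a tenth of the data.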

File Optimization

When writing new data, the file layout can degrade, producing many small files or excessively large ones, which makes it hard to retrieve results with minimal noise. Optimization fixes the overflowed areas and improves each query's useful payload by producing more fine-grained files.
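One way to picture file optimization is small-file compaction (a simplified sketch, not Qbeast's actual algorithm): group undersized files into batches close to a target file size so readers open a few well-sized files instead of many tiny ones.

```python
# Toy illustration of small-file compaction planning (not Qbeast's
# algorithm): greedily pack files into batches of roughly `target` bytes.

TARGET_BYTES = 128 * 1024 * 1024  # aim for ~128 MiB output files (assumption)

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group files into compaction batches of roughly `target` bytes."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 20 MiB files collapse into two compaction batches.
sizes = [20 * 1024 * 1024] * 10
print([len(g) for g in plan_compaction(sizes)])  # [6, 4]
```

Qbeast's index additionally keeps the records inside each file grouped by the indexed dimensions, so compaction does not undo data skipping.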

Easy to Deploy

It works with any data lake storage (S3, Azure and GCS) and is compatible with any BI/ML tool of your choice. It takes only 10 minutes to deploy and start enjoying the benefits of querying Qbeast tables.

Getting Started

Using Qbeast with Spark

// Write to a path with Qbeast
val path = "/path/to/qbeast"
df.write.format("qbeast").option("columnsToIndex", "id,age").save(path)

// Or save the data as a table
df.write.format("qbeast").option("columnsToIndex", "id,age").saveAsTable("qbeast_table")

// Read from the underlying format
val myTableFormat = "delta" // Or "hudi" or "iceberg", depending on the configuration you choose
val data = spark.read.format(myTableFormat).load(path)

// And run any type of query
data.filter("id < 1000 and age = 42").show()

// Or use qbeast for sampling pushdown
spark.read.format("qbeast").load(path).sample(0.1).show()
# Write to a path with Qbeast
path = "/path/to/qbeast"
df.write.format("qbeast").option("columnsToIndex", "id,age").save(path)

# Or save the data as a table
df.write.format("qbeast").option("columnsToIndex", "id,age").saveAsTable("qbeast_table")

# Read from the underlying format
my_table_format = "delta"  # Or "hudi" or "iceberg", depending on the configuration you choose
data = spark.read.format(my_table_format).load(path)

# And run any type of query
data.filter("id < 1000 and age = 42").show()

# Or use qbeast for sampling pushdown
spark.read.format("qbeast").load(path).sample(0.1).show()
-- Create a Qbeast table with an index
CREATE TABLE qbeast_table(id INT, age INT, name STRING)
USING qbeast OPTIONS (columnsToIndex 'id,age');

-- Insert with a select from another table
INSERT INTO qbeast_table SELECT * FROM df_view;

-- You can also query the Qbeast-managed table directly
SELECT * FROM qbeast_table;

-- Run range queries
SELECT * FROM qbeast_table WHERE id < 1000 AND age = 42;

-- And execute sample queries
SELECT * FROM qbeast_table TABLESAMPLE (10 PERCENT);

Examples

Seamlessly integrate Qbeast with Databricks, Snowflake and more. Automate your data workflows and unlock faster, sharper insights so your team can focus on what matters.

Multicolumn Filtering

SELECT * FROM customers WHERE age > 20 AND city = 'Barcelona';
Indexing on multiple columns lets Qbeast skip every file that cannot match the predicate, reducing data transfer and answering the search by reading only the relevant files.

Approximate Queries

SELECT avg(age) FROM customers TABLESAMPLE (1 PERCENT)
WHERE city = 'Barcelona';
Qbeast enables approximate queries: approximate answers at a fraction of the cost of executing the full query. With Qbeast, you can access a statistically representative sample of the dataset and return the result of the query within a margin of error.

Continuous Optimization

QbeastTable.forPath(spark, tmpDir).optimize()
As your table grows, Qbeast optimization dynamically adjusts to the shape and density of your data. Our unique index delivers balanced file sizes, with records grouped on the dimensions that matter for your business and use cases.

Reduce Shuffling for Joins & Aggregations

SELECT c.country,
       COUNT(DISTINCT o.customer_id) AS total_customers,
       SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY c.country;
Shuffling is often a dominant cost in analytics queries. Qbeast optimization can reduce or eliminate costly data shuffling for joins and aggregations by strategically scheduling data partition tasks to the same executors.
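Why co-located partitions remove the shuffle can be shown with a toy sketch (plain Python with tiny invented tables, not Qbeast's scheduler): when both sides of a join are hash-partitioned on the join key the same way, matching keys always land in the same partition index, so each partition pair can be joined locally with no data movement.

```python
# Toy illustration of shuffle-free joins via co-partitioning (not Qbeast's
# scheduler): hash-partition both tables on the join key, then join each
# pair of same-index partitions locally.

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

# Hypothetical mini tables mirroring the query above.
orders = [{"customer_id": c, "amount": a} for c, a in [(1, 10), (2, 20), (1, 5)]]
customers = [{"customer_id": 1, "country": "ES"}, {"customer_id": 2, "country": "FR"}]

n = 4
o_parts = hash_partition(orders, "customer_id", n)
c_parts = hash_partition(customers, "customer_id", n)

# Matching keys share a partition index, so joining partition i of orders
# with partition i of customers already covers every match.
joined = [
    (o["customer_id"], c["country"], o["amount"])
    for op, cp in zip(o_parts, c_parts)
    for o in op
    for c in cp
    if o["customer_id"] == c["customer_id"]
]
print(sorted(joined))
```

In a distributed engine, keeping partition i of both tables on the same executor turns this into a purely local join, which is the kind of scheduling Qbeast optimization exploits.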