The most advanced
open-table optimization
for data lakes

Built for Seamless Integration

Store your data where you want, analyze it how you like — our platform works out of the box with leading object storage solutions and is fully compatible with your favorite BI tools and data processing engines.
Works in any Data Lake
Compatible with all BI and Transformation Tools

How we organize our data

Qbeast adds a spatial index to Delta Lake with minimal additional metadata describing the global distribution of values in the table.
$ tail -1 _delta_log/00000000000000000000.json | jq
{
  "add": {
    "path": "e24973a3-b8ba-4bez-8b2f-4ac60a55458c.parquet",
    "modificationTime": 1634732079000,
    "dataChange": true,
    "tags": {
      "rowCount": 177,
      "indexedColumns": "ss_cdemo_sk,ss_hdemo_sk",
      "cube": "Qw",
      "space": {
        "timestamp": 1634732047219,
        "transformations": [
          { "min": -960393.5, "max": 2881188.5, "scale": 6.94540843E-5 },
          ...
        ]
      },
      "minWeight": -2147483648, "maxWeight": 2147483647
    },
  },
...
}
Qbeast applies the same approach to Apache Hudi: the index configuration and revision metadata are stored alongside the standard Hudi table properties.
qbeast:/tmp/qbeast-table$ grep ^qbeast.configuration= .hoodie/hoodie.properties | jq
qbeast.configuration={
  "hoodie.datasource.write.precombine.field": "date",
  "cubeSize": "3000000",
  "hoodie.metadata.index.column.stats.enable": "true",
  "samplingenabled": "true",
  "qbeast.revision.1": {
    "revisionID": 1,
    "timestamp": 1751555947465,
    "tableID": "qbeast-table/",
    "desiredCubeSize": 3000000,
    "columnTransformers": [
      {
        "className": "QuantilesTransformer",
        "columnName": "date"
      }
    ],
    "transformations": [
      {
        "className": "QuantilesTransformation",
        "quantiles": [
          "2024-05-14T13:59:59.000Z",
          "2024-07-17T14:10:16.592Z",
          ...
        ]
      }
    ]
  }
}
Qbeast optimization works with existing Delta Lake table readers: optimized tables remain fully compatible with any Delta Lake client application.
tree ./
./
├── _delta_log/
│   └── 00000000000000000000.json
├── _qbeast/
│   └── insights/
│       └── ...
├── 03ac7b97-99a3-4d48-a258-23b91ee53e15.parquet
├── 3c642315-e0f9-4f94-a2ce-41b28fc0964f.parquet
├── 55876ada-cdde-4214-b92b-001b83c0f2ae.parquet
├── ae0142b8-c720-453a-aab4-de05a262ff7f.parquet
├── bbc47b42-86f9-4c72-9e2c-b7ccfd978e12.parquet
├── e24973a3-b8ba-4be2-8b2f-4ac60a55458c.parquet
├── e25dfb75-96d2-4887-86b3-d3c463fc2c29.parquet
├── e921d672-dc3a-4f49-b592-e6005b2ea89.parquet
├── eb140b57-1e8f-4e1a-85b3-9df71069bc80.parquet
├── f7ef973-a796-43ee-a0fd-889f8819034.parquet
└── fea16f6f-c33a-4bde-aa11-863845e83b90.parquet

2 directories, 13 files

Data Skipping

Using an index avoids reading the entire dataset, reducing the amount of data transferred and speeding up the query. Qbeast lets you index your data on as many columns as you need and filter at the file level to answer the search.
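The idea behind file-level skipping can be shown with a toy sketch (plain Python with made-up file statistics, not Qbeast's actual implementation): each data file carries min/max stats for an indexed column, and a range predicate prunes every file whose stats cannot overlap the query range.

```python
# Toy illustration of index-based data skipping (not Qbeast internals):
# each file carries min/max stats for an indexed column, and a range
# predicate prunes files whose stats cannot contain matching rows.

files = [
    {"path": "a.parquet", "min_id": 0,    "max_id": 999},
    {"path": "b.parquet", "min_id": 1000, "max_id": 1999},
    {"path": "c.parquet", "min_id": 2000, "max_id": 2999},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min_id, max_id] range overlaps [lo, hi)."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] < hi]

# A query for id < 1000 touches a single file instead of all three.
print(files_to_scan(files, 0, 1000))  # ['a.parquet']
```

The more selective the predicate on an indexed column, the larger the fraction of files that never need to be opened.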

Approximate Queries

Qbeast enables approximate queries: approximate answers at a fraction of the cost of executing the full query. With Qbeast-Spark, you can access a statistically representative sample of the dataset and return the result of the query within a margin of error.
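The statistics behind this can be sketched in plain Python (a toy example on synthetic data, not Qbeast internals): aggregating over a 10% sample yields an estimate close to the exact answer, and the sample's standard error quantifies the margin.

```python
# Toy illustration of approximate aggregation over a sample (not Qbeast
# internals): estimate a mean from a 10% sample and attach a standard
# error instead of scanning every row.

import random
import statistics

random.seed(42)
ages = [random.randint(18, 80) for _ in range(100_000)]

sample = random.sample(ages, k=len(ages) // 10)   # 10% sample
estimate = statistics.mean(sample)
stderr = statistics.stdev(sample) / (len(sample) ** 0.5)

exact = statistics.mean(ages)
print(f"estimate = {estimate:.2f} +/- {stderr:.2f}, exact = {exact:.2f}")
```

The sample estimate lands within a few standard errors of the exact mean while touching a tenth of the data.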

File Optimization

When writing new data, the file layout can degrade, producing many small files or excessively large ones, which makes it hard to retrieve results with minimal noise. Optimization fixes the overflowed areas and improves each query's useful payload by producing more fine-grained files.
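One way to picture file optimization is small-file compaction (a simplified sketch, not Qbeast's actual algorithm): group undersized files into batches close to a target file size so readers open a few well-sized files instead of many tiny ones.

```python
# Toy illustration of small-file compaction planning (not Qbeast's
# algorithm): greedily pack files into batches of roughly `target` bytes.

TARGET_BYTES = 128 * 1024 * 1024  # aim for ~128 MiB output files (assumption)

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group files into compaction batches of roughly `target` bytes."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 20 MiB files collapse into two compaction batches.
sizes = [20 * 1024 * 1024] * 10
print([len(g) for g in plan_compaction(sizes)])  # [6, 4]
```

Qbeast's index additionally keeps the records inside each file grouped by the indexed dimensions, so compaction does not undo data skipping.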

Easy to Deploy

It works with any data lake storage (S3, Azure and GCS) and is compatible with any BI/ML tool of your choice. It takes only 10 minutes to deploy and start enjoying the benefits of querying Qbeast tables.

Getting Started

Using Qbeast with Spark

// Write to a path with Qbeast
val path = "/path/to/qbeast"
df.write.format("qbeast").option("columnsToIndex", "id,age").save(path)

// Or save the data as a table
df.write.format("qbeast").option("columnsToIndex", "id,age").saveAsTable("qbeast_table")

// Read from the underlying format
val myTableFormat = "delta" // Or "hudi" or "iceberg", depending on the configuration you choose
val data = spark.read.format(myTableFormat).load(path)

// And run any type of query
data.filter("id < 1000 and age = 42").show()

// Or use qbeast for sampling pushdown
spark.read.format("qbeast").load(path).sample(0.1).show()
# Write to a path with Qbeast
path = "/path/to/qbeast"
df.write.format("qbeast").option("columnsToIndex", "id,age").save(path)

# Or save the data as a table
df.write.format("qbeast").option("columnsToIndex", "id,age").saveAsTable("qbeast_table")

# Read from the underlying format
my_table_format = "delta"  # Or "hudi" or "iceberg", depending on the configuration you choose
data = spark.read.format(my_table_format).load(path)

# And run any type of query
data.filter("id < 1000 and age = 42").show()

# Or use qbeast for sampling pushdown
spark.read.format("qbeast").load(path).sample(0.1).show()
-- Create a Qbeast table with an index
CREATE TABLE qbeast_table(id INT, age INT, name STRING)
USING qbeast OPTIONS (columnsToIndex 'id,age');

-- Insert with a select from another table
INSERT INTO qbeast_table SELECT * FROM df_view;

-- You can also query the Qbeast-managed table directly
SELECT * FROM qbeast_table;

-- Run range queries
SELECT * FROM qbeast_table WHERE id < 1000 AND age = 42;

-- And execute sample queries
SELECT * FROM qbeast_table TABLESAMPLE (10 PERCENT);

Examples

Seamlessly integrate Qbeast with Databricks, Snowflake and more. Automate your data workflows and unlock faster, sharper insights so your team can focus on what matters.

Multicolumn Filtering

SELECT * FROM customers WHERE age > 20 AND city = 'Barcelona';
Indexing on multiple columns lets Qbeast skip every file that cannot match the predicate, reducing data transfer and answering the search by reading only the relevant files.

Approximate Queries

SELECT avg(age) FROM customers TABLESAMPLE (1 PERCENT)
WHERE city = 'Barcelona';
Qbeast enables approximate queries: approximate answers at a fraction of the cost of executing the full query. With Qbeast, you can access a statistically representative sample of the dataset and return the result of the query within a margin of error.

Continuous Optimization

QbeastTable.forPath(spark, tmpDir).optimize()
As your table grows, Qbeast optimization dynamically adjusts to the shape and density of your data. Our unique index delivers balanced file sizes, with records grouped on the dimensions that matter for your business and use cases.

Reduce Shuffling for Joins & Aggregations

SELECT c.country,
       COUNT(DISTINCT o.customer_id) AS total_customers,
       SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY c.country;
Shuffling is often a dominant cost in analytics queries. Qbeast optimization can reduce or eliminate costly data shuffling for joins and aggregations by strategically scheduling data partition tasks to the same executors.
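Why co-located partitions remove the shuffle can be shown with a toy sketch (plain Python with tiny invented tables, not Qbeast's scheduler): when both sides of a join are hash-partitioned on the join key the same way, matching keys always land in the same partition index, so each partition pair can be joined locally with no data movement.

```python
# Toy illustration of shuffle-free joins via co-partitioning (not Qbeast's
# scheduler): hash-partition both tables on the join key, then join each
# pair of same-index partitions locally.

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

# Hypothetical mini tables mirroring the query above.
orders = [{"customer_id": c, "amount": a} for c, a in [(1, 10), (2, 20), (1, 5)]]
customers = [{"customer_id": 1, "country": "ES"}, {"customer_id": 2, "country": "FR"}]

n = 4
o_parts = hash_partition(orders, "customer_id", n)
c_parts = hash_partition(customers, "customer_id", n)

# Matching keys share a partition index, so joining partition i of orders
# with partition i of customers already covers every match.
joined = [
    (o["customer_id"], c["country"], o["amount"])
    for op, cp in zip(o_parts, c_parts)
    for o in op
    for c in cp
    if o["customer_id"] == c["customer_id"]
]
print(sorted(joined))
```

In a distributed engine, keeping partition i of both tables on the same executor turns this into a purely local join, which is the kind of scheduling Qbeast optimization exploits.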