The most advanced open-source format for data lakes

ACID Properties
Multi-column index
Efficient sampling
Resource saving

Works in any Data Lake

Store the data in the Object Storage of your preference.

Compatible with all BI tools

Plug in your visualizer and get faster insights.

How we organize data

Qbeast Metadata

Based on the Delta Lake format, Qbeast adds the information necessary to query efficiently.

We organize the data in what we call “cubes”. Each cube’s elements are written in a single parquet file, allowing the query engine to filter out some of them before reading their content.

Data Skipping

Using an index helps avoid reading the entire dataset, reducing the amount of data transferred and speeding up the query. The Qbeast Format lets you index your data on as many columns as you need and filter files directly to answer the search.
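As a sketch, this is what a multi-column filter looks like from Spark (the paths and column names here are hypothetical, and the snippet assumes the table was written with those columns indexed):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical path; assumes the table was written with
// .option("columnsToIndex", "age,city")
val customers = spark.read
  .format("qbeast")
  .load("s3://my-bucket/customers")

// Both predicates hit indexed columns, so Qbeast can discard whole
// Parquet files before reading their contents.
customers
  .filter(col("age") > 20 && col("city") === "Barcelona")
  .show()
```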

Approximate Queries

Qbeast enables approximate queries: the ability to provide approximate answers at a fraction of the cost of executing the full query. With the Qbeast Format, you can access a statistically representative sample of the dataset and return the query result within a margin of error.
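From the DataFrame API, this corresponds to a plain `sample()` call (a sketch with a hypothetical path; how the sampling fraction is pushed down to the index depends on the qbeast-spark version):

```scala
import org.apache.spark.sql.functions.avg

// Hypothetical path; sample() reads roughly 1% of the data by
// selecting whole files from the index, rather than reading every
// file and discarding 99% of its rows.
val customers = spark.read
  .format("qbeast")
  .load("s3://my-bucket/customers")

customers
  .sample(0.01)
  .agg(avg("age"))
  .show()
```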

File Optimization

When writing new data, the file layout can degrade, producing many small files or excessively large ones and making it hard to retrieve results with as little noise as possible. Optimization fixes the overflowed areas and improves each query's useful payload by reading more fine-grained files.

Easy to Deploy

It works with any Data Lake storage (S3, Azure, etc.) and is compatible with any BI/ML tool of your choice. It takes only 10 minutes to deploy and start enjoying the benefits of querying Qbeast tables.

Getting started

val qbeast_df =
   spark
     .read
     .format("qbeast")
     .load("s3://my-bucket/my-qbeast-table")
val df = spark.read.format("csv").load(srcPath)

df.write
	.mode("overwrite")
	.format("qbeast")
	.option("columnsToIndex", "user_id,product_id")
	.save(destPath)
CREATE TABLE purchases (id INT, user_id INT, product_id STRING)
    USING qbeast OPTIONS ('columnsToIndex'='user_id,product_id');

INSERT INTO TABLE purchases SELECT id, user_id, product_id FROM raw_purchases;
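Once the table is created, standard queries on the indexed columns benefit from the file skipping described above; a hypothetical follow-up from Scala (`user_id = 42` is just an illustrative value):

```scala
// user_id is one of the indexed columns, so this filter can skip
// files whose cubes do not cover user_id = 42.
spark.sql(
  """SELECT product_id, count(*) AS n_purchases
    |FROM purchases
    |WHERE user_id = 42
    |GROUP BY product_id""".stripMargin
).show()
```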

Examples

Multicolumn Filtering

SELECT * FROM customers WHERE age > 20 AND city = 'Barcelona'

Using an index helps avoid reading the entire dataset, reducing the amount of data transferred and speeding up the query. The Qbeast Format lets you index your data on as many columns as you need and filter files directly to answer the search.

Approximate Queries

SELECT avg(age) FROM customers TABLESAMPLE (1 PERCENT) WHERE city = 'Barcelona'

Qbeast enables approximate queries: the ability to provide approximate answers at a fraction of the cost of executing the full query. With the Qbeast Format, you can access a statistically representative sample of the dataset and return the query result within a margin of error.

Optimization

QbeastTable.forPath(spark, tmpDir).optimize()

When writing new data, some areas of the index can start to overflow. Overflowed cubes mean larger files must be read at query time, most of them containing information that is not relevant to the result. Optimization replicates the records from the most used nodes and improves each query's useful payload by reading more fine-grained files.

Get started with the Qbeast Format on GitHub!