The most advanced open-source format
for data lakes

ACID properties
Multi-column index
Efficient sampling
Resource saving

How we organize data

The OTree: Quadtree + Sampling

Index Metadata

Based on the Delta Lake format, Qbeast adds the metadata needed to query efficiently.
We organize the data into what we call “cubes”. Each cube’s elements are written to a single Parquet file, allowing the query engine to filter out files before reading their contents.
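To make the idea concrete, here is a toy sketch (not the actual Qbeast implementation) of how a two-dimensional quadtree assigns a point to a cube: each level halves every dimension, and a cube id is the path of quadrant choices from the root down.

```python
# Toy sketch, illustrative only: assign a 2-D point in [0,1) x [0,1)
# to a quadtree "cube" at a given depth.

def cube_id(point, depth):
    """Return the cube id as the tuple of quadrant indices per level."""
    x, y = point
    path = []
    for _ in range(depth):
        # Which half of each dimension does the point fall in?
        cx, cy = int(x >= 0.5), int(y >= 0.5)
        path.append(2 * cy + cx)
        # Zoom into the chosen quadrant and renormalize to [0,1).
        x = x * 2 - cx
        y = y * 2 - cy
    return tuple(path)

# Points that are close in space share a cube, so their rows land
# in the same Parquet file.
print(cube_id((0.1, 0.2), depth=2))  # (0, 0)
print(cube_id((0.9, 0.8), depth=2))  # (3, 3)
```

Because a query region maps to a small set of cubes, the engine can skip every file belonging to cubes outside that region.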

Apache Spark Integration

Write on your favourite object storage
  • df.write.format("qbeast").option("columnsToIndex", "col1,col2").save("your-storage-path")
  • (df.write.format("qbeast").option("columnsToIndex", "col1,col2").save("your-storage-path"))
Load the data onto a Spark DataFrame
  • val qbeastDf = spark.read.format("qbeast").load("your-storage-path")
  • (qbeastDf = spark.read.load("your-storage-path", format="qbeast"))
And query with sampling
  • qbeastDf.sample(0.1).show
  • qbeastDf.sample(0.1).show()
  • qbeastDf.createOrReplaceTempView("qbeast_table")
    spark.sql("SELECT * FROM qbeast_table TABLESAMPLE(1 PERCENT)")
  • SELECT * FROM customers
    WHERE age > 20 AND city = 'Barcelona'

    Using an index helps avoid reading the entire dataset, reducing data transfer and speeding up the query. The Qbeast Format lets you index your data on as many columns as you need and filter the files directly to answer the search.
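A minimal sketch of this file-level pruning, using made-up file names and per-file column statistics (this is not Qbeast code, just the general technique):

```python
# Hypothetical per-file statistics, as a query engine might keep them.
files = [
    {"path": "part-0.parquet", "min_age": 18, "max_age": 25, "cities": {"Madrid"}},
    {"path": "part-1.parquet", "min_age": 30, "max_age": 60, "cities": {"Barcelona"}},
    {"path": "part-2.parquet", "min_age": 21, "max_age": 45, "cities": {"Barcelona"}},
]

# Query: WHERE age > 20 AND city = 'Barcelona'
# A file can be skipped entirely if its stats prove no row can match.
matching = [f["path"] for f in files
            if f["max_age"] > 20 and "Barcelona" in f["cities"]]
print(matching)  # ['part-1.parquet', 'part-2.parquet']
```

Only the surviving files are opened and scanned; indexing on multiple columns makes this pruning effective for predicates on any of them.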

  • SELECT avg(age) FROM customers TABLESAMPLE(1 PERCENT)
    WHERE city = 'Barcelona'

    Qbeast enables approximate queries: answering a query approximately at a fraction of the cost of executing it exactly. With the Qbeast Format, you can access a statistically representative sample of the dataset and return the result of the query within a margin of error.
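As a hedged sketch of the general idea (the real Qbeast mechanism differs in detail), records can be tagged with random weights at write time; a sample of fraction f then only needs the records, and hence the files, whose weights fall below f:

```python
import random

# Illustrative weight-based sampling, simplified from the approach the
# text describes. Data and threshold are made up.
random.seed(42)
records = [{"id": i, "weight": random.random()} for i in range(1000)]

def sample(records, fraction):
    # Keeping only records under the threshold yields a uniform sample,
    # and files whose minimum weight exceeds it are never read at all.
    return [r for r in records if r["weight"] < fraction]

s = sample(records, 0.1)
print(len(s))  # close to 100 in expectation
```

Because the sample is uniform, an aggregate like avg(age) over it estimates the true value within a predictable margin of error.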

  • QbeastTable.forPath(spark, tmpDir).optimize()

    When writing new data, some areas of the index can start to overflow. Overflowed cubes mean larger files must be read at query time, much of their content irrelevant to the result. Optimization replicates the records from the most used nodes, improving the useful payload of each query by producing more fine-grained files.

  • QbeastTable.forPath(spark, tmpDir).compact()

    Under heavy write workloads, we usually end up appending small batches of data to the table. This results in many small files on the data lake, which can hurt performance. To address this, we implement the Compact operation, based on Delta Lake’s Optimize, which arranges the small files and compacts them into a single bigger one.
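A toy compaction planner illustrating the idea (this is not Delta Lake or Qbeast code): many small files are grouped into bins so that each compacted output approaches a target size.

```python
# Illustrative only: group small files into compaction bins.
def plan_compaction(file_sizes_mb, target_mb=128):
    """Return lists of file sizes to merge, each bin near target_mb."""
    bins, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        # Start a new bin when adding this file would overshoot the target.
        if total + size > target_mb and current:
            bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

print(plan_compaction([5, 10, 3, 120, 7, 60]))  # [[3, 5, 7, 10, 60], [120]]
```

Each bin is then rewritten as one larger file, so later queries open far fewer file handles for the same data.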

Get started with the Qbeast Format on GitHub!

© 2020 Qbeast
Design by Xurris