Getting Started With Qbeast Format
Qbeast Format is an open-source Table Format based on Delta Lake that enables faster queries on Cloud Storage.
All the features of Delta are preserved; Qbeast only adds extra metadata to the commit log to further improve data skipping. Through indexing techniques, similar data is grouped together, which speeds up searches for particular records and allows the engine to push sampling down into the query.
Since it is based on widely used technologies, Qbeast Format is compatible with your favorite data tools, and it only takes a few lines of code to switch your pipelines to Qbeast tables. On this page, we will guide you through the configuration and show a few examples of how to read and write data with Qbeast Format.
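To make sampling pushdown concrete, here is a minimal Scala sketch (the table path and the 10% fraction are placeholders; the setup needed to run it is covered in the rest of this post):

// Assuming a Qbeast table already written at /tmp/qbeast_table (see the
// Write Data section below): sampling 10% of it reads only the fraction of
// files the Qbeast index needs, instead of scanning the whole table.
val preview = spark.read.format("qbeast").load("/tmp/qbeast_table")
preview.sample(0.1).count()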
Set up Apache Spark with Qbeast
Qbeast Format can be read and written with a Spark cluster. The corresponding libraries are published on the Maven Central Repository and are freely available to any end user. Here is a step-by-step guide to configuring Java, Spark, and Qbeast.
Pre-requisites: Java and Apache Spark
The Spark codebase runs on the JVM. If you are working on your local computer and want to experiment with Spark and Qbeast, make sure you install a supported version of Java (8, 11, or 17) and set $PATH and $JAVA_HOME accordingly.
You can find all the version details in the official Apache Spark documentation. If you are on macOS with an M1 chip, we recommend following the instructions in our blog post.
For Spark 3.3.0, we will install Java 8:
# macOS
> brew install openjdk@8
# Linux
> sudo apt-get install openjdk-8-jdk
1. Download Spark
Get Spark from the downloads page of the project website. It is not necessary to install any executable: just download the pre-built binary package, unpack it, and configure the proper environment variables.
# Download the Spark binary package
> wget https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
# Untar
> tar xzvf spark-3.3.0-bin-hadoop3.tgz
# Configure SPARK_HOME
> export SPARK_HOME=$PWD/spark-3.3.0-bin-hadoop3
2. Start Spark Shell
Start a spark-shell with the corresponding configuration and libraries:
> $SPARK_HOME/bin/spark-shell \
--packages io.qbeast:qbeast-spark_2.12:0.4.0,io.delta:delta-core_2.12:2.1.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
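If you prefer to set these options from application code rather than shell flags, a minimal Scala sketch of a SparkSession built with the same two settings could look like the following (the app name and local master are placeholders, and the qbeast and delta artifacts still have to be on your classpath, for example via --packages or your build tool):

import org.apache.spark.sql.SparkSession

// Build a session with the Qbeast extension and catalog configured,
// mirroring the spark-shell flags above.
val spark = SparkSession.builder()
  .appName("qbeast-quickstart") // placeholder application name
  .master("local[*]")           // placeholder master for local experiments
  .config("spark.sql.extensions",
    "io.qbeast.spark.internal.QbeastSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "io.qbeast.spark.internal.sources.catalog.QbeastCatalog")
  .getOrCreate()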
Set up PySpark
If your preferred language is Python, you can use the Qbeast data source via PySpark. You can either install PySpark with pip or use the binaries already in $SPARK_HOME.
1. Install PySpark
Install the PySpark version that matches your Spark installation:
> pip install pyspark==3.3.0 #<compatible-spark-version>
2. Launch a Pyspark shell
Launch a shell with the Qbeast libraries using the installed pyspark package:
> pyspark --packages io.qbeast:qbeast-spark_2.12:0.4.0,io.delta:delta-core_2.12:2.1.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
Or use the same $SPARK_HOME binary files:
> $SPARK_HOME/bin/pyspark \
--packages io.qbeast:qbeast-spark_2.12:0.4.0,io.delta:delta-core_2.12:2.1.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
Write Data
Once everything is set up, you can start using Qbeast Format as a data source in Spark. All the standard DataFrame APIs for reading and writing are available to manipulate data with Qbeast Format. The columnsToIndex option used below tells Qbeast which columns to organize the index on.
With Scala:
val df = Seq(
  (1, 1000371, 1.8, 15.32, "N"),
  (2, 1000372, 2.5, 22.15, "N"),
  (2, 1000373, 0.9, 9.01, "N"),
  (1, 1000374, 8.4, 42.13, "Y")
).toDF("vendor_id", "trip_id", "trip_distance", "fare_amount", "store_and_fwd_flag")

// OVERWRITE
df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "vendor_id,trip_id,trip_distance").save("/tmp/qbeast_table")

// APPEND
df.write.mode("append").format("qbeast").option("columnsToIndex", "vendor_id,trip_id,trip_distance").save("/tmp/qbeast_table")
With Python:
data = [
    (1, 1000371, 1.8, 15.32, "N"),
    (2, 1000372, 2.5, 22.15, "N"),
    (2, 1000373, 0.9, 9.01, "N"),
    (1, 1000374, 8.4, 42.13, "Y"),
]
df = spark.sparkContext.parallelize(data).toDF(["vendor_id", "trip_id", "trip_distance", "fare_amount", "store_and_fwd_flag"])

# OVERWRITE
df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "vendor_id,trip_id,trip_distance").save("/tmp/qbeast_table")

# APPEND
df.write.mode("append").format("qbeast").option("columnsToIndex", "vendor_id,trip_id,trip_distance").save("/tmp/qbeast_table")
Read Data
For reading the data, we can also use the DataFrame API. All the usual SQL operations are available, from filters to group-bys and aggregations (see the sketch at the end of this section).
With Scala:
val df = spark.read.format("qbeast").load("/tmp/qbeast_table")
df.show()
With Python:
df = spark.read.format("qbeast").load("/tmp/qbeast_table")
df.show()
You can also transform a Spark DataFrame into a pandas DataFrame by calling the toPandas method:
dfPandas = df.toPandas()
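To illustrate the filters, group-bys, and aggregations mentioned above, here is a minimal Scala sketch run against the small taxi-trip table written in the Write Data section (the 10.0 fare threshold is just an illustrative value):

import org.apache.spark.sql.functions.{avg, col, count}

// Read the table written above, keep trips with a fare over 10,
// and compute per-vendor aggregates.
val trips = spark.read.format("qbeast").load("/tmp/qbeast_table")
val expensive = trips.filter(col("fare_amount") > 10.0)
expensive.groupBy("vendor_id").agg(avg("trip_distance").as("avg_distance"), count("*").as("n_trips")).show()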
Summary
In this post, we have described the steps to install Java, Spark, and Qbeast on your local machine, and given you a taste of how easy it is to integrate Qbeast Format with other tools.
Stay tuned for a Second Part with more SQL examples and queries. 🙂
To learn more about us, you can star the project on GitHub and join our Slack channel. We are happy to assist you on your data journey!
Want to know more?
Book a call with Paola to learn how we can help your company: https://calendly.com/paolapardo/30min