The most advanced open-source format for data lakes
Engineering
Getting Started With Qbeast Format
Qbeast Format is an open-source Table Format based on Delta Lake that enables faster queries on Cloud Storage. It keeps all of Delta's features and only adds extra metadata to the commit log to further improve Data Skipping. Through indexing techniques, similar data is grouped together to speed up searches for particular records, and […]
Read from public S3 bucket with Spark
S3 Hadoop Compatibility: Trying to read from public Amazon S3 object storage with Spark can cause many errors related to Hadoop versions. Here are some tips to configure your Spark application. Spark Configuration: To read the S3 public bucket, you need to start a spark-shell with version 3.1.1 or later and Hadoop dependencies of 3.2. […]
Indexing and Sampling on Data Lake(house)s with Qbeast-Spark
Creating and leveraging indexes is an essential feature in DBMS and data warehouses to improve query speed. When applying filters in a SQL query, the index is first consulted to locate the target data before any read. In this way, we avoid reading the entire table from the storage, reducing the data transfer involved. For a […]
Code Formatting with Scalafmt
Whether you are starting a Scala project or collaborating on one, here is a guide to the most used frameworks for improving code style. Scalastyle and Scalafmt: Scalastyle is a handy tool for coding style in Scala, similar to what Checkstyle does in Java. Scalafmt formats code to look consistent between people […]
Qbeast format — enhanced Data Lakehouse
Pushing Data Lakehouse to a new height. Data Lakehouse and its enhanced features with Qbeast-Spark. We all like seeing complex problems solved by simple and elegant solutions. I can’t verbalize what it is, but the good feeling that comes with it is undeniable. At Qbeast, we want to tackle some of the difficulties in […]
Create awesome GIFs from a terminal: Nice-looking animations with Terminalizer
Have you ever wanted to generate cool GIFs from terminal output? Do you want fancy animations to show some code snippets? Using Terminalizer, you will be able to create fantastic animations by following this simple guide! The solution: 1. First, you need to install Node.js v12.21.0 (LTS) from https://nodejs.org/download/release/v12.21.0/. Other versions may […]
Scala Test Dive-in: Public, Private and Protected methods
We all know that testing code can be done in different ways. This pill is not meant to explain the best way to check that your Scala project works as it should, but it will provide some tips and tricks for testing public, private, and protected methods. Public Methods: Public methods are the […]
Reduce Repetitive Tasks and Development Time by Writing your own Tool in Python
For the last few weeks, I’ve been working on a CLI tool to help developers of the qbeast-spark open-source project test their changes to the code. I’ll show you how I did it using setuptools. Motivation: Some weeks ago, we at Qbeast were running tests manually, which involved several repetitive steps, which stole our developers a […]
Approximate Queries on Data Lakes with Qbeast-Spark
By Jiawei Hu. Writing procedural programs to analyze data is often less handy than taking a declarative approach, for one has to define the exact control flow of the program rather than simply stating the desired outcome. For example, it takes a lot more to write a simple groupby using Python than […]