Data Warehouses Meet Data Lakes
For many years, there has been a heated debate on how to structure the data that powers data-driven applications. While data warehouses offer simplicity through managed services, they come with inherent rigidity. In contrast, data lakes promote open formats, protocols, and interoperability but introduce the complexity of abundant choices. The former is tailored to house curated, finely structured data, while the latter is designed to store all types of information.
Data warehouses adopt open table formats
Snowflake has long supported reading files from data lakes. In 2022, it added support for both Apache Iceberg and Delta Lake, a strategic move that underscores its commitment to robust data management. Now, the company is releasing managed tables on data lakes stored with Apache Iceberg. The new functionality allows querying a data lake from Snowflake with better performance than external tables, thanks to caching and specific optimizations. Customers can update the data lake with an external tool and still get Snowflake query performance as good as if the data were stored inside Snowflake. By betting on Apache Iceberg, Snowflake distances itself from Databricks, which builds its technology around Delta Lake.
Similarly, Google is trying to bridge the gap between BigQuery and data lakes with its BigQuery Omni and BigLake offerings. The goal is to allow querying a data lake from BigQuery with acceptable performance, even in multi-cloud setups.
This strategic shift towards supporting data lakes is driven by customers’ needs. Running data science, machine learning, and AI workloads inside a data warehouse is costly and comes with limitations. Therefore, industry leaders are making their platforms more flexible and adaptable.
Given the adoption of open formats by data warehouses, Delta Lake 3.0 introduced UniForm so that Delta tables remain compatible with any system capable of reading Apache Iceberg or Apache Hudi.
The new use cases driving data lakes adoption
If there’s one constant in the data world, it’s this: the data is growing. Companies rooted in data warehouses are increasingly exploring data lakes to support applications like fraud detection, recommendation systems, and IoT monitoring. The success of these applications hinges on the storage system’s capacity to handle vast data volumes. Accessing the same data repeatedly using diverse tools can turn expensive in data warehouses. Consequently, there’s a growing inclination to build a data lake on an open table format, mitigating these costs and democratizing internal access to data.
Given that open table formats empower all data use cases, it is crucial to make an informed decision when choosing the technology, and also to consider who is going to keep the dataset well structured.
As data goes into the data lake, it needs to be ingested in a timely manner, but most importantly, it needs to be kept organized. New small files need to be compacted into bigger ones, the dataset might need to be reorganized so that queries can skip files that do not hold relevant data, old files and metadata need to be cleaned up, and the list goes on. That’s when a company like Qbeast can help by managing all these tasks and keeping the data in an optimal state. By storing the data in Qbeast Format, the data is organized by similarity and with incremental samples, making filtering more efficient from any data engine, be it a data warehouse or a data lake processing engine!
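To make the compaction task a bit more concrete, here is a minimal sketch of how a maintenance job might decide which small files to merge. It is only an illustration under stated assumptions, not how Qbeast or any particular table format implements it: the `DataFile` type, the `plan_compaction` function, and the `small_ratio` parameter are all hypothetical names, and the grouping is a plain greedy bin-packing heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical file path inside the data lake
    size_bytes: int  # size of the file on storage


def plan_compaction(
    files: List[DataFile],
    target_bytes: int,
    small_ratio: float = 0.5,  # assumption: "small" means below half the target
) -> List[List[DataFile]]:
    """Group small files into batches whose total size fits in target_bytes."""
    threshold = target_bytes * small_ratio
    # Only files below the threshold are candidates; place the largest first.
    small = sorted(
        (f for f in files if f.size_bytes < threshold),
        key=lambda f: f.size_bytes,
        reverse=True,
    )

    batches: List[List[DataFile]] = []
    for f in small:
        # First-fit: put the file in the first batch that still has room.
        for batch in batches:
            if sum(x.size_bytes for x in batch) + f.size_bytes <= target_bytes:
                batch.append(f)
                break
        else:
            batches.append([f])

    # Rewriting a single file on its own gains nothing, so drop those batches.
    return [b for b in batches if len(b) > 1]


if __name__ == "__main__":
    MB = 1024 * 1024
    files = [
        DataFile("part-0001.parquet", 10 * MB),
        DataFile("part-0002.parquet", 20 * MB),
        DataFile("part-0003.parquet", 30 * MB),
        DataFile("part-0004.parquet", 200 * MB),  # already large, left alone
    ]
    for batch in plan_compaction(files, target_bytes=64 * MB):
        print([f.path for f in batch])
```

Each returned batch would then be rewritten as one bigger file and the originals retired through the table format's own commit mechanism. The sketch deliberately leaves that step out, because it depends on the format and engine in use.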
The Blurring Lines
Here are my two cents from Big Data London 2023: data warehouses and data lakes might seem like competing siblings, but they’re starting to play nice. Snowflake’s managed Iceberg tables exemplify this, allowing reverse ETL from Snowflake even as new data populates the data lake. Databricks is pushing its managed, albeit proprietary, SQL Serverless endpoint for analytics, while Microsoft’s Fabric emerges as a one-stop solution for data lakes built on Delta Lake. In this complex ecosystem, Qbeast ingests the data and optimizes the data layout to make all of these players more efficient!
Want to learn more? Book a call with us!