
Why the industry is moving towards an open data ecosystem

November 3, 2022

Is vendor lock-in suddenly out of fashion? Looking at recent headlines, it very much seems so.

Google: “Building the most open data cloud ecosystem: Unifying data across multiple sources and platforms” 

Google announced several steps to provide the most open and extensible Data Cloud and to promote open standards and interoperability between popular data applications. Some of the most interesting steps are the following:

  • Support for major data formats in the industry, including Apache Iceberg, and soon Delta Lake and Apache Hudi.
  • A new integrated experience in BigQuery for Apache Spark, the open-source data processing engine.
  • Expanding integrations with many of the most popular enterprise data platforms to help remove barriers between data silos, give customers more choice, and prevent data lock-in.

Snowflake: “Iceberg Tables: Powering Open Standards with Snowflake Innovations” 

Snowflake recently announced Iceberg Tables, which combine Snowflake capabilities with the open-source projects Apache Iceberg and Apache Parquet to address challenges around control, cost, and interoperability. With Iceberg Tables, companies can benefit from the features and performance of Snowflake while also using open formats, tools outside of Snowflake, or their own cloud storage.

To put that into perspective: we just read announcements from two leading providers of proprietary cloud data warehouses saying that they are opening up their systems. This is remarkable because keeping customers and their data locked into their solutions is an excellent business for those providers.

Why is this happening, and why are players such as Google and Snowflake joining the movement toward an open data ecosystem?

Why we need an open data ecosystem

Digital transformation is held back by challenges that can only be tackled and solved with an open approach. For a significant share of their data use cases, proprietary warehouse solutions are not well suited; these include complex analytics and machine learning use cases such as demand forecasting or personalized recommendations. Companies also require flexibility to adjust quickly to a fast-changing environment and to take full advantage of all their data. Being dependent on the roadmap of a single provider limits the ability to innovate. If a new provider offers a solution that is ideal for your needs or complements your existing solution, you want to be able to take that opportunity. This interoperability and flexibility are only possible with open standards.

On top of that, the current macro-environment forces companies to optimize their spending on data analytics and machine learning, and costs can escalate quickly with proprietary cloud data warehouses. 

The convergence of Data Lakes and Data Warehouses 

We saw that cloud data warehouse providers are moving towards an open ecosystem, joining companies at the forefront of the movement such as Databricks and Dremio. They are pushing for the Data Lakehouse approach.

In a nutshell, the Data Lakehouse combines the advantages of data warehouses and data lakes. It is open, simple, flexible, and low-cost, and it is designed to let companies serve all their Business Intelligence and Machine Learning use cases from one system.

Open data formats

A crucial part of this approach is open data formats such as Delta Lake, Iceberg, or Hudi. These formats provide a metadata and governance layer, or, let's say, the "magic" that solves the problems of traditional data lakes. Traditional data lakes do not enforce data quality and lack governance. Users also cannot work on the same data simultaneously, and only limited metadata is available to describe the data layout, which makes data loading and analysis very slow.
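
To make that metadata layer concrete, here is a minimal sketch of what an open table format adds on top of plain files, using Delta Lake with PySpark and the delta-spark package (the path and schema are illustrative, not from any of the announcements above):

```python
# Minimal sketch, assuming PySpark with the delta-spark package installed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame([(1, "jacket"), (2, "scarf")],
                               ["order_id", "item"])

# Every write is an atomic transaction recorded in the _delta_log metadata;
# concurrent readers see either the old snapshot or the new one, never a mix.
orders.write.format("delta").mode("append").save("/tmp/demo/orders")

# The same log enforces the schema on later writes and enables time travel:
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/orders").show()
```

The transaction log is what turns a directory of files into a governed table with ACID guarantees, schema enforcement, and versioning.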


How Data Lakehouses benefit companies

Companies such as H&M and HSBC have already adopted the open Data Lakehouse approach, and many others will follow. 

H&M, for example, faced the problem that their legacy architecture couldn’t support company growth. Complex infrastructure took a toll on the Data Engineering team, and scaling was very costly. All of this led to slow time-to-market for data products and ML models. Implementing a Data Lakehouse approach, in this case with Databricks on Delta Lake, led to simplified data operations and faster ML innovations. The result was a 70% reduction in operational costs and improved strategic decisions and business forecasting.¹

HSBC, on the other hand, replaced 14 databases with Delta Lake and improved engagement in their mobile banking app by 4.5 times thanks to more efficient data analytics and data science processes.²

So, does the Data Lakehouse solve it all? Not quite; the reality is that some challenges still need to be addressed.

Pending problems

Firstly, the performance of solutions based on open formats is not yet good enough. There is a heated debate ongoing about Warehouse vs. Lakehouse performance, but I think it's fair to say that, at least in some use cases, the Lakehouse still needs to catch up. Data Warehouses are optimized for the processing and storage of structured data and are very performant in those cases, for example, when you want to identify the most profitable customer segments for the marketing team based on information collected from different sources.

Secondly, working with open formats is complex, and you need a skilled engineering team to build and maintain your data infrastructure and ensure data quality.

How Qbeast supports the open data ecosystem

At Qbeast, we embrace the open data ecosystem and want to do our part to push it forward. We developed the open-source Qbeast Format, which improves existing open data formats such as Delta Lake.

We enhance the metadata layer and use multi-dimensional indexing and efficient sampling techniques to improve performance significantly. Simply put, we organize the data smarter so it can be analyzed much faster and cheaper.
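
As an illustration, this is roughly what writing and querying an indexed table looks like with the open-source qbeast-spark connector; the dataset, columns, and path here are made up, so consult the project documentation for the exact API:

```python
# Sketch based on the open-source qbeast-spark connector; `spark` is an
# active SparkSession with the qbeast-spark package on the classpath.
sales_df = spark.createDataFrame(
    [(1, 9.99), (1, 19.99), (2, 4.99)], ["store_id", "price"]
)

# Write a table indexed on the dimensions you typically filter or sample by.
(sales_df.write.format("qbeast")
    .option("columnsToIndex", "store_id,price")
    .save("/tmp/qbeast/sales"))

# Read it back like any other Spark table...
sales = spark.read.format("qbeast").load("/tmp/qbeast/sales")

# ...but thanks to the multi-dimensional index, a sample can be served by
# reading only a fraction of the files instead of scanning the whole table.
sales.sample(0.05).groupBy("store_id").count().show()
```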

We also know that data engineering is a bottleneck for many companies. Serving the data requirements for Business Intelligence or Machine Learning use cases can be tricky. Data needs to be extracted, transformed, and served correctly. Developing and maintaining these ETL processes is a considerable challenge, especially when your engineering power is limited. At Qbeast, we built a managed solution to ensure those processes run smoothly. We handle data ingestion and transformations and ensure data quality. We make sure that the data layout is optimal for consumption so that the tools you use for BI or ML run in the most efficient way possible. This means we not only help break engineering bottlenecks but also help companies realize significant cost savings.

Because we build on open-source formats and tools, we make sure companies benefit from the latest and best tools available in the open data ecosystem.

An open data ecosystem is the future

We are extremely excited to see the industry moving towards an open data ecosystem, and we are convinced that it is the future. As Sapphire Ventures points out in their blog, the benefits for customers are clear: cost-effectiveness, scalability, choice, democratization, and flexibility. 

At Qbeast, we are dedicated to accelerating this transition and supporting an ecosystem that enables companies to pick the right tools from the best providers without worrying about compatibility and switching costs. To power true innovation.


I-BiDaaS Project with Qbeast Completed With Great Success

March 26, 2021

The Barcelona Supercomputing Center (BSC) was a member of the I-BiDaaS consortium, and Qbeast's solution took part in the I-BiDaaS platform as BSC's contribution. Qbeast's visualization tool Qviz, which will be part of the Qbeast Platform product, was used in the banking use cases (CaixaBank) conducted during the project.

Industrial-Driven Big Data as a Self-Service Solution (I-BiDaaS) was an EU-funded project that aimed to help IT and non-IT big data experts apply and collaborate with big data technologies easily. It developed and demonstrated a unified solution that significantly increases data analysis speed while coping with the pace of data asset growth, and it promotes cross-domain data flow towards a thriving data-driven EU economy.

The vision of I-BiDaaS was to shift the power balance within an organization, improving efficiency, lowering costs, generating greater employee empowerment, and increasing profitability; to create a stable environment for methodological big data exploration in order to develop new products, services, and technologies; and to build innovations that boost the productivity and competitiveness of all EU companies and organizations that deal with large, complex data sets.

The I-BiDaaS project was successfully launched in January 2018 with 13 participating organizations from 8 countries, and it ran for 36 months.

Qbeast, as part of the I-BiDaaS tool set, was tested and analyzed in the context of fraud detection, in use cases covering advanced analysis of bank transfer payments in financial terminals and enhanced monitoring of online banking customers. Qbeast was credited with a 30% reduction in data processing time and a potential cost reduction in licenses for commercial data analytics solutions.

CaixaBank concluded: “The most important conclusion of the use case was the ability to perform big data clustering analytics in a very agile way, based on existing or custom-tailored clustering algorithms.”

I-BiDaaS tools were validated for the full cycle of big data processing, as a self-service for non-IT and intermediate users, with advanced users able to customize their big-data analysis.

“One of the key gains Qbeast has obtained from the I-BiDaaS project is clearly the close contact we have had with the industry. Having CaixaBank, Telefónica I+D and Centro Ricerche Fiat in the project, and being able to work with them so closely, has had a paramount impact on how Qbeast is now, and how Qbeast will be shaped in the future.” said Raül Sirvent, Principal Investigator for BSC during the I-BiDaaS project and Senior Researcher at the Department of Computer Science, BSC.

For further information on I-BiDaaS, please visit the I-BiDaaS website.


Qbeast taking part in IESE’s MBA BTTG program

March 01, 2021

Qbeast was chosen to participate in the Barcelona Technology Transfer Group (BTTG) program, an IESE Business School initiative introduced by MBA students in 2016. According to The Economist’s WhichMBA? 2021 Full-Time List, IESE’s MBA program has been ranked as the best MBA program in the world.

BTTG’s objective is to create a platform that connects inventors from research and development labs with the business acumen and talent they need in order to bring their product successfully to market.

The goal of the program is to promote the market entry of creative new technologies with the greatest potential for sustainable growth and, therefore, a positive impact on society and technological development. It also gives MBA students first-hand experience of working in a start-up environment, which offers invaluable training and facilitates relationships between IESE, its MBA students, and the local startup ecosystem.

The program is a three-month collaboration between Qbeast and an IESE-appointed team: a group of five MBA students who provide the company with quality work and a stream of ideas, and a project leader responsible for task planning and project management. The students partner with the Qbeast team to assist with business strategy, define market fit and segmentation, and determine suitable commercialization models and sales channels.

“The idea: put scientists and MBA students in a room together, and see what happens. The answer has been that a lot happens.” said Luca Venza, founder of the BTTG and Director of Technology Innovation, Transfer and Acceleration, IESE. “The program is proof that business and science can coexist successfully, if the conditions are right and the mutual respect is there.”

For further information on BTTG, please visit the BTTG website.


How Qbeast solves the pain chain of Big Data analytics

Are you ready to find out how speeding up data analysis by up to 100x solves data teams’ pain points?

Well, first let me give you some background information. According to a survey conducted by Ascend.io and published in July 2020, 97% of data teams are at or above work capacity.¹ Given that more and more data is generated and stored every day, this is not good news for data teams and organizations. Yet the capability to leverage data in business has never been more critical.

The pain chain

The survey states that the ability to meet data needs is significantly impacted by slow iteration cycles in data teams. This aligns with the feedback that we received from our customers’ data teams as well.

To explain why iteration cycles are slow, let’s use the concept of the pain chain. The pain chain was first introduced by Keith M. Eades and is a map that describes a sequence of problems in an organization’s processes.² The pain of one role in the company causes the pain of another. In our case, the data pain chain starts with the Data Engineer, continues with the Data Scientist, and finally reaches the decision-makers. To keep in mind: the data engineer is the one who prepares the data; the data scientist uses this data to create valuable and actionable insights; and the decision-maker is, for example, a project manager who wants to get a data-driven project done.

The survey found that data scientists are the most impacted by the dependency on others, such as data engineers, to access the data and the systems (48%). On the other hand, data engineers spend most of their time maintaining existing and legacy systems (54%).

How does this impact the decision-maker? Well, it leads to a significant loss of value due to delayed implementation of data products or because they cannot be implemented at all.

How we solve it

Qbeast’s solution tackles the pain chain on several fronts to eliminate it altogether.

Front 1: Data Engineering

There is nothing more time-consuming and nerve-racking than building and maintaining complex ETL pipelines.

Less complexity and more flexibility with an innovative storage architecture

Can’t we just work without ETL pipelines? You may say yes, we can use a data lake instead of a data warehouse. We can keep all the data in the data lake and query it directly from there. The downside? Querying is slow and processing all the data is expensive. But what if you could query all the data directly without sacrificing speed and cost?

With Qbeast, you can store all the data in your data lake. We organize the data so that you can find exactly what you are looking for. Even better, we can answer queries by reading only a small sample of the dataset. And you can use your favorite programming language, be it Scala, Java, Python, or R.

How do we do this? With our storage technology, we combine multidimensional indexing and statistical sampling. Check out this scientific paper³ to find out more.

Our technology’s advantage is that we can offer superior query speed compared to data warehouses while keeping the flexibility of data lakes. No ETL pipelines, yet fast and cost-effective. The best of both worlds, so to speak.

Front 2: Data Science

We know that if you are a data scientist, you do not care so much about pipelines. You want to get all the data you need to tune your model. And it is a pain to rely on a data engineer every time you need to query a large dataset. You are losing time, and you can’t focus on the things that matter. But what if you could decide the time required to run your query yourself?

Data Leverage

By analyzing the data with a tolerance limit, you can decide how long to wait for a query and adjust the precision to your use case. Yes, this means you can choose the precision of every query. Do you want to know the number of sales in the last months? Full precision! But do you really need to scan your whole data lake to see the percentage of male users? Probably not.

With Qbeast, you can get the results you need while accessing only a minimum amount of available data. We call this concept Data Leverage. With this option, you can speed up queries by up to 100x compared to running state-of-the-art query engines such as Apache Spark.
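
To sketch the Data Leverage idea in code (the table and column names below are made up for illustration), this is the trade-off in PySpark terms:

```python
# Illustrative sketch of Data Leverage; table and column names are made up.
from pyspark.sql import functions as F

users = spark.read.format("qbeast").load("/tmp/qbeast/users")

# Full precision where it matters: an exact count of last month's buyers.
exact_buyers = users.filter(F.col("purchased_last_month")).count()

# Approximate where it doesn't: estimate the share of male users from a
# 1% sample. A fixed seed keeps the sample stable across the two actions.
# On a Qbeast-formatted table this sample is served by reading only a
# fraction of the files; on plain Parquet, Spark would still scan everything.
sample = users.sample(fraction=0.01, seed=42)
male_share = sample.filter(F.col("gender") == "M").count() / sample.count()
print(f"exact buyers: {exact_buyers}, approx. male share: {male_share:.1%}")
```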

Conclusion

A storage system that unites multidimensional indexing techniques and statistical sampling solves the data analytics pain chain by speeding up queries, reducing complexity, and adding flexibility. This results in a significant speed-up of iteration cycles in data teams. Increased productivity and faster data analysis have a colossal impact on the ability to meet data needs and to create superior data products. And above all, alleviating the pain chain results in happy data teams, decision-makers, and customers.

But the pain chain doesn’t end here! Now it is time for the application developers to pick up all the insights uncovered by the data scientists and use them to build amazing products! That’s a topic for another post, but I bet you have guessed it: we have a solution for that too.

References

1. Team Ascend. “New Research Reveals 97% of Data Teams Are at or Over Capacity.” Ascend.io, 23 July 2020. Accessed 28 December 2020.

2. Eades, Keith M., The New Solution Selling: The Revolutionary Sales Process That is Changing the Way People Sell, McGraw-Hill, 2004.

3. C. Cugnasco et al., “The OTree: Multidimensional Indexing with efficient data Sampling for HPC,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 433–440, doi: 10.1109/BigData47090.2019.9006121.

