Success Case: How Qbeast Overcame Key Challenges of Cybersecurity Analytics
June 8, 2023
Use case: Data Analytics
Qbeast solution: Qbeast Format + Cloud Development
Download the One Pager here
Cybersecurity is a never-ending race to catch up with the latest threats. The only way to stay ahead of attackers is to be as data-driven as possible and to keep innovating in order to keep pace with the most performant technologies available.
However, as the amount of data grows, so do the challenges security teams have to face:
- Data Layout: how to store massive volumes of data minimizing the friction of maintenance.
- Data Quality: Ensuring the accuracy, completeness, consistency, and timeliness of data is crucial for reliable analysis.
- Data Security and Privacy: Protecting sensitive data from unauthorized access, breaches, or loss is paramount for organizations.
- Data Skills: High demand for these skills makes it challenging for organizations to attract and retain the right talent.
Our clients felt the pain of these challenges in their day-to-day work and contacted Qbeast to help them address these problems. In the following sections, we explain how Qbeast technology overcomes each of these situations while minimizing operating costs.
Data Layout: organize and keep data in shape.
For a cybersecurity company, analyzing the information at any point in time on a particular webpage is critical to respond faster to events. But, due to the high volume of data, it is like searching for a needle in a haystack. Qbeast Format’s premise is to organize the data more efficiently to retrieve results faster.
To address our clients' priority of studying the data, we want to group the records by Domain (web page URL).
In traditional partitioning systems, we would store separate files for a single day, each containing a wide range of domains. Loading all these files could overload the application or take hours to complete, because the records are not grouped in an optimal way for analysis.
With Qbeast Format, we will have one single file for May 9 and the domain "yourdomain". If users want to visualize the potential attacks on "yourdomain" over the last day, they only need to read a single file.
As a dataset grows and new files are added, a process needs to run to keep the Data Lake in optimal shape: compaction. This operation runs as a side process and must be scheduled and configured every time performance slows down.
One of Qbeast's specialties is keeping query time constant as the dataset grows. With other formats, the data always needs to be reorganized or repartitioned, whereas Qbeast can maintain the ordering for much longer while writing new information. If the data distribution changes or many small files are appended, the layout is optimized automatically by detecting bad access patterns.
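A toy heuristic (not Qbeast's actual detector, and the thresholds are made up) shows the kind of signal an automatic optimizer can watch for, such as a growing share of small files after many appends:

```python
# Toy heuristic (not Qbeast's actual logic): flag a table for re-optimization
# when the share of small files exceeds a threshold.
def needs_optimization(file_sizes_mb, small_file_mb=16, max_small_ratio=0.3):
    """Return True when too many files fall below the small-file cutoff."""
    if not file_sizes_mb:
        return False
    small = sum(1 for s in file_sizes_mb if s < small_file_mb)
    return small / len(file_sizes_mb) > max_small_ratio

# Healthy layout: mostly large files.
print(needs_optimization([128, 256, 64, 128, 8]))      # False (1/5 = 20% small)
# After many small appends: trigger an optimization pass.
print(needs_optimization([128, 8, 4, 2, 8, 1, 128]))   # True (5/7 ≈ 71% small)
```

In a real system this check would run as part of the write path or a background monitor, so the user never has to schedule compaction by hand.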
Quality: data is clean and ready at any time
When we talk about Data Quality, we mean that the data is understandable and ready to use at any time by any agent.
In the case of our client, we store different tables that serve different scopes, and we keep them updated day by day. These tables are registered in the Glue Catalog and can be queried seamlessly through Athena, Power BI, Databricks, Snowflake, or Google BigQuery, among others.
For manipulating tables and performing quality checks, we use dbt (data build tool). It is a popular transformation technology that lets you transform data using simple SELECT statements, effectively expressing your entire transformation process as code. You can write custom business logic in SQL, automate data-quality testing, deploy the code, and deliver trusted data with documentation side by side.
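dbt expresses these tests as SQL and YAML against the warehouse; purely for illustration, the same kinds of checks it automates (not-null, uniqueness, accepted values) can be sketched in plain Python:

```python
# Illustrative data-quality checks in plain Python; in practice dbt runs the
# equivalent tests (not_null, unique, accepted_values) as SQL against the table.
def check_not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_accepted_values(rows, column, allowed):
    return all(r[column] in allowed for r in rows)

# Hypothetical security-event records.
events = [
    {"event_id": 1, "domain": "yourdomain", "severity": "high"},
    {"event_id": 2, "domain": "yourdomain", "severity": "low"},
]

print(check_not_null(events, "domain"))                            # True
print(check_unique(events, "event_id"))                            # True
print(check_accepted_values(events, "severity", {"low", "high"}))  # True
```

Running such checks on every refresh is what keeps the tables "ready at any time" rather than clean only at load time.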
Security and Privacy: keep data protected from possible attackers.
We ensure that all resources assigned to each user are contained within their individual spaces, preventing any overlapping or information leaks between users.
Qbeast restricts access to the infrastructure to only the DevOps engineers who need it to keep the system running smoothly. As part of our commitment to stronger security, we are adopting the industry's best practices and are in the process of passing SOC 2 audits.
The customer integration mechanism and the internal best practices help ensure that access to sensitive data is secure and protected.
Skills: simplify engineers' lives.
Qbeast Format makes it possible to work on samples. A sample is just a fraction of the data: 1%, 10%, 20%… This percentage gives the user an overview of the stored information and lets them navigate query results much faster.
Going back to our client's use case, we have seen that we store the information for Date (May 9) and Domain ("yourdomain") in one single file. Qbeast also writes a representative percentage of the data in this single file.
In the same way that files are filtered for an equality search, the format can read only those files that contain the necessary percentage. This makes it possible to build, test, and publish data pipelines at high speed using just a fraction of the information.
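A simplified sketch of how percentage-based file filtering can work (the weight-range mechanism below is illustrative, not Qbeast's exact implementation): each record gets a random weight at write time, each file records the weight range it holds, and a sample of fraction f only touches files whose range overlaps [0, f):

```python
import random

# Illustrative weight-based sampling: every record is assigned a random
# weight in [0, 1) at write time, and files are organized by weight range.
random.seed(7)
records = [{"id": i, "weight": random.random()} for i in range(10_000)]

# Pretend the writer split the records into 10 files by weight range.
files = {(lo / 10, (lo + 1) / 10):
         [r for r in records if lo / 10 <= r["weight"] < (lo + 1) / 10]
         for lo in range(10)}

def read_sample(files, fraction):
    """Read only the files whose weight range intersects [0, fraction)."""
    out = []
    for (lo, hi), rows in files.items():
        if lo < fraction:  # this file may contain sampled records
            out.extend(r for r in rows if r["weight"] < fraction)
    return out

sample = read_sample(files, 0.1)
files_touched = sum(1 for (lo, _) in files if lo < 0.1)
print(files_touched)               # 1 of 10 files read
print(len(sample) / len(records))  # ≈ 0.1 of the records
```

A 10% sample here reads one file out of ten, which is why developing pipelines on samples is so much cheaper than scanning the full table.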
At the end of the day, companies want to keep innovating without compromising their budgets. All the features that Qbeast brings to the Data Lake (layout, quality, security, and simplification) are designed to be cost-efficient.
In the case of our client, they run three types of workloads on the data.
- Unique domain search.
- Time and domain filtering.
- Developing on samples.
For the first workload, given the high volume of data, we ran a test on part of the dataset, written both in Qbeast and in Delta. The results show that Qbeast reduces the client's cloud cost by 98% compared to Delta.
Qbeast vs Delta
For the second and third workloads, we use the Amazon pricing page and the timings from the first workload to estimate the approximate cost of each workload, multiplying the cost of loading a file by the number of files that need to be read. The results show up to a 97% cost reduction.
Qbeast vs All Data
Annex: Cost Calculation
- Workers: 10x r6g.4xlarge SPOT
- Workers: 2x r6g.4xlarge ON DEMAND
- Master: 1x r6g.2xlarge ON DEMAND
Costs & time:
- LIST cost: 0.000005 USD x request
- GET cost: 0.0000004 USD x request
- Cluster running cost per hour = 1*0.4512 + 2*0.9024 + 10*0.396 = 6.216 USD/hour
- Cluster running cost per minute = 6.216 USD/hour / 60 minutes = 0.1036 USD/minute
- Read throughput: 131 files / 60 seconds ≈ 2.18 files per second (≈ 0.46 seconds per file)
Delta vs Qbeast vs Parquet matrix:
| Format | Time | Files read | Total size (GB) | Request cost ($) | Total cost ($, compute + requests) |
|---|---|---|---|---|---|
| Qbeast Format | 1 min | 131 | 59.5 | 1 LIST + 131 GET ≈ 0.0001 USD | 6.216 × 1/60 + 0.0001 = 0.1037 USD |
| Delta Lake | 38 min | 1,297 | 1,174.6 | 1 LIST + 1,297 GET ≈ 0.0005 USD | 6.216 × 38/60 + 0.0005 = 3.9373 USD |
| RAW Parquet | – | 53,161,050 | 7,353 | 530K LIST (2.658 USD) + 53M GET (21.2644 USD) = 23.92 USD | – |
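The totals in the matrix follow directly from the unit prices listed above; a quick check (workload times and file counts taken from the table):

```python
# Recompute the cost matrix from the unit prices in the annex.
LIST_COST = 0.000005          # USD per LIST request
GET_COST = 0.0000004          # USD per GET request
CLUSTER_PER_MIN = 6.216 / 60  # USD per minute of cluster time (~0.1036)

def total_cost(minutes, list_requests, get_requests):
    """Compute cost = cluster time + S3 request charges."""
    requests = list_requests * LIST_COST + get_requests * GET_COST
    return minutes * CLUSTER_PER_MIN + requests

qbeast = total_cost(minutes=1, list_requests=1, get_requests=131)
delta = total_cost(minutes=38, list_requests=1, get_requests=1_297)

print(round(qbeast, 4))  # 0.1037
print(round(delta, 4))   # 3.9373
```

Request charges are negligible next to compute: almost the entire 38x cost gap between Delta and Qbeast comes from cluster minutes, i.e. from how many files must be scanned.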
Want to know more?
Book a call with Paola to know how we can help your company: https://calendly.com/paolapardo/30min