Are you ready to find out how speeding up data analysis by up to 100x solves data teams’ pain points?
First, some background. According to a survey conducted by Ascend.io and published in July 2020, 97% of data teams are at or over capacity.¹ With more and more data generated and stored every day, this is bad news for data teams and organizations alike. Yet the ability to leverage data in business has never been more critical.
The pain chain
The survey states that the ability to meet data needs is significantly impacted by slow iteration cycles in data teams. This aligns with the feedback that we received from our customers’ data teams as well.
To explain why iteration cycles are slow, let’s use the concept of the pain chain. The pain chain was first introduced by Keith M. Eades and maps a sequence of problems in an organization’s processes.² The pain of one role in the company causes pain for another. In our case, the data pain chain starts with the data engineer, continues with the data scientist, and finally reaches the decision-makers. As a reminder, the data engineer is the one who prepares the data. The data scientist uses this data to create valuable and actionable insights. And the decision-maker is, for example, a project manager who wants to get a data-driven project done.
The survey found that data scientists are the most impacted by the dependency on others, such as data engineers, to access the data and the systems (48%). On the other hand, data engineers spend most of their time maintaining existing and legacy systems (54%).
How does this impact the decision-maker? Well, it leads to a significant loss of value due to delayed implementation of data products or because they cannot be implemented at all.
How do we solve it?
Qbeast’s solution tackles the pain chain on several fronts to eliminate it altogether.
Front 1: Data Engineering
There is nothing more time-consuming and nerve-racking than building and maintaining complex ETL pipelines.
Less complexity and more flexibility with an innovative storage architecture
Can’t we just work without ETL pipelines? You may say yes: we can use a data lake instead of a data warehouse, keep all the data in the data lake, and query it directly from there. The downside? Querying is slow and processing all the data is expensive. But what if you could query all the data directly without sacrificing speed and cost?
With Qbeast, you can store all the data in your data lake. We organize the data so that you can find exactly what you are looking for. Even better, we can answer queries by reading only a small sample of the dataset. And you can use your favorite programming language, be it Scala, Java, Python, or R.
How do we do this? With our storage technology, we combine multidimensional indexing and statistical sampling. Check out this scientific paper³ to find out more.
Our technology’s advantage is that we can offer faster queries than data warehouses while keeping the flexibility of data lakes. No ETL pipelines, yet fast and cost-effective. The best of both worlds, so to speak.
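The combination of multidimensional indexing and statistical sampling can be illustrated with a toy example. The sketch below is not Qbeast’s implementation and not the OTree from the paper. It assumes a simple uniform 2D grid whose cells store their rows in shuffled order, so that reading a prefix of a cell yields a random sample of it. `SampledGridIndex` and all of its methods and parameters are hypothetical names made up for illustration.

```python
import random
from collections import defaultdict


class SampledGridIndex:
    """Toy 2D grid index (illustrative only, not Qbeast's format)."""

    def __init__(self, cell_size, seed=0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        self.rng = random.Random(seed)

    def _cell(self, x, y):
        # Map a point to the grid cell that contains it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, value):
        self.cells[self._cell(x, y)].append((x, y, value))

    def finalize(self):
        # Shuffle each cell once so any prefix is a uniform random sample of it.
        for rows in self.cells.values():
            self.rng.shuffle(rows)

    def query(self, x_min, x_max, y_min, y_max, fraction=1.0):
        """Return (matching rows, rows read), reading only the first
        `fraction` of every cell that overlaps the query rectangle."""
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        result, rows_read = [], 0
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                rows = self.cells.get((cx, cy), [])
                if not rows:
                    continue
                prefix = rows[: max(1, int(len(rows) * fraction))]
                rows_read += len(prefix)
                result.extend(
                    r for r in prefix
                    if x_min <= r[0] <= x_max and y_min <= r[1] <= y_max
                )
        return result, rows_read


# Usage: compare a full read with a 10% sample on synthetic data.
data_rng = random.Random(42)
index = SampledGridIndex(cell_size=10)
for _ in range(100_000):
    index.insert(data_rng.uniform(0, 100), data_rng.uniform(0, 100), data_rng.random())
index.finalize()

full, read_full = index.query(20, 40, 20, 40, fraction=1.0)
sample, read_sample = index.query(20, 40, 20, 40, fraction=0.1)

full_mean = sum(r[2] for r in full) / len(full)
sample_mean = sum(r[2] for r in sample) / len(sample)
print(f"full read:   {read_full} rows, mean value {full_mean:.3f}")
print(f"10% sample:  {read_sample} rows, mean value {sample_mean:.3f}")
```

In this sketch the index narrows the read to the cells that overlap the query, and the shuffled layout lets the query stop after a fraction of each cell. The sampled mean is an estimate, so it will differ slightly from the full-read mean.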
Front 2: Data Science
We know that if you are a data scientist, you do not care so much about pipelines. You want to get all the data you need to tune your model. And it is a pain to rely on a data engineer every time you need to query a large dataset. You are losing time, and you can’t focus on the things that matter. But what if you could decide the time required to run your query yourself?
By analyzing the data with a tolerance limit, you can decide how long to wait for a query and adjust the precision to your use case. Yes, this means that you can run a query on whatever you want. Do you want to know the number of sales in the last few months? Full precision! But do you really need to scan your whole data lake to see the percentage of male users? Probably not.
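The tolerance idea described above can be sketched in a few lines. This is a minimal illustration, not Qbeast’s API: the function `estimate_proportion`, its parameters, and the synthetic user data are all assumptions made for the example. It also assumes the rows arrive in random order and uses the textbook normal approximation for the margin of error of a proportion, which is unreliable when the true share is very close to 0 or 1.

```python
import math
import random


def estimate_proportion(rows, predicate, tolerance, batch_size=1_000, z=1.96):
    """Estimate the share of rows satisfying `predicate`.

    Reads rows one by one and, after every `batch_size` rows, checks whether
    the approximate 95% margin of error is at or below `tolerance`.
    Assumes `rows` is already in random order (hypothetical helper).
    Returns (estimate, margin_of_error, rows_read).
    """
    hits = 0
    seen = 0
    margin = float("inf")
    for row in rows:
        seen += 1
        if predicate(row):
            hits += 1
        if seen % batch_size == 0:
            p = hits / seen
            margin = z * math.sqrt(p * (1 - p) / seen)
            if margin <= tolerance:
                break
    else:
        # Every row was read, so the result is exact for this dataset.
        margin = 0.0
    if seen == 0:
        raise ValueError("no rows to analyze")
    return hits / seen, margin, seen


# Usage: synthetic users, roughly 48% of them labeled "male".
data_rng = random.Random(7)
users = ["male" if data_rng.random() < 0.48 else "female" for _ in range(200_000)]

# Approximate answer: stop once the margin of error is within one percentage point.
share, margin, rows_read = estimate_proportion(
    users, lambda u: u == "male", tolerance=0.01
)
print(f"approximate: {share:.3f} +/- {margin:.3f} after {rows_read} of {len(users)} rows")

# Full precision: a tolerance of 0 forces a scan of every row.
exact, exact_margin, exact_rows = estimate_proportion(
    users, lambda u: u == "male", tolerance=0.0
)
print(f"exact:       {exact:.3f} after {exact_rows} rows")
```

A looser tolerance lets the loop stop earlier, which is the trade-off between waiting time and precision. A tolerance of zero falls back to a full scan.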
With Qbeast, you can get the results you need while accessing only a minimum amount of available data. We call this concept Data Leverage. With this option, you can speed up queries by up to 100x compared to running state-of-the-art query engines such as Apache Spark.
A storage system that unites multidimensional indexing techniques and statistical sampling solves the data analytics pain chain by speeding up queries, reducing complexity, and adding flexibility. This results in a significant speed-up of iteration cycles in data teams. Increased productivity and faster data analysis have a colossal impact on the ability to meet data needs and to create superior data products. And above all, alleviating the pain chain results in happier data teams, decision-makers, and customers.
But the pain chain doesn’t end here! Now it is time for the application developers to pick up all the insights uncovered by the data scientists and use them to build amazing products! That’s a topic for another post, but I bet you have guessed it: we have a solution for that too.
1. Team Ascend. “New Research Reveals 97% of Data Teams Are at or Over Capacity”, Ascend.io, 23 July 2020, www.ascend.io/news/company-announcements/new-research-reveals-97-of-data-teams-are-at-or-over-capacity. Accessed 28 December 2020.
2. Eades, Keith M., The New Solution Selling: The Revolutionary Sales Process That is Changing the Way People Sell, McGraw-Hill, 2004.
3. C. Cugnasco et al., “The OTree: Multidimensional Indexing with efficient data Sampling for HPC,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 433–440, doi: 10.1109/BigData47090.2019.9006121.