Leveraging Data Virtualization for Ad-hoc SQL Queries and BI

Kaycee Lai
Oct 19, 2020
5 min read

Updated: Dec 7, 2020

Virtualization is a pillar of modern computing which makes new, more advanced things possible by shielding users from underlying complexities of IT architecture. It involves using software to create a “virtual layer” of simplification over very complex architecture, and is used by IT to do more with less.

One of the most important areas where virtualization plays today is in simplifying the mayhem otherwise known as “data infrastructure”. In the not-too-distant past, data infrastructure was typically housed within a reasonably-sized on-premise datacenter and accessed with a single BI tool. As data grew and cloud storage came along, data infrastructure became dispersed across numerous databases, data warehouses and data lakes, all operated by different vendors and physically housed in varied geo-locations. Concurrently, more BI tools came along which were adopted by different departments within the same company. Consequently, the modern data infrastructure is a hodge-podge of data siloed in Hadoop, Snowflake, S3 buckets, Oracle, etc., accessed by different departments using Tableau, Power BI, Looker, and other tools.

A Case Study in the Need for Data Virtualization: Building Your Next-Gen Product

Let’s say your company is looking to produce the next version of its flagship product, and needs to know what modifications need to be made in both aesthetic design, and the structural design. Many of the questions that you’d need to ask can only be answered by cross-analyzing data from different data sources.

Problem 1: Those data sources likely reside in different locations, and are managed by different vendors. For instance, you’d want to look at customer feedback and other data which might reside in CRM systems, as well as manufacturing data, which might reside in other BPS systems.

Problem 2: Those data sources probably use different formatting. For instance, they might use different conventions for describing the same locations (“Ms” vs. “Mississippi”) or refer to products with slightly different formats. Even though these can be deciphered by human beings, they might render the data useless to your BI tools.

How do you solve this problem? Virtualization

Virtualization Brings Order to the Chaos

This creates a nightmare for a data analyst who must spend huge amounts of time--often the majority of the project’s hours--looking for data. On the outset it might seem like this can be solved by simply enforcing better data management. However, it’s not feasible to consolidate everything in a single data warehouse. You’re never going to stop buying newer, better solutions, so there will always be the need to move and copy data to new systems--this takes time and money, and isn’t feasible as data grows.

Data virtualization, on the other hand, deals with the mess by inserting a layer on top of it that makes it appear simple and navigable to the user. It means that you don’t need to move the data in order for your BI tools to access it. It enables you to skip the long process of ETL (extract, transform, load), and facilitates query-in-place so that tables can be generated on the fly without actually moving anything. If that sounds a little magical, perhaps that’s why one of the biggest tools in data virtualization is named “Presto”.

Starburst Brings Presto and Data Virtualization to the Masses

Presto, which has become synonymous with data virtualization, started off at Facebook. They couldn’t live with the dismal performance of a single SQL query engine on numerous data warehouses, databases and data lakes, so they developed one that operated over a virtualized data infrastructure. It worked so well that it has since been adopted by Twitter, Uber, Airbnb and other data-driven companies.

However, Presto was still complicated to implement, and it wasn’t until Starburst came up with a way to offer enterprise support that the idea of SQL queries across a virtualized data infrastructure became widely available to companies. Starburst, with Presto under the hood, allows you to connect and do query aggregation on the fly, avoiding the ingestion loading and heavy wrangling otherwise associated with analytics in a complex data infrastructure.

A Few missing Pieces to the Data Virtualization Puzzle

Starburst/Presto doesn’t solve every problem, however. For example, you still need to:

Know what question to ask
Capture the question to figure out where the data is
Verify that the data exists and is in usable form
Assemble the data
Write the SQL query (including all the JOINs)

Once all this happens, the query can be executed, via Presto, across the virtualized distributed data system. However, getting to this point can be a maddening, dead-end-ridden process of navigating potentially dozens of versions of the same data, each with inconsistencies and variations that make it impossible to get to a ‘single version of the truth’.

Semantic Search Fills in the Gaps in Data Virtualization

Used in conjunction with the right tools, however, this complicated process can be simple and seamless. The key is to also abstract the portion of the query process that involves the above steps--capturing the question, verifying and assembling the data, writing the complex SQL query--so that it is executed in a manner similar to a ‘simple’ Google search. Note that the quotes around ‘simple’, as in reality it’s one of the most technically challenging feats of the modern era. Yet, it has been abstracted to the point where even the most tech-illiterate user can get the information they need within seconds.

The same principle that enables Google search can be applied to enterprise data hidden in the deepest, darkest crevices of the messiest data infrastructure. Using NLP, the user can enter a question in natural language, and the system shows them all the best options for answering the question, and helps them--via an interactive visual map--assemble the right components to quickly get answers.

To the analyst, it doesn’t matter where the data is located, or what BI tools they’re using--as long as they know the question, within minutes they can get the data needed to provide the answers. Everything is fully virtualized--they don’t even have to know that Presto exists, or that the “table” they’re looking at is actually data residing in far removed sections of your infrastructure. On top of that, this ultimate form of virtualization means that no data ever has to actually move for analysis to be performed (which is also a huge advantage for compliance/governance). The data stays put, and the only thing that moves (forward) is the company’s knowledge and understanding.

Weighing the Options

Let’s go back to our original example, and imagine you need to analyze data from totally separate sources to inform your next-gen product design, you’ve got a few options for dealing with this:

Option 1: The route that most companies are currently taking is to ask IT to manually load the data, manually transform it, and manually prep it for analysis. At best, you can kiss four months of your life goodbye while you wait. At worst, it won’t happen at all, or by the time it does happen, your question will have changed.

Option 2: OR, you can use data virtualization to query data without having to do all those things in Option 2) Hello insights!

Learn more about how Promethium works with Starburst to bring data virtualization to the most complex data infrastructure.