Why We Need Data Silos, and How We Can Work With Them Effectively
For data engineers, one of the biggest challenges is that they are usually dealing with data infrastructures in which data is ‘siloed’ between different entities, making it difficult to get the insights needed to answer important questions. For this reason, ‘data silo’ has become something of a dirty word in the data management, analytics and storage industries.
As such, the conventional wisdom is that they need to be eliminated. But should data silos really be considered a mere nuisance which, once removed, will allow companies to finally move forward as data-driven organizations? In this article we’ll discuss why data silos are actually necessary, as well as alternative approaches that allow organizations to retain their benefits while working around the difficulties they pose.
Data Silos Serve a Purpose
Consider a parallel trend that’s taken hold in Silicon Valley--the open workspace. Companies need to collaborate better--that’s a given. Walls, so the thinking goes, separate people. Get rid of the walls in an office space, then, and we’ll all be one happy family, collaborating Gen Z style. Of course, it hasn’t really worked out like that. Walls, it turns out, serve many legitimate and critical purposes: they allow confidential matters to be discussed in private, let people focus intensely on work that requires prolonged concentration, and give introverts a break from constant interruptions.
Likewise, data silos are not just a mistake to be rectified. They exist for many reasons, several of which are outlined by Bin Fan and Amelia Wong in 97 Things Every Data Engineer Should Know, including:
Datasets have varying characteristics, ranging from transactional data to IoT data streaming from sensors on equipment to social media chatter. Their collection and storage procedures often warrant different repositories, and they carry different processing requirements and other distinctive elements.
Some datasets are more critical or time-sensitive to the business than others, and are thus stored in different tiers. It doesn’t always make sense to keep less critical data in expensive storage schemes.
Data has been acquired at different phases of the company’s development. Newer data may be stored in more recently developed repositories, and it may not be feasible to ETL older datasets into them.
Some datasets are subject to tighter regulation and have special storage requirements--once again, it doesn’t always make sense to store all of the organization’s data in the same repository, particularly if their requirements are exacting or expensive.
Different departments may have different specifications for their data. Marketing leaders know things about marketing data that no one else knows, for example. It doesn’t always make sense for data to be managed at the macro level.
So it’s clear that a one-size-fits-all approach may not be the best--particularly for large organizations with multiple data types and a wide range of requirements. Knocking down all the silos--at least in the literal sense--would in many cases make no more sense than eliminating the silos in a granary. Throwing all the data into a giant data warehouse or Hadoop cluster would not only be impractical from a cost perspective, but could have some very negative repercussions.
What’s required is a data mesh-styled method of letting people across the organization access all the data within the architecture as it currently exists, without duplicating data or adding more layers of complexity.
Coexisting with Data Silos by Using a Data Fabric
We can learn to peacefully coexist with our data silos by implementing a concept known as the Data Fabric--a layer of abstraction between storage repositories and the users who need to access the data, one that allows them to view and query everything as if it were a single repository. It isn’t really a single repository--data that needs to be physically separated is still held in distinct storage schemes--but access is standardized through virtualization.
In such an environment, data engineers can pull data from multiple sources into the same analytics platform. Data from Oracle might be joined with data in Hadoop to produce a single calculation. At the same time, the data engineering team doesn’t need to micromanage every data source--each can be managed at the department level by people with the domain expertise to understand the nuances of how it should be handled. Optimal storage decisions can be made without worrying about how they might impact the work of analysts.
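As an illustration of the kind of cross-source join described above--a sketch only, not the API of any particular fabric product--plain Python can stand in for the virtualization layer. Here an in-memory SQLite table plays the role of a relational store like Oracle, a list of records plays the role of event data sitting in Hadoop, and the "fabric" joins them on a shared key; all table, column, and variable names are hypothetical.

```python
import sqlite3

# Hypothetical relational source (standing in for, say, an Oracle database):
# customer records keyed by customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "APAC")])

# Hypothetical file-based source (standing in for, say, clickstream
# data in Hadoop): events keyed by the same customer_id.
clicks = [
    {"customer_id": 1, "clicks": 42},
    {"customer_id": 2, "clicks": 7},
    {"customer_id": 1, "clicks": 3},
]

# The "fabric" layer: join the two sources on customer_id and
# aggregate clicks per region, without copying either source wholesale.
totals = {}
for row in clicks:
    (region,) = conn.execute(
        "SELECT region FROM customers WHERE customer_id = ?",
        (row["customer_id"],),
    ).fetchone()
    totals[region] = totals.get(region, 0) + row["clicks"]

print(totals)  # {'EMEA': 45, 'APAC': 7}
```

A real data fabric pushes this join down to the underlying engines rather than looping in application code, but the end result is the same: the analyst sees one logical dataset while each source stays where it is.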
If you want to learn more about how a data fabric can accomplish all of this, read on.