A Data Fabric: Should I Build or Should I Buy?
Of all recent developments in data management, the data fabric may be the most compelling, for a number of reasons. For example, it promises to connect all of an organization’s data in a single unified environment, so that any team, regardless of department or location, can collaborate on data projects. Additionally, it promises to do so without requiring ETL (extract, transform, load), or even its easier twin, ELT (extract, load, transform). The data can be located, integrated with other data sources, and analyzed without ever leaving its original source location. Furthermore, it promises to make the process of locating the right data sources and combining them into an analysis-ready dataset intuitive and accessible even to non-technical users.
If data fabrics are so great, why doesn’t everyone have one? This lack of adoption is at least partially due to a perception that a data fabric must be custom-built for an organization. Proponents of this point of view argue that, while there are plenty of solutions out there touting the moniker ‘data fabric’, most of them don’t yet provide the components of a true comprehensive fabric.
Is this a fair view? No, for reasons we’ll discuss shortly. But it has big implications, because many organizations are currently either engaged in a multi-year, multimillion-dollar project to build their own data fabric, or simply ignoring its potential and sticking with the insufficient data architecture they already have in place.
First, let’s talk about what building a data fabric would take. Then, let’s talk about the minimum requirements an off-the-shelf enterprise data fabric solution would have to meet to be worth buying rather than building.
Critical components for a data fabric
A data fabric relies heavily on augmented data management and automation at all levels of design. Here are a few of the core components for any system that qualifies as a data fabric:
Metadata. Information that describes data assets, or ‘metadata’, is the backbone of a data fabric, as it allows the system to make sense of a data architecture, and make it accessible to human beings. A fully functioning data fabric will include the ability to collect and make available technical metadata that describes the properties of data sources such as format, data types, and access protocols; business metadata such as labels and classification; social metadata such as tags and annotations; and operational metadata describing how data has been used.
Processing layers. From an operational standpoint, a data fabric provides connections to data sources, manages workflows, and offers a unified abstracted or virtual view of all data assets, all while supporting governance by identifying and controlling access to sensitive data assets.
Application layers. The data fabric integrates applications that provide capabilities such as data discovery, data cataloguing, and data prep, with the ultimate goal of making it easier for teams to locate data and collaborate on analytics across all kinds of departments and business silos.
AI/Machine Learning. A data fabric must be ‘smart’, meaning that it leverages AI and machine learning to continually learn from the data architecture. As its understanding improves, it is able to automate certain functions of the processing and application layers, making the process of locating, assembling and analyzing data much faster.
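To make the four kinds of metadata more concrete, here is a minimal Python sketch of how a single catalog entry might hold them. The class name, field names, and example values are all hypothetical and chosen only for illustration; a real data fabric would store far more than this and would not necessarily model metadata this way.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry combining the four metadata kinds."""

    name: str
    # Technical: properties of the source (format, data types, access protocol).
    technical: dict = field(default_factory=dict)
    # Business: labels and classification.
    business: dict = field(default_factory=dict)
    # Social: tags and annotations applied by users.
    social: list = field(default_factory=list)
    # Operational: a running log of how the data has been used.
    operational: list = field(default_factory=list)

    def record_use(self, user: str, action: str) -> None:
        """Append one usage event to the operational metadata."""
        self.operational.append({"user": user, "action": action})

    def is_sensitive(self) -> bool:
        """Governance check based on the business classification (assumed key)."""
        return self.business.get("classification") == "sensitive"


# Example entry; every value here is made up for illustration.
orders = DatasetMetadata(
    name="orders",
    technical={"format": "parquet", "access_protocol": "s3"},
    business={"label": "Customer orders", "classification": "sensitive"},
)
orders.social.append("finance")
orders.record_use("analyst_1", "query")
```

Keeping the four kinds in separate fields mirrors the distinction drawn above: the processing layer could consult the business classification for access control, while the operational log is the raw material the AI/machine learning component would learn from.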
It’s clear from even a surface-level exploration that creating a data fabric, even in its MVP (minimum viable product) form, would require extensive resources and expertise, an astronomical budget, and an indefinite amount of time. The task of creating a system that intelligently handles multiple types of metadata would, on its own, take years and cost millions of dollars.
Assuming you’re not IBM, Google, or Facebook, you probably don’t have access to these resources or the kind of budget required. If, however, you were to explore the option of an off-the-shelf data fabric, what would it need to make real improvements to your ability to manage data? At what point would it cross over from simply being a glorified data catalog to being a true fabric that expands and grows with your organization’s data?
Minimum requirements for an off-the-shelf data fabric
For a data fabric to be viable for an organization, it must provide certain features and benefits.
Ability to handle passive and active metadata. AI/machine learning would need to be employed to accumulate, organize, and learn from passive metadata, that is, the metadata that was initially applied to each dataset. Furthermore--and this is one of the key differentiators between a fabric and a simple catalog--it would need the ability to actively apply metadata based on user behavior and input as the system evolves. As it does so, it would discover relationships and potential integration points between datasets that exist in different repositories, and catalog these for future reference.
Ability to integrate data through all standard data delivery methods. A modern organization’s data comes from ETL, replication, messaging, virtualization and microservices, to name a few. A minimally viable data fabric needs to handle all of these, with very little setup required.
Ability to connect with all major data repositories. Similarly, a modern organization’s data repositories range from relational and NoSQL databases to data warehouses, data marts and data lakes built on platforms such as Hadoop. A data fabric must afford the ability to easily connect to all of these.
Capabilities for visualizing relationships between data sources in a user-friendly manner. Many analytics questions require data from multiple sources. A viable data fabric must make it easy to identify and visually assemble them.
Unified data access. Authorized users from any department need the ability to access data from any part of the organization, without having to copy it or move it.
Semantic search. Users need to be able to search for datasets just like they might search for a local movie theater on Google.
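The idea of active metadata, where the system learns relationships from user behavior, can be illustrated with a deliberately simple Python sketch. It counts how often two datasets are used together in the same analysis session and suggests a relationship once a pair crosses a threshold. This is plain co-occurrence counting standing in for the machine learning a real fabric would use, and the function name, the `min_count` threshold, and the dataset names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def suggest_relationships(sessions, min_count=2):
    """Return dataset pairs that were used together in at least min_count sessions."""
    pair_counts = Counter()
    for datasets in sessions:
        # Sort and de-duplicate so each pair is counted once per session,
        # always in the same order.
        for pair in combinations(sorted(set(datasets)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Each inner list is one analysis session; the names are made up and
# stand for datasets living in different repositories.
sessions = [
    ["crm.customers", "warehouse.orders"],
    ["warehouse.orders", "crm.customers", "lake.clickstream"],
    ["lake.clickstream", "crm.customers"],
]
links = suggest_relationships(sessions)
```

Here the customer table is paired with both the orders and the clickstream data in two sessions each, so those two pairs would be cataloged as candidate integration points, while orders and clickstream, seen together only once, would not.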
In sum, most organizations will find that attempting to build a data fabric is a quagmire. But commercial solutions may fit the bill if they meet, at minimum, the above requirements. Furthermore, a solution that meets these requirements will provide the following advantages over building a custom platform:
Up and running faster. Implementation time is dramatically reduced, from years to days and in some cases minutes.
Less time spent on integration services. Much less need to build custom integrations to data sources or integrate tools from multiple vendors, which may be incompatible and require significant programming gymnastics.
Reduced total cost of ownership (TCO). Much less need to build or maintain software, and also a reduction in the number of tools required to manage and analyze data.
If you’re interested in trying out a commercial data fabric solution that exceeds these requirements, read on.