Moving data is no trivial undertaking. Think about any complex system: a school, a company, even your house. Simply keeping it running normally is challenging enough. Attempting to pick it up and move it to a new location, however, is a greater challenge by orders of magnitude. Whether you’re moving corporate headquarters or just moving your family to the next town over, to do it right you need to be incredibly organized 1) before the move, as you’re prepping everything; 2) while everything is being transported (think about all the things that can go wrong); and 3) upon arrival, so that everything can be put in its proper place. The same principles apply to moving data.
Generally, data is moved for reasons related to analytics, governance, or system upgrades (which might be driven by security, cost savings or other concerns). For the purposes of this blog, we’ll focus mostly on analytics-related reasons. We’ll discuss the primary process for moving data, ETL (Extract, Transform, Load), as well as an alternate paradigm that has emerged in the form of the Data Fabric.
ETL: A brief overview
Extract, transform, load (ETL) describes the process of copying data from one or more repositories into a new centralized system, such as a data warehouse, data mart, or operational data store (ODS).
The extraction phase (the ‘E’ in ETL) involves gathering the data from different sources which may use different data schemas and formats, and may also be owned and guarded by different departments within your organization. Following the extraction phase, the Transformation phase (the ‘T’ in ETL) involves applying rules to prepare it for its new location. This may involve selecting the right columns to load, ensuring everything is consistently coded, aggregating data, and identifying duplicate data, to list only a few items.
During the ‘Load’ phase (the ‘L’ in ETL) the data is loaded into its new repository. This is often a recurring process that happens in hourly, daily, weekly, or monthly intervals, depending on business needs (such as the need to analyze very new data). At this time the rules defined in the transformation phase are applied.
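To make the three phases concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two sources (a sales CSV export and a list of support records), the field names, and the customers table are assumptions, and an in-memory SQLite database stands in for the warehouse. Real pipelines are far more involved, but the shape is the same.

```python
import csv
import io
import sqlite3

# Hypothetical sources: two departments exporting customers in different shapes.
SALES_CSV = "id,name,country\n1,Ada Lovelace,uk\n2,Alan Turing,UK\n"
SUPPORT_ROWS = [
    {"customer_id": "2", "full_name": "Alan Turing", "country_code": "UK"},
    {"customer_id": "3", "full_name": "Grace Hopper", "country_code": "us"},
]


def extract():
    """Extract: gather raw records from each source in its original format."""
    sales = list(csv.DictReader(io.StringIO(SALES_CSV)))
    return sales, SUPPORT_ROWS


def transform(sales, support):
    """Transform: select columns, code countries consistently, drop duplicates."""
    unified = {}
    for row in sales:
        unified[int(row["id"])] = (row["name"], row["country"].upper())
    for row in support:
        # setdefault keeps the first record seen for a given customer id
        unified.setdefault(
            int(row["customer_id"]),
            (row["full_name"], row["country_code"].upper()),
        )
    return [(cid, name, country) for cid, (name, country) in sorted(unified.items())]


def load(rows, conn):
    """Load: write the prepared rows into the target repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    sales, support = extract()
    rows = transform(sales, support)
    load(rows, conn)
    return rows
```

Calling run_etl(sqlite3.connect(":memory:")) leaves three deduplicated customers in the target table, with country codes in a single consistent format. Because the load uses INSERT OR REPLACE, the same job can be re-run on a schedule without creating duplicate rows.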
Done right, ETL gives you data in a ready-to-use state for analysts, developers and department decision-makers. But it also presents numerous challenges.
The Challenges of ETL
Many problems can occur during ETL. Our blog post What is ETL? has a more complete list, but as food for thought, consider the following.
For instance, the range of data values the system needs to handle may be wider than originally anticipated, and may require updated validation rules. Additionally, scalability may be an issue, and systems that are initially only required to process gigabytes of data may need to eventually handle much larger workloads. And if there are multiple input sources, a source that is quickly extracted and transformed may be delayed by a slower source. Inevitably, due to the strict requirements of ETL, some data is not fully accepted into the new system.
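The validation problem is easy to picture in code. The sketch below is only an illustration: the age and email fields, the default limits, and the function names are hypothetical. It shows how strict rules split incoming records into accepted and rejected sets, and how widening a rule later changes what gets through.

```python
def validate(record, min_age=0, max_age=120):
    """Return a list of rule violations for one record (empty means accepted)."""
    problems = []
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age is missing or not an integer")
    elif not min_age <= age <= max_age:
        problems.append(f"age {age} is outside {min_age}-{max_age}")
    email = record.get("email")
    if not isinstance(email, str) or "@" not in email:
        problems.append("email is missing or malformed")
    return problems


def split_valid(records, **limits):
    """Route each record to the accepted list or to the rejects with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record, **limits)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Keeping the rejected records alongside the reasons, rather than silently dropping them, makes it possible to revisit them once the rules are updated, for example by calling split_valid(records, max_age=150).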
Furthermore, copying data into a new system presents the risk of having multiple versions of the same data, negating the data engineering objective of maintaining a ‘single version of the truth’.
And consider the difficulty of selecting the proper tool for the job. An effective ETL tool requires: the ability to work with a wide range of data vendors and file formats; AI and machine learning capabilities such as the ability to assess data quality and assign metadata; and the ability to maintain a ‘lineage’ of all changes to the data over time.
As a result of these and other challenges, most organizations find it's not possible to move all enterprise data to a central repository for analytic purposes. And moving what can be moved takes months, possibly years. Furthermore, once data is moved, the rigidity of the process makes it very difficult to implement the changes that shifting needs inevitably require.
Data Fabric: why move the data if you don’t have to?
Since a primary objective of ETL is to have data ready for analytics, when the process becomes so involved that it actually slows the analytics cycle it may be time to consider an alternate approach. A data fabric provides a virtual layer that connects the entirety of an organization’s data, along with all of the processes and platforms that might be connected to it, using machine learning (ML) and artificial intelligence (AI) to make sense of and apply structure to all of it.
The abstracted ‘fabric’ means that you don’t have to move data in order to access it or integrate it with the system. It overlays existing architecture, allowing you to use the original data where it currently exists. This brings significant reductions in the cost and risk associated with moving or copying data into warehouses and other repositories. Additionally, it affords a degree of agility and scalability that you don’t have with ETL. When a new source arises, you simply connect it to the data fabric.
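As a rough illustration of the "access it where it lives" idea, here is a toy sketch in Python. It is not how any particular data fabric product works, and it leaves out the ML and AI pieces entirely. The VirtualLayer class, its connect and query methods, and the sqlite_reader helper are all hypothetical names invented for this example. The point is only that each source is read in place at query time, and nothing is copied into a central store.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]


class VirtualLayer:
    """Toy virtual layer: holds readers for sources, never a copy of their data."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[Row]]] = {}

    def connect(self, name: str, reader: Callable[[], Iterable[Row]]) -> None:
        """Register a new source by name; no data is moved at this point."""
        self._sources[name] = reader

    def query(self, predicate: Callable[[Row], bool] = lambda row: True) -> Iterator[Row]:
        """Read every connected source in place and yield the matching rows."""
        for name, reader in self._sources.items():
            for row in reader():
                if predicate(row):
                    yield {"source": name, **row}


def sqlite_reader(conn: sqlite3.Connection, table: str) -> Callable[[], Iterator[Row]]:
    """Build a reader that queries a SQLite table each time it is called."""
    def read() -> Iterator[Row]:
        # table is a trusted, hard-coded name in this sketch, not user input
        for id_, name in conn.execute(f"SELECT id, name FROM {table}"):
            yield {"id": id_, "name": name}
    return read
```

Adding a source is a single connect call, for example layer.connect("crm", sqlite_reader(conn, "crm")), and because readers run at query time, rows added to a source later show up in the next query without any reload step.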
In summary, a data fabric allows data to be accessed where it lives, while the traditional ETL approach provides access to data only after it has been moved to a central repository. This isn’t to say that you’ll never need to use ETL when you’re using a data fabric. There may be structural or other reasons for moving data, such as hardware upgrades. But with a data fabric, you’ll find yourself moving much, much less data. And that’s typically a very good thing.