What Are the Critical Steps in Data Preparation?
Data prep is an essential process that helps businesses make quick and informed decisions
Kobe Bryant was surely on to something when he said, “If you want to be a better player, you have to prepare, prepare and prepare some more.”
It’s not always glamorous or exciting for an analyst to spend most of his or her time preparing data, but when done correctly, preparing data before it enters the pipeline can generate insights that quickly turn into impactful business decisions. In its raw form, data may contain outliers, unnecessary tables, formatting errors and so forth. Similar to a young Kobe Bryant, data has to be polished, refined and arranged before it is ready to create Hall of Fame outcomes.
Here are essential steps that analysts, data scientists, developers, business users, and anyone else should take when preparing data.
Step 1: Identify Data Outliers
Outliers are unusually high or low data points that can skew results and create a false view of what the data is actually telling you. What qualifies as an outlier can be subjective, but the most common method is the 1.5 IQR rule. IQR stands for “interquartile range” and represents the range between the 25th and 75th percentiles of your data. If a data point lies more than 1.5 times the IQR above the 75th percentile, or more than 1.5 times the IQR below the 25th percentile, it is considered an outlier.
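The 1.5 IQR rule can be sketched in a few lines of Python using only the standard library (the gym floor sizes are made-up illustrative values):

```python
import statistics

def iqr_outliers(values):
    """Return the values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

sizes = [1000, 1200, 1100, 1300, 1250, 1150, 9000]  # gym floor sizes, sq ft
print(iqr_outliers(sizes))  # the 9,000 sq ft gym is flagged as an outlier
```

Note that different tools compute quartiles slightly differently, so the exact fence values can vary from one library to another.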
So what do you do with a data point that qualifies as an outlier? Determine whether it should be replaced, removed or kept, since the observation may have attributes that are helpful or even essential to the overall analysis.
Before removing the outliers:
Ask yourself if the observations that contain outliers also contain information that might be valuable. For example, if you have a dataset on gyms in a city, you might find that one of them doesn’t have a basketball court. But it might also have relevant non-outlier data on other attributes like a monthly membership fee, gym equipment, proximity to a popular neighbourhood and so forth.
Also factor in the size of your dataset before removing outliers. If you have a massive dataset and only a few outliers, it may not cause any problems when discarding the observation outliers. On the other hand, removing outliers with limited data might not be the best strategy.
Lastly, ask yourself if this outlier is the result of a mistake or glitch. Or, is it more likely an accurate reading? If the latter is the case, would discarding it cause your model to be misrepresentative?
Replacing the outliers:
If your answers to these questions lead you to believe that deleting the outlier data isn’t the best route, you can still prevent outliers from skewing your results by replacing them with a reasonable approximation of what the data should be. This can be done using central tendency calculations such as the mean, median or mode. If your outlier represents a gym with an unusually tiny facility, simply replace its value with the average square footage of the gyms in your dataset.
Carrying on with the gym example, you could also replace that outlier with data from related observations. Let’s say you know the average facility size of gyms in your area. You could simply insert the average size where the outlier data once lived.
Lastly, you can trim the outliers by setting them to the value of a specified percentile, a technique known as ‘winsorizing’.
Step 2: Handle Missing Data
Similarly, your dataset may contain observations with missing values. If this is the case, you can go through a process similar to the one described for the outliers. Make sure to ask yourself:
Do the observations with missing data contain other attributes which might be useful or important to the overall analysis?
Can we afford to discard the data or is the dataset big enough?
If your answers to these questions lead you to believe that discarding the observations with missing data points isn’t the best route, use one of the methods for replacing the outliers described above.
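For instance, a central-tendency replacement for missing values can be sketched like this, filling gaps with the median of the observed values (the membership fees are made-up illustrative values):

```python
import statistics

def impute_median(values):
    """Fill missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

fees = [30, 45, None, 50, 40, None]  # monthly gym membership fees
print(impute_median(fees))  # [30, 45, 42.5, 50, 40, 42.5]
```

The median is often preferred over the mean here because it is itself robust to any outliers remaining in the data.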
Step 3: Handle Inconsistencies
Inconsistent formatting, spelling and case can give you false results or prevent your model from running correctly. For example, if you have data that identifies the number of swimming pools in a workout facility, and some indicate “01” pool, and another identifies “one” pool and yet another indicates “1” pool, your model might consider these to be three different categories.
Once these inconsistencies are cleared, whether that be manually or through a program, handling formatting issues is a key step in ensuring that your data is an accurate representation of what you’re attempting to analyze.
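Cleaning these inconsistencies programmatically often comes down to mapping every known variant to a single canonical value. A hypothetical sketch for the swimming-pool example:

```python
# Hypothetical alias table mapping inconsistent encodings to one canonical value.
POOL_ALIASES = {"01": 1, "one": 1, "1": 1}

def normalize_pools(raw):
    """Map known spellings/formats of a pool count to a single integer form."""
    return [POOL_ALIASES.get(str(v).strip().lower(), v) for v in raw]

print(normalize_pools(["01", "One", " 1 "]))  # [1, 1, 1] — one category, not three
```

Lower-casing and stripping whitespace before the lookup catches case and spacing variants; values with no known alias pass through unchanged so they can be reviewed manually.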
Now, if all of this sounds too manual and maybe too big of an investment of your time, it is. At Promethium, we provide powerful machine intelligence that automatically finds relationships across siloed data, creates joins and writes queries, so you spend less time preparing and assembling data.
Analyzing is the fun part. Finding and assembling the data...not so much. Promethium cuts the process from months to minutes.