Data Readiness: Using the Right Data – Data, and more specifically the right data, is critical to AI and machine learning success. Stick around to find out why. I want to introduce this concept of the right data, which is critical for success in AI and machine learning. The reason is that these fields rely on data to train, test and optimize predictive and prescriptive models that perform well in the real world.
The six factors that contribute to data readiness are data quantity, depth, balance, representativeness, completeness and cleanliness. Let’s talk about each one of these factors in turn. Okay, when it comes to data quantity, more data is almost always going to be better. The reason is that these AI and machine learning models are trying to automatically discover underlying correlations, relationships, patterns and so on, and the more data you have, the more likely it is that what these models discover will be representative, accurate and statistically significant.
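To make the point about quantity a bit more concrete, here is a minimal sketch, not anything from a real project. It draws a small and a large sample of random numbers and compares the standard error of the mean, one common way to express how uncertain an estimate is. The sample sizes, the fixed seed, and the `standard_error` helper are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def standard_error(sample):
    """Standard error of the sample mean: sample stdev / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

rng = random.Random(0)  # fixed seed so the run is repeatable
small_sample = [rng.random() for _ in range(100)]
large_sample = [rng.random() for _ in range(10_000)]

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)

# The larger sample gives a tighter (less uncertain) estimate of the mean.
print(f"n=100    standard error: {se_small:.4f}")
print(f"n=10000  standard error: {se_large:.4f}")
```

The uncertainty shrinks roughly with the square root of the sample size, which is one reason more data tends to produce more reliable findings.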
Also, having more data often allows you to create simpler models that perform better, which is great for minimizing the need for expensive resources and tons of time to train models. In terms of data depth, in machine learning, there’s this idea of data features. Normally, when you think of regular data analysis or databases, people tend to think in terms of data fields or data attributes.
So let’s say you have data in your CRM about people. You’ll have attributes or data fields like name, phone number, email address, and so on. In the case of machine learning, you almost always refer to the same exact thing as data features. And these features can take on many different values. Some might be binary, like yes or no, or 1 or 0. Some might be a continuous number. Some might be a category, like a job title, for example, and so on.
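As a small illustration of those three kinds of features, here is a sketch of a single record. The `Contact` class, its field names, and the `feature_kind` helper are hypothetical, made up for this example, and the classification rule is deliberately rough.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    opted_in: bool        # binary feature: yes/no
    annual_spend: float   # continuous feature: any number in a range
    job_title: str        # categorical feature: one of a set of labels

def feature_kind(value):
    """Roughly classify a raw value as binary, continuous, or categorical."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "binary"
    if isinstance(value, (int, float)):
        return "continuous"
    return "categorical"

contact = Contact(opted_in=True, annual_spend=1250.0, job_title="Engineer")
kinds = {name: feature_kind(value) for name, value in vars(contact).items()}
print(kinds)
```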
So what happens is, if you have all this data that you’re trying to use to create a machine learning model, but you don’t have enough variation in those different features, then it’ll be very difficult for these models to learn the underlying relationship between these features and their impact on the ultimate goal, which might be making a prediction or a recommendation or something like that.
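One rough way to spot a lack of variation is to count how many distinct values each feature actually takes in your data. The sketch below assumes records are plain dictionaries; the feature names and the `min_distinct` threshold are assumptions for illustration, and a real check would depend on the feature type.

```python
def distinct_counts(records):
    """Count the distinct values seen for each feature across all records."""
    seen = {}
    for record in records:
        for feature, value in record.items():
            seen.setdefault(feature, set()).add(value)
    return {feature: len(values) for feature, values in seen.items()}

def low_variation_features(records, min_distinct=2):
    """Return features that take fewer than min_distinct different values."""
    counts = distinct_counts(records)
    return [feature for feature, n in counts.items() if n < min_distinct]

records = [
    {"job_title": "Engineer", "country": "US", "opted_in": True},
    {"job_title": "Designer", "country": "US", "opted_in": False},
    {"job_title": "Analyst", "country": "US", "opted_in": True},
]

flat = low_variation_features(records)
print(flat)  # "country" never varies here, so a model cannot learn its effect
```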
So when we talk about data depth, it’s really important that you have a lot of data with a lot of variation, and that this variation is representative of the real-world variation you would see in these features, so that these models can learn those underlying relationships and patterns very well. In terms of data balance, that’s this idea that the dataset is relatively well balanced in terms of the target variable, the thing you’re trying to predict. Let’s say you’re trying to determine whether an email is spam or not spam, or you’re taking a look at a radiographic image of a tumor and trying to determine whether it’s cancer or not cancer.
Since that’s the thing you’re trying to determine, that’s your target, or the thing you’re trying to predict, then you really need data that’s as well balanced as possible where you have examples of the positive target and the negative target. Or there may be more than just two options, but let’s say it’s spam or not spam or cancer or not cancer.
If you have a dataset that only has not cancer examples, except for maybe one, then you have a problem. Maybe you have thousands and thousands of examples or records or images, but only one of them is of a cancerous tumor, or only one of them is of a spam email. In that case the model’s not really going to develop a good understanding of what a spam email looks like or what a cancerous tumor looks like. So that’s what we mean by data balance, and it’s very important.
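Checking balance can be as simple as counting how often each target value shows up. This is a minimal sketch of the extreme case described above, with one spam example among thousands; the label names and helper functions are assumptions for illustration.

```python
from collections import Counter

def class_balance(labels):
    """Share of the dataset taken up by each target value."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Size of the most common class divided by the size of the rarest one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# 9,999 "not_spam" examples and a single "spam" example.
labels = ["not_spam"] * 9999 + ["spam"]

shares = class_balance(labels)
ratio = imbalance_ratio(labels)
print(shares)
print(ratio)
```

A ratio close to 1 means the classes are about even. The larger it gets, the fewer examples the model has of the rare class to learn from.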
So in terms of data representativeness, it’s very similar to data depth, in that you’re basically saying that the data you’re training, testing and optimizing these models on is representative of all the variety and ranges that these models might come across in real-world data. But it’s also about something you’re hearing more and more about today, this idea of bias: having models that are biased in some way because of the underlying data. And so, representativeness is also about having non-skewed, non-biased data samples that are inclusive, so that the models don’t exhibit bias, because the data represents all the conditions and scenarios they’ll come across.
Data completeness is this idea that the data you’re using for these models includes all the underlying factors that have a significant enough impact on the thing that you’re targeting, whether you’re trying to predict it, make a recommendation or whatever it is.
You may have a dataset with a lot of great data that is very related to that target variable, like spam or not spam, but you also might be missing certain data features or underlying data that strongly influence whether a tumor is cancerous, an email is spam, or a stock price fluctuates. If data that has a big impact on that target and is highly predictive of it is not present in your dataset, then you don’t really have complete data. You need to get that data so that you can build a model that is truly able to map from those inputs to whatever output you’re trying to create, a prediction, let’s say.
And so, a lot of times, to do that, either you have to get more data that has new features, values and variables that you didn’t already have, or data scientists will do what’s called feature engineering, where they create new data features from existing data features. In either case, it’s really important, especially in predictive and prescriptive analytics, whenever you’re trying to optimize something, predict something or make recommendations, that your data has as many of the primary predictive factors or variables in it as possible.
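Here is a small sketch of creating new features from existing ones, using the spam example. The derived features (subject length, exclamation count, share of uppercase letters) are assumptions picked for illustration, not a claim about which features actually predict spam.

```python
def engineer_features(email):
    """Return a copy of the record with features derived from the subject line."""
    subject = email["subject"]
    letters = [ch for ch in subject if ch.isalpha()]
    # Guard against subjects with no letters to avoid dividing by zero.
    uppercase_share = (
        sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    )
    return {
        **email,
        "subject_length": len(subject),
        "exclamation_count": subject.count("!"),
        "uppercase_share": uppercase_share,
    }

email = {"subject": "WIN NOW!!", "sender_known": False}
enriched = engineer_features(email)
print(enriched)
```

The original fields are kept, and the new ones give a model numeric signals that were only implicit in the raw text.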
Finally, data cleanliness is a really important factor of data readiness as well. And what do we mean by that? Well, a lot of data is really dirty: it has a lot of errors, or missing values, or formatting issues, or multiple different representations of the same value. And so, a lot of times, data scientists or machine learning engineers need to go in and clean up the data, fill in the missing values, and get rid of data that has bad values or errors. A lot of times you’ll hear that referred to as data munging, data wrangling, data processing, data transformation, or data preparation.
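The kind of cleanup described above might look something like this in its simplest form. It is a sketch under stated assumptions: the field names, the valid age range, the `"unknown"` default, and the spelling map are all hypothetical, and real cleaning rules depend heavily on the dataset.

```python
# Hypothetical map from alternate spellings to one canonical value.
COUNTRY_ALIASES = {"usa": "us", "united states": "us"}

def clean_records(records, default_country="unknown"):
    """Drop records with bad ages, fill missing countries, normalize spellings."""
    cleaned = []
    for record in records:
        age = record.get("age")
        # Get rid of records with missing or impossible age values.
        if age is None or not 0 <= age <= 120:
            continue
        # Fill in a missing country, then fix whitespace and casing.
        country = (record.get("country") or default_country).strip().lower()
        # Collapse multiple representations of the same value into one.
        country = COUNTRY_ALIASES.get(country, country)
        cleaned.append({"age": age, "country": country})
    return cleaned

records = [
    {"age": 34, "country": " USA "},
    {"age": 29, "country": None},
    {"age": -5, "country": "us"},
    {"age": 41, "country": "United States"},
]

cleaned = clean_records(records)
print(cleaned)
```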
There are a lot of different terms for this cleanup work, but essentially, a big part of the job of building these AI and machine learning models is preparing data so that the models perform well. And that whole process can be a little messy because data often, as much as we’d like it to be nice and clean, isn’t. All right, well hopefully it’s become clear that data, and the right data, is critical for success with AI and machine learning, and in particular for building successful, well-performing models that are deployed and used in the real world.
If you’ve ever come across any of these data issues, or even ones we didn’t discuss today, let us know in the comments.