# Explained: How Much Data Do You Need to Train a Machine Learning System

You can understand about 80 percent of how machine learning works if you understand one single concept: the things that machines learn from are simply data points represented in a multi-dimensional space.

Once the real world has been mapped into this feature space, machine learning algorithms simply work on that space to understand the real world. For example, an image of a dog is just one data point in that feature space. If the image is 100 pixels by 100 pixels, with one value per pixel, then it’s a single data point in a 10,000-dimensional feature space.
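As a rough illustration of that mapping, here is a minimal sketch that turns a 100 by 100 image into one point with 10,000 coordinates. The image contents and the `flatten` helper are made up for the example, and it assumes a grayscale image with a single value per pixel.

```python
# A hypothetical 100 x 100 grayscale image: a list of rows,
# with one brightness value (0-255) per pixel. The values are arbitrary.
WIDTH, HEIGHT = 100, 100
image = [[(row + col) % 256 for col in range(WIDTH)] for row in range(HEIGHT)]

def flatten(img):
    """Concatenate the rows of a 2D image into one flat feature vector."""
    return [pixel for row in img for pixel in row]

feature_vector = flatten(image)
print(len(feature_vector))  # 10000
```

Each of the 10,000 entries is one coordinate of the point, so the whole image becomes a single location in the feature space.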

Humans can grasp 2D and 3D space easily, but we have difficulty grasping higher-dimensional space. Thankfully, the mathematical concepts are the same in any number of dimensions, so the same math that’s used in 2D can be used in higher-dimensional space.
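To make that concrete, here is a small sketch using straight-line (Euclidean) distance as one example of such math. The function name is just for illustration. The same few lines work for a 2D point and for a 10,000-dimensional one.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The familiar 3-4-5 triangle in 2D.
print(euclidean_distance([0, 0], [3, 4]))              # 5.0

# Exactly the same function in 10,000 dimensions.
print(euclidean_distance([0] * 10_000, [1] * 10_000))  # 100.0
```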

Computers are good at computation, and they don’t have a problem dealing with this very high-dimensional space.

Machine learning is broadly used for two kinds of problems: classification and prediction. Let’s take classification. Classification algorithms find groupings of the data points. So how much data do you need to train a machine learning algorithm for classification?

Let’s say you want to design a simple thermometer. You need to know what range of temperatures you’d like to measure, because no single thermometer can give you both the range and the accuracy across the whole spectrum, which may include measuring the temperature of liquid nitrogen, your fridge, the outside weather, your body, your oven, the inside of a light bulb, or a welding arc, for that matter.

Machine learning models work the same way. If you want to build a system that differentiates between dogs, cats and hamsters, then your training data should represent the variation in images of these pets, which in turn maps to the region of the multi-dimensional space you’re interested in.

I’ve tried to show this in 3D to make it easier to picture, but as you move to higher and higher dimensions, the data becomes very sparse. This means that sometimes even a million data points will not help if 99% of them represent dogs and only 1% represent cats and hamsters.

So simply having more data isn’t enough on its own. On the other hand, if you have very few data points, say two or three for each class, then you can’t cover the whole space effectively. Remember that in multi-dimensional space data is very sparse, and it’s even more difficult to represent the regions of interest with very few data points. What you ideally need is a set of data points that represents the range of variation within each class that you’d like to train the system on. Then the system can generalize over the space, and given a new data point in that space, it can identify the animal correctly, just like in this case.
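To get a feel for how quickly the space becomes sparse, here is a back-of-the-envelope sketch. It assumes each axis is split into 10 bins (an arbitrary choice for illustration) and asks what fraction of the resulting grid cells a given number of points could occupy at best. The function name and the bin count are not from any library.

```python
def max_coverage(num_points, dimensions, bins_per_axis=10):
    """Upper bound on the fraction of grid cells that num_points can occupy.

    Each point lands in exactly one cell, so at most num_points cells
    out of bins_per_axis ** dimensions can contain any data.
    """
    cells = bins_per_axis ** dimensions
    return min(1.0, num_points / cells)

# One million points: plenty in a few dimensions, almost nothing in many.
for d in (2, 3, 6, 10, 20):
    print(f"{d:>2} dimensions -> at most {max_coverage(1_000_000, d):.2e} of cells occupied")
```

Even in the best case, a million points cannot touch more than a tiny fraction of a 10-dimensional grid, let alone a 10,000-dimensional one, which is why where the points fall matters more than how many there are.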

How would you know if you have enough data? One way is to plot the error against the number of data points you have. The error should go down as you increase the training data. Having said that, there is no single way to tell whether you have enough data, and there are other caveats. Here are a few. If there are too few data points, or the model hasn’t been trained enough, then the system can generalize too much, resulting in underfitting and therefore high error.
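The error-versus-data-size check described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the two classes are synthetic 2D blobs, the class centers are invented, and the model is a very simple nearest-centroid classifier rather than anything a real image system would use. It prints the numbers you would plot instead of drawing the plot.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical class centers in a 2D feature space (illustrative values only).
CENTERS = {"dog": (0.0, 0.0), "cat": (3.0, 3.0)}

def sample(label):
    """Draw one noisy 2D point around the center of the given class."""
    cx, cy = CENTERS[label]
    return (random.gauss(cx, 1.0), random.gauss(cy, 1.0)), label

def make_dataset(n_per_class):
    """Build a balanced dataset with n_per_class points for every class."""
    return [sample(label) for label in CENTERS for _ in range(n_per_class)]

def train_centroids(data):
    """Nearest-centroid 'training': average the points of each class."""
    centroids = {}
    for label in CENTERS:
        points = [p for p, l in data if l == label]
        centroids[label] = tuple(sum(axis) / len(points) for axis in zip(*points))
    return centroids

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(centroids, point):
    """Assign the point to the class whose centroid is closest."""
    return min(centroids, key=lambda label: squared_distance(point, centroids[label]))

def error_rate(centroids, data):
    wrong = sum(1 for point, label in data if predict(centroids, point) != label)
    return wrong / len(data)

test_set = make_dataset(500)  # held-out data the model never trains on

sizes = [1, 2, 5, 20, 100, 200]
curve = []
for n in sizes:
    model = train_centroids(make_dataset(n))  # fresh training sample of size n per class
    curve.append((n, error_rate(model, test_set)))

for n, err in curve:
    print(f"{n:>4} points per class -> test error {err:.3f}")
```

If the error has flattened out by the larger sizes, adding more data of the same kind is unlikely to help much. If it is still dropping, more data probably will.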

If the model is tuned too closely to the data, then it can essentially memorize the data. That results in overfitting, which causes high error rates on unseen data. If the data is not balanced (balanced meaning close to the same number of data points per class), then a model trained on it may be worthless even though it might report high accuracy. The bottom line is that your data should represent the problem space well: it should be relevant and it should have a good distribution.
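The point about imbalance and misleadingly high accuracy can be shown with a small sketch. It assumes a 99% / 1% split like the one mentioned earlier and a deliberately useless hypothetical model that always answers "dog". The helper name is made up for the example.

```python
from collections import Counter

# 1,000 labels: 99% dogs, 1% cats and hamsters.
labels = ["dog"] * 990 + ["cat"] * 5 + ["hamster"] * 5

# A degenerate model that ignores its input and always predicts "dog".
predictions = ["dog"] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

def recall_per_class(y_true, y_pred):
    """Fraction of each class's examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

recall = recall_per_class(labels, predictions)

print(accuracy)  # 0.99
print(recall)    # {'dog': 1.0, 'cat': 0.0, 'hamster': 0.0}
```

The overall accuracy looks excellent, yet the model never recognizes a single cat or hamster, which is why it is worth looking at per-class results whenever the classes are unevenly represented.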