# How to evaluate a classification model: Confusion Matrix

The earliest reference to the concept was made by the British statistician Karl Pearson in 1904, and it is still one of the most important tools for evaluating a classification model today. Let’s say we have two kinds of fish, and we need to develop a model that classifies each fish as yellow or red. To do this, we could use a decision tree, an SVM, a neural network, or any other method.

There are a bunch of models to choose from, but how do we decide which one is the best? For the fish classification problem, assuming we have a model that predicts whether a fish is red or yellow, how do we measure its performance? The first measurement that comes to mind is the accuracy of the model. Accuracy is the fraction of all cases that the model identifies correctly.
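As a rough sketch of that calculation, accuracy can be computed by counting matching labels. The function name `accuracy` and the ten fish labels below are made up for illustration; they are not from any particular dataset or library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for ten fish (illustrative data only).
y_true = ["red", "yellow", "red", "yellow", "yellow",
          "red", "yellow", "red", "yellow", "yellow"]
y_pred = ["red", "yellow", "yellow", "yellow", "yellow",
          "red", "red", "red", "yellow", "yellow"]

print(accuracy(y_true, y_pred))  # 8 of the 10 predictions match: 0.8
```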

Accuracy is straightforward and easy to calculate. However, it has a big drawback. Here is another example. We still have yellow fish and red fish. In this case, there are many more yellow fish than red fish and we want to identify the red fish among them.

If someone claims to have a model with over 95 percent accuracy in this case, can you trust them? The answer is no. Well, here is an example. Suppose we have a model that simply labels almost every fish as yellow. Because the large majority of the fish are yellow, this model can easily achieve 95 percent overall accuracy. Intuitively, we know that claiming almost all the fish are yellow is not helpful; we should focus on identifying the red fish, which is the positive class.
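To see the problem in numbers, here is a minimal sketch that assumes a pond of 950 yellow fish and 50 red fish (numbers chosen only for illustration) and a "model" that labels every fish as yellow without looking at it.

```python
# Hypothetical, heavily imbalanced population: 950 yellow fish, 50 red fish.
y_true = ["yellow"] * 950 + ["red"] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = ["yellow"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
red_found = sum(t == "red" and p == "red" for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.2f}")                               # accuracy: 0.95
print(f"red fish found: {red_found} of {y_true.count('red')}")  # red fish found: 0 of 50
```

The accuracy looks impressive, yet the model has not found a single red fish.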

Then the question is: how do we find a metric to achieve this goal? Now is the time to introduce the confusion matrix. In this matrix, the columns correspond to what the machine learning model predicted, and the rows correspond to the true labels. Since we only have two kinds of fish, there are two classes: the positive class, which is the red fish, and the negative class, which is the yellow fish.

The top left corner contains the true negatives. These are the yellow fish that were correctly identified by the model. The true positives are in the bottom right corner. These are the red fish that were correctly identified by the model. The bottom left corner contains the false negatives: fish that are actually red but were predicted to be yellow by the model. Lastly, the top right corner contains the false positives: fish that are actually yellow, but the model thinks they are red.

In summary, a confusion matrix clearly tells you where your model did well and where it made mistakes. It is extremely useful when the classes are very imbalanced and we are more interested in identifying the positive class, in this case the red fish. So instead of using accuracy, we could calculate recall from the confusion matrix to measure the ability of our model to find all of the positive class.
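To make the layout concrete, the four cells can be counted with a few lines of code. This is only a sketch: the helper `confusion_matrix` below is written for this example (it is not a library function), and the ten fish labels are hypothetical. Rows are true labels and columns are predictions, with the negative class (yellow) first, so the result reads `[[TN, FP], [FN, TP]]`.

```python
def confusion_matrix(y_true, y_pred, negative="yellow", positive="red"):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    index = {negative: 0, positive: 1}
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical labels for ten fish (illustrative data only).
y_true = ["red", "yellow", "red", "yellow", "yellow",
          "red", "yellow", "red", "yellow", "yellow"]
y_pred = ["red", "yellow", "yellow", "yellow", "yellow",
          "red", "red", "red", "yellow", "yellow"]

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
print(f"TN={tn} FP={fp}")  # TN=5 FP=1
print(f"FN={fn} TP={tp}")  # FN=1 TP=3
```

Here the model found three of the four red fish, missed one, and mistook one yellow fish for a red one.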

Here is the precise definition of recall: it is the number of true positives divided by the sum of the number of true positives and the number of false negatives. I also want to point out that recall alone can sometimes be misleading. If we label all the fish as red, then recall simply goes to one. To solve this issue, we need another metric, which is precision.
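Written as code, the definition of recall is a one-liner. The counts below are hypothetical (a pond with 50 red fish), and returning 0.0 when there are no actual positives is a convention chosen for this sketch.

```python
def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives (a convention)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


# Hypothetical model that finds 30 of the 50 red fish.
print(recall(tp=30, fn=20))  # 0.6

# Labelling every fish as red: no red fish is missed, so FN = 0.
print(recall(tp=50, fn=0))   # 1.0
```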

Precision represents the ability of the classification model to identify only the relevant positive class. It is defined as the number of true positives divided by the sum of the number of true positives and the number of false positives. Now we have two metrics, precision and recall. Which one should we focus on when optimizing a model?
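A small sketch shows how precision catches the "label everything as red" trick that recall misses. The counts assume a hypothetical pond of 950 yellow and 50 red fish, and returning 0.0 when nothing is predicted positive is a convention chosen here.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if nothing was predicted positive (a convention)."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0


# Labelling every fish as red: all 50 red fish are hits, all 950 yellow fish are false alarms.
print(precision(tp=50, fp=950))  # 0.05

# A more selective hypothetical model: 30 red fish found, 10 yellow fish mislabelled as red.
print(precision(tp=30, fp=10))   # 0.75
```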

Actually, there is a trade-off between them. Improving precision typically reduces recall, and vice versa. We can get a better understanding from a figure that plots both against the threshold applied to the output of a logistic regression model: as the threshold changes, precision and recall change accordingly, and in general one improves at the expense of the other.
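The same effect can be reproduced with a handful of numbers. In this sketch the scores stand in for the predicted probability of "red" from some classifier; they are hand-picked for illustration, not produced by a real model, and treating precision as 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall_at(threshold, scores, labels):
    """Predict positive when score >= threshold; labels are 1 (red) or 0 (yellow)."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    # Convention for this sketch: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities of "red" and the true labels (1 = red).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.50, 0.80):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.67  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.80  precision=1.00  recall=0.50
```

On this toy data, raising the threshold makes the model more selective, so precision rises while recall falls. Real data is usually noisier than this, and precision in particular does not have to move smoothly.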

You could give a higher priority to maximizing either precision or recall, depending on the problem you are trying to solve. For example, suppose we are trying to develop a model to find the people who will pay their loan on time. In this case, we would like the model to optimize precision. We will sacrifice some recall here, but we definitely don’t want to lend our money to anyone who will not pay the loan on time.

For the case of disease screening, we might have the model optimize recall. We will sacrifice some precision here, but we do want to find all the people who have the disease.

Actually, we can also combine the two metrics into one metric, called the F1 score. The F1 score is the harmonic mean of precision and recall, and you can aim to maximize this number to obtain better output from your model.
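As a minimal sketch, the harmonic mean can be computed directly from the two values. The precision and recall pairs below are hypothetical, and returning 0.0 when both are zero is a convention chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero (a convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced model: precision and recall are both 0.75.
print(f"{f1_score(0.75, 0.75):.3f}")  # 0.750

# Very selective model: perfect precision, but half the positives are missed.
print(f"{f1_score(1.0, 0.5):.3f}")    # 0.667

# "Label everything as positive": perfect recall, very poor precision.
print(f"{f1_score(0.05, 1.0):.3f}")   # 0.095
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by maximizing only precision or only recall.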