Understanding and effectively utilizing a confusion matrix and heatmap is crucial for anyone dealing with classification problems in machine learning. In this blog post, we’ll dive into the intricacies of these powerful tools, providing clear explanations and real-world examples to help you understand their significance. This comprehensive post will enhance your overall theoretical understanding, and you’ll learn how to implement the confusion matrix and heatmap in Python with the help of Seaborn and Sklearn.
What is a confusion matrix?
A confusion matrix is a table used in machine learning classification to evaluate the performance of a model on a set of data for which the true labels are known. It allows for a detailed analysis of how well the model is performing in terms of classification.
The confusion matrix is a square matrix that compares the predicted classes by the model with the actual classes. It has four components:
- True Positive (TP)
- True Negative (TN)
- False Positive (FP)
- False Negative (FN)
|                    | Actual Positive | Actual Negative |
| Predicted Positive | TP              | FP              |
| Predicted Negative | FN              | TN              |
True Positive (TP)
True Positive (TP) is a term used in the context of classification models, particularly in a confusion matrix. It represents the number of instances where the model correctly predicts the positive class. In other words, it’s the count of instances where both the actual and predicted labels are positive.
For example, if you have a binary classification problem where the positive class represents a specific condition (e.g., presence of a disease), then True Positive would be the number of cases where the model correctly predicts the presence of the disease.
True Negative (TN)
It represents the number of instances where the model correctly predicts the negative class. In other words, it’s the count of instances where both the actual and predicted labels are negative.
For example, in a binary classification problem where the positive class represents the presence of a disease, True Negative would be the number of cases where the model correctly predicts the absence of the disease.
False Positive (FP)
It represents the number of instances where the model incorrectly predicts the positive class. In other words, it’s the count of instances where the actual label is negative, but the model predicts a positive label.
For example, in a binary classification problem where the positive class represents the presence of a disease, False Positive would be the number of cases where the model wrongly predicts the presence of the disease when it is not actually present.
False Negative (FN)
It represents the number of instances where the model incorrectly predicts the negative class. In other words, it’s the count of instances where the actual label is positive, but the model predicts a negative label.
For example, in a binary classification problem where the positive class represents the presence of a disease, False Negative would be the number of cases where the model wrongly predicts the absence of the disease when it is actually present.
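If you are working with scikit-learn, these four counts can be read directly off a binary confusion matrix. Below is a minimal sketch; the y_true and y_pred lists are made-up example labels, not data from this post.
from sklearn.metrics import confusion_matrix
# Made-up binary labels: 1 = positive (e.g., disease present), 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# For a 2x2 matrix, ravel() flattens the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")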
Real Example of Confusion Matrix
Let’s assume we have conducted tests on 200 individuals, and the results are as follows:
- True Positive (TP): 40 individuals correctly diagnosed as COVID-19 positive.
- True Negative (TN): 130 individuals correctly diagnosed as COVID-19 negative.
- False Positive (FP): 10 individuals wrongly diagnosed as COVID-19 positive when they do not have it.
- False Negative (FN): 20 individuals wrongly diagnosed as COVID-19 negative when they actually have it.
The confusion matrix for this example would look like:
|                    | Actual Positive | Actual Negative |
| Predicted Positive | 40 (TP)         | 10 (FP)         |
| Predicted Negative | 20 (FN)         | 130 (TN)        |
This confusion matrix allows us to evaluate the performance of a COVID-19 testing model by assessing how well it correctly identifies positive and negative cases. From this matrix, we can calculate metrics such as sensitivity (recall), specificity, precision, accuracy, and F1 score to gain a more comprehensive understanding of the model’s performance.
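As a quick illustration, here is a minimal sketch (assuming the counts above: TP = 40, TN = 130, FP = 10, FN = 20) that computes those metrics by hand:
# Counts from the COVID-19 example above
TP, TN, FP, FN = 40, 130, 10, 20
accuracy = (TP + TN) / (TP + TN + FP + FN)      # 170 / 200 = 0.85
precision = TP / (TP + FP)                      # 40 / 50 = 0.80
sensitivity = TP / (TP + FN)                    # recall: 40 / 60 ≈ 0.667
specificity = TN / (TN + FP)                    # 130 / 140 ≈ 0.929
f1_score = 2 * precision * sensitivity / (precision + sensitivity)  # ≈ 0.727
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Sensitivity (recall): {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1 score: {f1_score:.3f}")
Accuracy alone can be misleading when the classes are imbalanced, which is why checking recall and specificity separately is worthwhile.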
What is a heatmap?
A heatmap is a graphical representation of data in a matrix format, where values in a matrix are represented as colors. It is a way to visualize and analyze the relationships and patterns in a dataset. Heatmaps are particularly useful for displaying two-dimensional data, making them popular in various fields, including statistics, data analysis, and machine learning.
In a typical heatmap:
- Each row and column of the matrix represents a variable or category.
- The cells of the matrix contain values, and these values are color-coded based on their magnitudes.
- Colors are used to represent the intensity of the values, with different colors indicating different levels of magnitude.
Heatmaps are often employed to visualize correlation matrices, confusion matrices, or any other matrix-based data where patterns and relationships need to be highlighted. The color gradient allows for quick identification of high or low values and patterns within the data.
In Python, the seaborn library is commonly used for creating heatmaps. Here’s a simple example using seaborn:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Example data
data = np.random.rand(5, 5)
# Create a heatmap using seaborn
sns.heatmap(data, annot=True, cmap="viridis")
# Display the heatmap
plt.show()
Result:
In this example, data is a 5×5 matrix of random values, and sns.heatmap is used to create a heatmap. The annot=True parameter adds numerical annotations to the cells, and cmap specifies the color map (in this case, “viridis”).
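Heatmaps of correlation matrices, mentioned earlier, follow the same pattern. Here is a hedged sketch using a small made-up DataFrame (the column names and values are illustrative only):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Made-up example dataset with three numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
df["feature_c"] = df["feature_a"] * 0.5 + rng.normal(size=100)  # correlated with feature_a
# Pearson correlation matrix, visualized as a heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix Heatmap")
plt.show()
The vmin and vmax arguments pin the color scale to the full -1 to 1 correlation range, so the colors stay comparable across datasets.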
Confusion Matrices with Heatmaps
Creating a confusion matrix with a heatmap involves using a combination of libraries such as scikit-learn for generating the confusion matrix and seaborn for visualizing it. Let’s see an example:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# Example confusion matrix
actual_labels = [1, 0, 1, 2, 0, 1, 2, 0, 2]
predicted_labels = [1, 0, 1, 2, 0, 1, 0, 0, 2]
cm = confusion_matrix(actual_labels, predicted_labels)
# Create a heatmap using Seaborn
sns.set(font_scale=2) # Adjust font size for better readability
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", linewidths=.5, square=True, cbar_kws={"shrink": 0.75}, xticklabels=[0, 1, 2], yticklabels=[0, 1, 2])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
Here, we have imported some necessary libraries:
- numpy for numerical operations.
- seaborn for creating visualizations, particularly the heatmap.
- matplotlib.pyplot for additional plotting functionalities.
- confusion_matrix from scikit-learn to compute the confusion matrix.
Next, the code creates an example confusion matrix using the confusion_matrix function from scikit-learn. It takes two lists (actual_labels and predicted_labels) as inputs and computes the confusion matrix.
The later code uses seaborn to create a heatmap of the confusion matrix. It sets the font size for better readability, specifies the figure size, and then creates the heatmap using the sns.heatmap function. Various parameters customize the appearance of the heatmap, such as annotations, the color map (“Blues”), line widths, square cells, and labels for the x-axis and y-axis. Finally, the plot is displayed using plt.show(). The resulting heatmap visually represents the confusion matrix, making it easier to interpret the performance of a classification model.
Final result:
Here, we observe a 3×3 matrix represented in color-coded form. For example, the value at position (0, 0) is 3, indicating that class 0 was correctly predicted three times, and so on.
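A common variation, not shown above, is to normalize the confusion matrix so each row sums to 1, which makes per-class accuracy easier to read when class sizes differ. Here is a minimal sketch reusing the labels from the example (the normalize="true" option is available in scikit-learn 0.22 and later):
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
actual_labels = [1, 0, 1, 2, 0, 1, 2, 0, 2]
predicted_labels = [1, 0, 1, 2, 0, 1, 0, 0, 2]
# normalize="true" divides each row by the number of actual samples in that class
cm_norm = confusion_matrix(actual_labels, predicted_labels, normalize="true")
sns.heatmap(cm_norm, annot=True, fmt=".2f", cmap="Blues", xticklabels=[0, 1, 2], yticklabels=[0, 1, 2])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Normalized Confusion Matrix')
plt.show()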