How to plot the AUC-ROC Curve using Python?
Let me explain a simple ROC graph and build up the underlying theory with a basic example. I’ll use a straightforward dataset so you can experiment and test the ideas yourself: the scikit-learn breast cancer dataset.
In this example, we are tackling a binary classification problem: predicting whether a tumor is malignant or benign from the input features.
Code in Python
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
# Load the data
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=44)
# Train the model (max_iter raised so the lbfgs solver converges on the unscaled features)
clf = LogisticRegression(penalty='l2', C=0.1, max_iter=10000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # hard class predictions at the default 0.5 threshold
# Test the model using the AUC-ROC graph
y_pred_proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f"data 1, AUC = {auc:.3f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()
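Before moving on to the definitions, it helps to see what roc_curve actually sweeps. Its third return value (ignored with _ above) is the array of decision thresholds it tried; the short sketch below, reusing y_test and y_pred_proba from the snippet above, prints a few thresholds alongside the FPR/TPR pair each one produces.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
# Print every 10th threshold with the FPR/TPR it yields
for thr, f, t in zip(thresholds[::10], fpr[::10], tpr[::10]):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")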
TPR — True Positive Rate
A true positive rate (TPR) of 1 means the model correctly flagged every actual case of breast cancer in the test set; no positive case was missed.
Mathematically, TPR is calculated as TPR = TP/(TP+FN), where TP is the number of true positives and FN the number of false negatives.
TPR is also known as Recall. A TPR of 1 at a given threshold means the model made no false-negative mistakes at that threshold.
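As a sanity check, recall can be computed by hand from the hard predictions (y_pred) made earlier. A minimal sketch, assuming the variables from the training snippet are still in scope:
from sklearn.metrics import confusion_matrix
# Counts for the 0.5-threshold predictions: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr_manual = tp / (tp + fn)  # TPR = TP / (TP + FN), i.e. recall
print(f"TPR (recall) at the 0.5 threshold = {tpr_manual:.3f}")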
FPR — False Positive Rate
FPR measures how often a model incorrectly classifies negative instances as positive. It is calculated as FPR = FP/(FP+TN), where FP is the number of false positives and TN the number of true negatives. A high FPR means the model frequently predicts a positive outcome when the true label is negative.
For example, in breast cancer diagnosis, a high FPR would mean the model often says a patient has cancer when they don’t.
In pregnancy tests, a high FPR would indicate the test frequently shows positive when the woman isn’t pregnant.
An ideal model has an FPR close to 0: it correctly identifies true negatives and rarely raises false alarms. Keeping the FPR low is crucial when false positives carry significant consequences, as in medical diagnosis or security screening; a model with a low FPR classifies negative instances accurately and maintains trust in its predictions.
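The same confusion-matrix counts give the FPR directly. Another minimal sketch, again assuming y_test and y_pred from the training snippet:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
fpr_manual = fp / (fp + tn)  # FPR = FP / (FP + TN)
print(f"FPR at the 0.5 threshold = {fpr_manual:.3f}")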
AUROC — Combining TPR and FPR
To evaluate the model’s performance on the test data, plot the ROC curve and examine the area under the curve (AUC). The closer the AUC is to 1, the better the model separates the two classes, while an AUC of 0.5 is no better than random guessing. In the code above, I achieved an AUC of approximately 0.99, which suggests that the Logistic Regression model is highly effective on this dataset.
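As a closing sanity check, the AUC really is the area under the plotted curve: metrics.auc integrates the (FPR, TPR) points with the trapezoidal rule and should match roc_auc_score. The dashed diagonal added below is the random-guessing baseline with an AUC of 0.5 (a sketch reusing fpr and tpr from the main snippet):
auc_from_curve = metrics.auc(fpr, tpr)  # trapezoidal area under the ROC curve
print(f"AUC from the curve = {auc_from_curve:.4f}")
plt.plot(fpr, tpr, label=f"model, AUC = {auc_from_curve:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance, AUC = 0.5")  # random baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()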