Imbalanced Data: ROC Analysis and Threshold Tuning

Train a classifier on the German Credit dataset and tune the output of a probabilistic model using ROC threshold analysis.


Goal

In this exercise, we will create a machine learning model that predicts the credit risk of an individual (e.g., the probability of being a good or bad credit applicant for the bank). Our goal is not to obtain an optimal classifier for this task, but to learn how to get a better understanding of the predictions made by this model. This means looking at its sensitivity (ability to correctly identify positives) and specificity (ability to correctly identify negatives). The sensitivity is also known as the true positive rate (TPR) and the specificity is equal to (1 - FPR) where FPR is the false positive rate.

We will also cover how to obtain different response predictions from a probabilistic model by modifying the threshold. We will inspect this relationship via the ROC curve and tune the threshold for a given classifier to optimize our response predictions.

1 Training a random forest on the German credit task

First load the pre-defined German credit task and set the positive class to "good". Train a random forest on 2/3 of the data (training data) and make probabilistic predictions on the remaining 1/3 (test data).

Hint 1:
  • Create the German credit task using tsk() and set the positive class by modifying, e.g., task$positive.
  • Create a learner using lrn() and make sure to specify the predict_type so that the learner will predict probabilities instead of classes.
  • When calling the methods $train() and $predict() of the learner, you can pass an argument row_ids to specify which observations should be used for the train and test data.
  • You can generate random train-test splits using, e.g., the partition() function.
Hint 2:
library(mlr3verse)

task = tsk(...)
task$positive = ...
learner = lrn(..., predict_type = ...)
ids = partition(...)
learner$train(..., row_ids = ...)
pred = learner$predict(..., row_ids = ...)
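
One possible way to fill in this skeleton is sketched below. It assumes the ranger-based random forest learner from mlr3learners and a 2/3 train-test split; other learners or split ratios would work analogously, and the seed is only set here for reproducibility.

library(mlr3verse)

set.seed(1)  # assumption: fixed seed so the split is reproducible
task = tsk("german_credit")
task$positive = "good"
learner = lrn("classif.ranger", predict_type = "prob")
ids = partition(task, ratio = 2/3)  # stratified train-test split
learner$train(task, row_ids = ids$train)
pred = learner$predict(task, row_ids = ids$test)
pred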

2 Confusion matrices and classification thresholds

Inspect and save the confusion matrix of the predictions made in the previous exercise. Manually calculate the FPR and TPR using the values from the confusion matrix. Can you think of another way to compute the TPR and FPR using mlr3 instead of manually computing them using the confusion matrix?

Recap

A confusion matrix is a special kind of contingency table with two dimensions “actual” and “predicted” to summarize the ground truth classes (truth) vs. the predicted classes of a classifier (response).

Binary classifiers can be understood as first predicting a score (possibly a probability) and then classifying all instances with a score greater than a certain threshold \(t\) as positive and all others as negative. This means that one can obtain different class predictions using different threshold values \(t\).
Hint 1: A prediction object has a field $confusion. Since good was used as the positive class here, the TPR is \(P(\hat{Y} = good | Y = good)\) and the FPR is \(P(\hat{Y} = good | Y = bad)\) (where \(\hat{Y}\) refers to the predicted response of the classifier and \(Y\) to the ground truth class labels). Instead of manually computing the TPR and FPR, there are appropriate performance measures implemented in mlr3 that you could use.
Hint 2:

You need to replace ... in the code below to access the appropriate columns and rows, e.g., confusion1[1, 1] is the element in the first row and first column of the confusion matrix and tells you how many observations with ground truth \(Y = good\) were classified into the class \(\hat{Y} = good\) by the learner.

confusion1 = pred$confusion
TPR1 =  confusion1[...] / sum(confusion1[...])
TPR1
FPR1 = confusion1[...] / sum(confusion1[...])
FPR1

The names of the TPR and FPR performance measures implemented in mlr3 can be found by looking at as.data.table(mlr_measures). You can use the code below and pass the names of the mlr3 measures in a vector to compute both the TPR and FPR:

pred$score(msrs(...))
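
A sketch of both approaches, relying on the mlr3 convention that pred$confusion has the predicted response in rows and the ground truth in columns, with the positive class "good" first:

confusion1 = pred$confusion
# TPR = P(Yhat = good | Y = good): "good" predictions among all truly good cases
TPR1 = confusion1["good", "good"] / sum(confusion1[, "good"])
TPR1
# FPR = P(Yhat = good | Y = bad): "good" predictions among all truly bad cases
FPR1 = confusion1["good", "bad"] / sum(confusion1[, "bad"])
FPR1
# the same quantities via built-in performance measures
pred$score(msrs(c("classif.tpr", "classif.fpr")))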

3 Asymmetric costs

Think about which type of error is worse for the given task and obtain new predictions (without retraining the model) that take this into account.

Then calculate the FPR and TPR and compare them with the results from the previous exercise.

Hint 1: A prediction object has the method $set_threshold() that can be used to set a custom threshold and which will update the predicted classes according to the selected threshold value.
Hint 2:
pred$set_threshold(...)
confusion2 = pred$confusion
TPR2 =  confusion2[...] / sum(...)
FPR2 = confusion2[...] / sum(...)

TPR2 - TPR1
FPR2 - FPR1
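
For the credit risk task, granting credit to a bad applicant (a false positive, since "good" is the positive class) is arguably the more harmful error, so one would raise the threshold for predicting "good". A sketch with an exemplary threshold of 0.7 (the exact value is a judgment call and not given in the exercise):

pred$set_threshold(0.7)  # require a higher predicted probability of "good"
confusion2 = pred$confusion
TPR2 = confusion2["good", "good"] / sum(confusion2[, "good"])
FPR2 = confusion2["good", "bad"] / sum(confusion2[, "bad"])

# raising the threshold typically lowers both the TPR and the FPR
TPR2 - TPR1
FPR2 - FPR1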

4 ROC curve

In the previous two exercises, we have calculated the FPR and TPR for two thresholds. Now visualize the FPR and TPR for all possible thresholds, i.e. the ROC curve.

Recap

The receiver operating characteristic (ROC) displays the sensitivity and specificity for all possible thresholds.
Hint 1: You can use autoplot() on the prediction object and set the type argument to produce a ROC curve. You can open the help page of autoplot for a prediction object using ?mlr3viz::autoplot.PredictionClassif.
Hint 2:
autoplot(pred, type = ...)
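
Filled in, this is simply (a sketch, assuming the ROC plot type of mlr3viz):

autoplot(pred, type = "roc")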

5 Threshold tuning

In this exercise, we assume that a false positive is four times as costly as a false negative. Use a measure that takes classification costs into account (e.g., misclassification costs; see msr("classif.costs")$help() for details) and tune the threshold of our classifier to systematically optimize this asymmetric cost function.

5.1 Cost Matrix

First, define the cost matrix. Here, this is a 2x2 matrix with rows corresponding to the predicted class and columns corresponding to the true class. The first row/column corresponds to the "good", the second to the "bad" credit rating.

Hint 1:

The order of the classes in the rows and columns of the matrix must correspond to the order of classes in task$class_names.
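
A possible cost matrix for the stated 4:1 cost ratio, assuming zero costs for correct predictions (rows = predicted class, columns = true class, order "good" then "bad"):

costs = matrix(c(0, 1, 4, 0), nrow = 2,
  dimnames = list(c("good", "bad"), c("good", "bad")))  # rows: predicted, columns: truth
costs
# costs["good", "bad"] = 4: a false positive is four times as costly
# costs["bad", "good"] = 1: as a false negative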

5.2 Cost-Sensitive Measure

Next, define a cost-sensitive measure. This measure takes one argument, costs, which is a matrix with row and column names corresponding to the class labels in the task of interest.

Hint 1: You can use as.data.table(mlr_measures) to find the relevant measure.
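
Assuming the cost matrix costs defined above, the relevant measure is classif.costs (a sketch):

meas_costs = msr("classif.costs", costs = costs)
pred$score(meas_costs)  # average cost of the current predictions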

5.3 Thresholding

With default settings, a model classifies a customer as a good credit risk if the predicted probability is greater than 0.5. Here, this might not be sensible: given the non-uniform costs, we would likely act more conservatively and reject more credit applications by using a higher threshold. Use the autoplot() function to plot the costs associated with predicting at various thresholds between 0 and 1 for the random forest predictions stored in the pred object from before.

Hint 1: You need to specify type = "threshold" within autoplot() as well as the previously defined measure.
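
With the cost-sensitive measure from above, the threshold plot could be produced along these lines (a sketch):

autoplot(pred, type = "threshold", measure = meas_costs)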

5.4 Tuning the Threshold

The search for the optimal threshold can be automated via po("tunethreshold"). Create a graph that consists of a logistic regression learner and this threshold tuning pipeline object. Then, turn the graph into a learner as in previous tutorials. Finally, benchmark the pipeline against a standard logistic regression learner using 3-fold CV.

Hint 1:

You can use this code skeleton for the pipeline:

logreg = po("learner_cv", lrn(...)) # base learner
graph =  logreg %>>% po(...) # graph with threshold tuning
Hint 2:

You can use this code skeleton for the benchmark:

learners = list(..., lrn("classif.log_reg"))
bmr = benchmark(benchmark_grid(task, learners,
  rsmp("cv", folds = 3)))

6 ROC Comparison

In this exercise, we will explore how to compare two learners by looking at their ROC curves.

The basis for this exercise is a benchmark experiment that compares a classification tree with a random forest on the German credit task.

Because we are now not only focused on analyzing a given prediction but on comparing two learners, we use a 10-fold cross-validation to reduce the uncertainty of this comparison.

Conduct the benchmark experiment and show both ROC curves in one plot. Which learner performs better in this case?

Hint 1: Use benchmark_grid() to create the experiment design and execute it using benchmark(). You can also apply the function autoplot() to benchmark results.
Hint 2:
resampling = rsmp(...)
learners = lrns(...)
design = benchmark_grid(...)
bmr = benchmark(...)
autoplot(...)
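
A possible completion of the skeleton, assuming classif.rpart as the classification tree and classif.ranger as the random forest; both need predict_type = "prob" to produce ROC curves:

resampling = rsmp("cv", folds = 10)
learners = lrns(c("classif.rpart", "classif.ranger"), predict_type = "prob")
design = benchmark_grid(task, learners, resampling)
bmr = benchmark(design)
autoplot(bmr, type = "roc")  # one ROC curve per learner in a single plot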

7 Area under the curve

In the previous exercise, we learned how to compare two learners using the ROC curve. Although the random forest dominated the classification tree in this specific case, it is more common for the ROC curves of two models to cross, making a comparison in the sense of \(>\) / \(<\) impossible.

The area under the curve (AUC) addresses this problem by summarizing the ROC curve by the area under it (normalized to 1), which allows for a scalar comparison.

Compare the AUC for the benchmark result.

Hint 1: You can use the autoplot() function and use the AUC as performance measure in the measure argument.
Hint 2:
autoplot(bmr, measure = msr(...))
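
Filled in (assuming the benchmark result bmr from the previous exercise):

autoplot(bmr, measure = msr("classif.auc"))
bmr$aggregate(msr("classif.auc"))  # numeric comparison of the AUC values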

Bonus exercise: Unbiased performance estimation

Revisit the exercise where we tuned the threshold. Is the performance estimate for the best threshold unbiased? If not, does this mean that our tuning strategy was invalid?

Hint 1: Did we use an independent test set?
Hint 2: Think of the uncertainty when estimating the ROC curve.

Summary

In this exercise, we improved our understanding of the performance of binary classifiers by means of the confusion matrix and a focus on different error types. We have seen how we can analyze and compare classifiers using the ROC curve and how to improve our response predictions using threshold tuning.