```r
library(data.table)
oil = fread("../data/oil_spill.csv")
```
Goal
The goal of this exercise is to learn how to deal with imbalanced classification problems: selecting an appropriate performance metric and adjusting standard machine learning models to improve with respect to that metric.
Oil spill data
Data description
The data we will be using in this exercise was first used in the 1998 paper by Miroslav Kubat et al., “Machine Learning for the Detection of Oil Spills in Satellite Radar Images”.
The dataset contains a total of 937 observations. Each observation represents a patch of one of nine satellite images and contains information about the patch number and whether an oil spill is present. The rows in the dataset are ordered by image and patch. The data does not contain the original images but extracted numerical features.
Data dictionary
- V1: The patch number
- V2 - V49: The features that were extracted from the images by the Canadian Environmental Hazards Detection System (CEHDS).
- V50: Whether an oil spill is present (encoded as 1) or not (encoded as 0)
Descriptive analysis and preprocessing
In our modeling approach we ignore spatial correlation in the patches and therefore drop the first column.
```r
oil$V1 = NULL
```
We also encode the target variable as a factor and rename it.
```r
oil$oilspill = factor(oil$V50, levels = c(0, 1), labels = c("no", "yes"))
oil$V50 = NULL
```
The following gives us a nice compact summary of the data.
```r
skimr::skim(oil)
```
| Name | oil |
|---|---|
| Number of rows | 937 |
| Number of columns | 49 |
| Key | NULL |
| Column type frequency: | |
| factor | 1 |
| numeric | 48 |
| Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
oilspill | 0 | 1 | FALSE | 2 | no: 896, yes: 41 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
V2 | 0 | 1 | 332.84 | 1931.94 | 10.00 | 20.00 | 65.00 | 132.00 | 32389.00 | ▇▁▁▁▁ |
V3 | 0 | 1 | 698.71 | 599.97 | 1.92 | 85.27 | 704.37 | 1223.48 | 1893.08 | ▇▁▃▃▂ |
V4 | 0 | 1 | 870.99 | 522.80 | 1.00 | 444.20 | 761.28 | 1260.37 | 2724.57 | ▇▇▆▂▁ |
V5 | 0 | 1 | 84.12 | 45.36 | 0.00 | 54.00 | 73.00 | 117.00 | 180.00 | ▃▇▅▃▃ |
V6 | 0 | 1 | 769696.38 | 3831151.03 | 70312.00 | 125000.00 | 186300.00 | 330468.00 | 71315000.00 | ▇▁▁▁▁ |
V7 | 0 | 1 | 43.24 | 12.72 | 21.24 | 33.65 | 39.97 | 52.42 | 82.64 | ▅▇▆▂▁ |
V8 | 0 | 1 | 9.13 | 3.59 | 0.83 | 6.75 | 8.20 | 10.76 | 24.69 | ▁▇▂▁▁ |
V9 | 0 | 1 | 3940.71 | 8167.43 | 667.00 | 1371.00 | 2090.00 | 3435.00 | 160740.00 | ▇▁▁▁▁ |
V10 | 0 | 1 | 0.22 | 0.09 | 0.02 | 0.16 | 0.20 | 0.26 | 0.74 | ▃▇▂▁▁ |
V11 | 0 | 1 | 109.89 | 61.46 | 41.00 | 83.50 | 99.80 | 115.40 | 901.70 | ▇▁▁▁▁ |
V12 | 0 | 1 | 0.25 | 0.09 | 0.02 | 0.20 | 0.24 | 0.29 | 0.66 | ▁▇▃▁▁ |
V13 | 0 | 1 | 0.31 | 0.12 | 0.03 | 0.24 | 0.29 | 0.35 | 0.83 | ▁▇▂▁▁ |
V14 | 0 | 1 | 0.48 | 0.22 | 0.05 | 0.33 | 0.43 | 0.61 | 1.23 | ▂▇▃▂▁ |
V15 | 0 | 1 | 0.18 | 0.08 | 0.01 | 0.13 | 0.17 | 0.22 | 0.65 | ▃▇▁▁▁ |
V16 | 0 | 1 | 0.30 | 0.20 | 0.01 | 0.13 | 0.27 | 0.41 | 1.12 | ▇▆▂▁▁ |
V17 | 0 | 1 | 77.41 | 304.33 | 4.82 | 21.09 | 34.72 | 65.95 | 6058.23 | ▇▁▁▁▁ |
V18 | 0 | 1 | 31.15 | 152.69 | 1.96 | 11.68 | 16.49 | 24.75 | 4061.15 | ▇▁▁▁▁ |
V19 | 0 | 1 | 0.91 | 0.68 | 0.13 | 0.37 | 0.70 | 1.06 | 2.60 | ▇▆▁▂▂ |
V20 | 0 | 1 | 0.23 | 0.08 | 0.02 | 0.18 | 0.22 | 0.27 | 0.65 | ▁▇▂▁▁ |
V21 | 0 | 1 | 0.29 | 0.11 | 0.02 | 0.23 | 0.27 | 0.32 | 0.77 | ▁▇▂▁▁ |
V22 | 0 | 1 | 76.09 | 22.94 | 47.66 | 55.85 | 69.09 | 85.22 | 126.08 | ▆▇▃▁▃ |
V23 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
V24 | 0 | 1 | 0.64 | 0.30 | 0.17 | 0.45 | 0.50 | 0.96 | 1.33 | ▃▇▁▆▁ |
V25 | 0 | 1 | 559.91 | 612.46 | 132.78 | 221.97 | 351.67 | 422.12 | 2036.80 | ▇▁▁▁▂ |
V26 | 0 | 1 | 0.58 | 0.70 | -0.71 | 0.18 | 0.87 | 1.01 | 1.83 | ▂▆▁▇▂ |
V27 | 0 | 1 | 7.50 | 3.97 | 2.96 | 4.66 | 5.07 | 12.06 | 14.78 | ▇▁▃▂▂ |
V28 | 0 | 1 | 0.61 | 0.85 | -1.79 | 0.09 | 0.48 | 0.98 | 5.72 | ▁▇▂▁▁ |
V29 | 0 | 1 | 4.27 | 3.62 | 1.44 | 2.64 | 3.33 | 4.50 | 39.42 | ▇▁▁▁▁ |
V30 | 0 | 1 | -2.83 | 1.62 | -7.81 | -3.25 | -2.78 | -1.62 | 1.28 | ▂▁▇▇▁ |
V31 | 0 | 1 | -0.43 | 0.22 | -1.37 | -0.53 | -0.38 | -0.28 | 0.00 | ▁▁▂▇▃ |
V32 | 0 | 1 | 1.82 | 0.64 | 0.00 | 1.22 | 1.95 | 2.17 | 2.98 | ▁▅▂▇▂ |
V33 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.87 | ▇▁▁▁▁ |
V34 | 0 | 1 | 1.82 | 0.64 | 0.00 | 1.21 | 1.95 | 2.17 | 2.98 | ▁▅▂▇▂ |
V35 | 0 | 1 | 43.09 | 95.17 | 3.00 | 12.00 | 23.00 | 39.00 | 1695.00 | ▇▁▁▁▁ |
V36 | 0 | 1 | 2432.69 | 5219.38 | 360.00 | 720.00 | 1350.00 | 2160.00 | 95310.00 | ▇▁▁▁▁ |
V37 | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.02 | ▅▁▇▁▁ |
V38 | 0 | 1 | 31.24 | 31.58 | 5.05 | 13.45 | 23.63 | 37.76 | 441.23 | ▇▁▁▁▁ |
V39 | 0 | 1 | 91.19 | 21.98 | 64.00 | 78.00 | 82.00 | 99.00 | 143.00 | ▆▇▃▁▃ |
V40 | 0 | 1 | 60.55 | 13.84 | 39.00 | 50.00 | 55.00 | 67.00 | 86.00 | ▂▇▅▂▃ |
V41 | 0 | 1 | 933.93 | 1001.68 | 0.00 | 450.00 | 685.42 | 1053.42 | 11949.33 | ▇▁▁▁▁ |
V42 | 0 | 1 | 427.57 | 715.39 | 0.00 | 180.00 | 270.00 | 460.98 | 11500.00 | ▇▁▁▁▁ |
V43 | 0 | 1 | 255.44 | 534.31 | 0.00 | 90.80 | 161.65 | 265.51 | 9593.48 | ▇▁▁▁▁ |
V44 | 0 | 1 | 106.11 | 135.62 | 0.00 | 50.12 | 73.85 | 125.81 | 1748.13 | ▇▁▁▁▁ |
V45 | 0 | 1 | 5.01 | 5.03 | 0.00 | 2.37 | 3.85 | 6.32 | 76.63 | ▇▁▁▁▁ |
V46 | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
V47 | 0 | 1 | 7985.72 | 6854.50 | 2051.50 | 3760.57 | 5509.43 | 9521.93 | 55128.46 | ▇▁▁▁▁ |
V48 | 0 | 1 | 61.69 | 10.41 | 35.95 | 65.72 | 65.93 | 66.13 | 66.45 | ▂▁▁▁▇ |
V49 | 0 | 1 | 8.12 | 2.91 | 5.81 | 6.34 | 7.22 | 7.84 | 15.44 | ▇▂▁▁▂ |
After inspecting the distribution in more detail, we notice the following:
- The target variable oilspill is highly imbalanced: there are 896 observations without an oil spill and only 41 with one
- The feature V23 is constant (always 0), so we can remove it
- The feature V33 has only 4 non-zero values, so we can drop it as well, as we do not expect a machine learning algorithm to learn much from it
- The features are not on a common scale. Because we will only use tree-based learners in this exercise, this is not a problem
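Observations like the constant column can also be found programmatically; a small sketch, assuming the `oil` data.table prepared above:

```r
# Count distinct values per column to find constant or near-constant features
n_distinct = sapply(oil, function(x) length(unique(x)))
names(n_distinct)[n_distinct == 1]  # constant columns such as V23
sort(n_distinct)[1:5]               # near-constant candidates such as V33
```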
```r
oil$V23 = NULL
oil$V33 = NULL
```
1 Benchmarking standard algorithms
We will start by comparing two standard ML algorithms - a classification tree and a random forest - without taking the imbalanced class distribution into account. Inspecting some standard measures will reveal problems that will be addressed in the subsequent exercises.
Start by creating a classification task with "oilspill"
as the target variable and all other variables as features (except the ones we removed earlier). Set “yes” as the positive class and stratify with respect to the target variable to ensure the same class distribution in each fold.
Then, compare a classification tree with a random forest with respect to their accuracy, FPR, and TPR. As a validation strategy we use stratified 3-fold cross-validation, repeated 5 times. We use only 3 folds because we have very few positive labels, and we repeat the cross-validation because the dataset is small.
Inspect the results and answer whether accuracy is a good metric for this problem.
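Before running the benchmark, it is worth noting what a trivial majority-class classifier would already achieve; using the class counts from the summary above:

```r
# Accuracy of always predicting "no" (the majority class)
n_no  = 896
n_yes = 41
n_no / (n_no + n_yes)  # about 0.956, while detecting zero spills
```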
Recap: Stratification
Stratification consists of dividing the population into subsets (called strata), within each of which an independent sample is selected. When setting the column role "stratum", resamplings that are applied to the task will automatically take the stratum into account.

For binary classification problems, the column role "positive" is important, because metrics like the TPR and FPR can only be interpreted when knowing which class is defined as the positive class.
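The effect of the "stratum" role can be illustrated with a base-R sketch (not mlr3 code) that assigns fold ids separately within each class, so every fold keeps the overall class proportions:

```r
set.seed(1)
y = factor(rep(c("no", "yes"), times = c(896, 41)))

# Assign fold ids 1..3 within each class separately
fold = integer(length(y))
for (cl in levels(y)) {
  idx = which(y == cl)
  fold[idx] = sample(rep_len(1:3, length(idx)))
}
table(y, fold)  # roughly 299 "no" and 13-14 "yes" per fold
```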
Hint 1:
The function as_task_classif() can help to create a classification task from the dataset. You can set the positive class and the stratification role by changing task$positive and task$col_roles$stratum after creating the task. Then, create the benchmark design using benchmark_grid() and execute it using benchmark(). To calculate the performance metrics, use the method $aggregate(), which takes a list of measures that can be constructed using msrs().
Hint 2:
```r
library(mlr3verse)

task = as_task_classif(...)
task$col_roles$stratum = ...
task$positive = ...

learners = lrns(...)
design = benchmark_grid(...)
bmr = benchmark(...)
bmr$aggregate(...)
```
2 Selecting a suitable performance metric
When selecting a suitable performance metric, we have to take the properties of the task into consideration. The detection of a spill requires mobilizing an expensive response, while missing an event is equally expensive, causing damage to the environment. Therefore, both class labels are important.
For that reason we want to select a measure that is insensitive to changes in the class distribution in the test data. Can you modify the definition of the accuracy so that it does not depend on the distribution of the target variable? Compare this new metric with the standard accuracy used in the previous exercise.
Hint 1:
The accuracy can be defined as \[ACC = P(\hat{Y} = 1 | Y = 1) \times P(Y = 1) + P(\hat{Y} = 0 | Y = 0) \times P(Y = 0)\] Note that \(TPR = P(\hat{Y} = 1 | Y = 1)\) is the true positive rate and \(TNR = P(\hat{Y} = 0 | Y = 0)\) the true negative rate, where TNR = 1 - FPR (false positive rate). Hence, the accuracy can be viewed as a weighted average of TPR and TNR, where \(P(Y = 1)\) (and \(P(Y = 0) = 1 - P(Y = 1)\)) are used as the corresponding weights. The new metric should not depend on the class distribution \(P(Y = 1)\) (and \(P(Y = 0) = 1 - P(Y = 1)\)), i.e., we can equally weight the TPR and TNR to obtain a metric that does not take into account the class distribution. The resulting metric is known as the balanced accuracy: \(BACC = 0.5 \cdot TPR + 0.5 \cdot TNR = 0.5 \cdot TPR + 0.5 \cdot (1-FPR)\)
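As a quick numeric illustration (the confusion-matrix counts below are made up), accuracy and balanced accuracy can disagree sharply on imbalanced data:

```r
# Hypothetical test-set counts: many negatives, few positives
tn = 290; fp = 10; fn = 8; tp = 2

acc  = (tp + tn) / (tp + tn + fp + fn)
tpr  = tp / (tp + fn)  # true positive rate
tnr  = tn / (tn + fp)  # true negative rate
bacc = 0.5 * tpr + 0.5 * tnr

round(c(accuracy = acc, balanced_accuracy = bacc), 3)
# accuracy is about 0.94, but balanced accuracy only about 0.58
```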
Hint 2:
Read msr("classif.bacc")$help().
3 Upsampling the minority class
Although we have selected a performance metric (balanced accuracy) that is insensitive to the class distribution in the test data, we have considerably fewer positive than negative observations in the training data. This will make the random forest focus on the latter (no spill).
To address that, create a machine learning pipeline that first upsamples the minority class by a factor of two and then fits a random forest. Add it to the benchmark result and compare the models with respect to the balanced accuracy.
Recap: Upsampling
Upsampling is a procedure where additional minority-class observations (duplicates or synthetically generated points) are injected into the dataset.
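In its simplest form (random oversampling, as opposed to synthetic methods like SMOTE), upsampling just duplicates randomly drawn minority rows; a toy base-R sketch:

```r
set.seed(1)
dat = data.frame(x = rnorm(10), y = rep(c("no", "yes"), times = c(8, 2)))

# Duplicate randomly drawn minority rows to double their count (ratio = 2)
minority = which(dat$y == "yes")
extra    = sample(minority, size = length(minority), replace = TRUE)
dat_up   = rbind(dat, dat[extra, ])

table(dat_up$y)  # "yes" doubled from 2 to 4, "no" unchanged
```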
Hint 1:
Use po("classbalancing") and combine it with the learner. PipeOps can be chained to a graph using %>>%. You can convert a graph to a learner using as_learner(). Remember to use the instantiated resampling from the previous benchmark experiment. A new resample result can be added to a benchmark result using c(...).
Hint 2:
```r
graph = po(
  "classbalancing",
  ratio = ...,
  reference = ...,
  adjust = ...
) %>>% lrn(...)

resampling = design$resampling[[1L]]

learner_balanced = as_learner(...)

rr = resample(...)

bmr = c(...)

autoplot(bmr, measure = ...)
```
4 Additional downsampling of the majority class
Repeat the previous experiment, but this time not only upsample the minority class, but also downsample the majority class to the same count.
Recap: Downsampling
Downsampling is a mechanism that reduces the number of training samples in the majority class.
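The corresponding base-R sketch for random downsampling keeps only a subset of the majority rows (here down to the minority count):

```r
set.seed(1)
dat = data.frame(x = rnorm(10), y = rep(c("no", "yes"), times = c(8, 2)))

# Keep only as many randomly chosen majority rows as there are minority rows
keep_no  = sample(which(dat$y == "no"), size = sum(dat$y == "yes"))
dat_down = dat[c(keep_no, which(dat$y == "yes")), ]

table(dat_down$y)  # both classes now have 2 observations
```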
Hint 1:
Adjust the pipeline from the previous exercise. Have a look at what the adjust parameter of po("classbalancing") can be set to.
Hint 2:
```r
graph = po(
  "classbalancing",
  ratio = ...,
  reference = ...,
  adjust = ...
) %>>%
  ...

learner_balanced_all = as_learner(graph)
# Set an id so that you can distinguish the pipelines when plotting the results
learner_balanced_all$id = ...
...
```
5 Instance-specific weights
Add another logistic regression learner to the benchmark that uses instance-specific weights, assigning each observation in the minority class double the weight of one in the majority class. Further, add a simple (unweighted) logistic regression learner to assess the additional performance difference due to the instance-specific weights.
Hint 1:
Use a graph that contains po("classweights") to specify the weights before the model training.
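Outside of the pipeline, the same weighting idea can be sketched with a weighted logistic regression in base R (toy data; the 2:1 weights match the scheme described above):

```r
set.seed(1)
dat = data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(90, 10)))
)

# Minority ("yes") observations get twice the weight of majority ones
w = ifelse(dat$y == "yes", 2, 1)

fit = glm(y ~ x, data = dat, family = binomial(), weights = w)
summary(fit)$coefficients
```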
Bonus exercise: Tuning the sampling rate
In the previous exercise we set the ratio to 2. See if tuning this value improves the result. Construct an auto_tuner with a 3-fold inner CV and a grid search over a suitable range for the ratio.
Hint 1:
```r
graph = po("classbalancing", reference = "minor", adjust = "all") %>>%
  lrn("classif.ranger")
learner_balanced_tuned = as_learner(graph)

search_space = ps(classbalancing.ratio = ...)

balanced_autotuner = auto_tuner(
  tuner = tnr(..., resolution = ...),
  learner = ...,
  resampling = rsmp(...),
  measure = msr(...),
  search_space = search_space
)

rr = resample(task, balanced_autotuner, resampling)

bmr = c(bmr, rr)

autoplot(bmr, measure = msr("classif.bacc"))
```
Summary
In this exercise we addressed the problem of imbalanced class distributions in classification. We saw that standard metrics like accuracy can be misleading for such problems and learned how accuracy can be modified into the balanced accuracy. We then learned how to change the training distribution, using up- and downsampling, to improve the results. Finally, we got a better understanding of the importance of stratification when there are only very few positive labels.

A similar use case can be found here: https://mlr-org.com/gallery/2020-03-30-imbalanced-data/