```r
library(data.table)
oil = fread("../data/oil_spill.csv")
```
Goal
The goal of this exercise is to learn how to deal with imbalanced classification problems: selecting an appropriate performance metric and adjusting standard machine learning models to improve with respect to that metric.
Oil spill data
Data description
The data we will be using in this exercise was first used in the 1998 paper by Miroslav Kubat et al., “Machine Learning for the Detection of Oil Spills in Satellite Radar Images”.
The dataset contains a total of 937 observations. Each observation represents a patch of one of nine satellite images and contains information about the patch number and whether an oil spill is present. The rows in the dataset are ordered by image and patch. The data does not contain the original images but extracted numerical features.
Data dictionary
- V1: The patch number
- V2 - V49: The features that were extracted from the images by the Canadian Environmental Hazards Detection System (CEHDS).
- V50: Whether an oil spill is present (encoded as 1) or not (encoded as 0)
Descriptive analysis and preprocessing
In our modeling approach we ignore spatial correlation in the patches and therefore drop the first column.
```r
oil$V1 = NULL
```
We also encode the target variable as a factor and rename it.
```r
oil$oilspill = factor(oil$V50, levels = c(0, 1), labels = c("no", "yes"))
oil$V50 = NULL
```
The following gives us a nice compact summary of the data.
```r
skimr::skim(oil)
```
| Name | oil |
|---|---|
| Number of rows | 937 |
| Number of columns | 49 |
| Key | NULL |
| Column type frequency: | |
| factor | 1 |
| numeric | 48 |
| Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
oilspill | 0 | 1 | FALSE | 2 | no: 896, yes: 41 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
V2 | 0 | 1 | 332.84 | 1931.94 | 10.00 | 20.00 | 65.00 | 132.00 | 32389.00 | ▇▁▁▁▁ |
V3 | 0 | 1 | 698.71 | 599.97 | 1.92 | 85.27 | 704.37 | 1223.48 | 1893.08 | ▇▁▃▃▂ |
V4 | 0 | 1 | 870.99 | 522.80 | 1.00 | 444.20 | 761.28 | 1260.37 | 2724.57 | ▇▇▆▂▁ |
V5 | 0 | 1 | 84.12 | 45.36 | 0.00 | 54.00 | 73.00 | 117.00 | 180.00 | ▃▇▅▃▃ |
V6 | 0 | 1 | 769696.38 | 3831151.03 | 70312.00 | 125000.00 | 186300.00 | 330468.00 | 71315000.00 | ▇▁▁▁▁ |
V7 | 0 | 1 | 43.24 | 12.72 | 21.24 | 33.65 | 39.97 | 52.42 | 82.64 | ▅▇▆▂▁ |
V8 | 0 | 1 | 9.13 | 3.59 | 0.83 | 6.75 | 8.20 | 10.76 | 24.69 | ▁▇▂▁▁ |
V9 | 0 | 1 | 3940.71 | 8167.43 | 667.00 | 1371.00 | 2090.00 | 3435.00 | 160740.00 | ▇▁▁▁▁ |
V10 | 0 | 1 | 0.22 | 0.09 | 0.02 | 0.16 | 0.20 | 0.26 | 0.74 | ▃▇▂▁▁ |
V11 | 0 | 1 | 109.89 | 61.46 | 41.00 | 83.50 | 99.80 | 115.40 | 901.70 | ▇▁▁▁▁ |
V12 | 0 | 1 | 0.25 | 0.09 | 0.02 | 0.20 | 0.24 | 0.29 | 0.66 | ▁▇▃▁▁ |
V13 | 0 | 1 | 0.31 | 0.12 | 0.03 | 0.24 | 0.29 | 0.35 | 0.83 | ▁▇▂▁▁ |
V14 | 0 | 1 | 0.48 | 0.22 | 0.05 | 0.33 | 0.43 | 0.61 | 1.23 | ▂▇▃▂▁ |
V15 | 0 | 1 | 0.18 | 0.08 | 0.01 | 0.13 | 0.17 | 0.22 | 0.65 | ▃▇▁▁▁ |
V16 | 0 | 1 | 0.30 | 0.20 | 0.01 | 0.13 | 0.27 | 0.41 | 1.12 | ▇▆▂▁▁ |
V17 | 0 | 1 | 77.41 | 304.33 | 4.82 | 21.09 | 34.72 | 65.95 | 6058.23 | ▇▁▁▁▁ |
V18 | 0 | 1 | 31.15 | 152.69 | 1.96 | 11.68 | 16.49 | 24.75 | 4061.15 | ▇▁▁▁▁ |
V19 | 0 | 1 | 0.91 | 0.68 | 0.13 | 0.37 | 0.70 | 1.06 | 2.60 | ▇▆▁▂▂ |
V20 | 0 | 1 | 0.23 | 0.08 | 0.02 | 0.18 | 0.22 | 0.27 | 0.65 | ▁▇▂▁▁ |
V21 | 0 | 1 | 0.29 | 0.11 | 0.02 | 0.23 | 0.27 | 0.32 | 0.77 | ▁▇▂▁▁ |
V22 | 0 | 1 | 76.09 | 22.94 | 47.66 | 55.85 | 69.09 | 85.22 | 126.08 | ▆▇▃▁▃ |
V23 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
V24 | 0 | 1 | 0.64 | 0.30 | 0.17 | 0.45 | 0.50 | 0.96 | 1.33 | ▃▇▁▆▁ |
V25 | 0 | 1 | 559.91 | 612.46 | 132.78 | 221.97 | 351.67 | 422.12 | 2036.80 | ▇▁▁▁▂ |
V26 | 0 | 1 | 0.58 | 0.70 | -0.71 | 0.18 | 0.87 | 1.01 | 1.83 | ▂▆▁▇▂ |
V27 | 0 | 1 | 7.50 | 3.97 | 2.96 | 4.66 | 5.07 | 12.06 | 14.78 | ▇▁▃▂▂ |
V28 | 0 | 1 | 0.61 | 0.85 | -1.79 | 0.09 | 0.48 | 0.98 | 5.72 | ▁▇▂▁▁ |
V29 | 0 | 1 | 4.27 | 3.62 | 1.44 | 2.64 | 3.33 | 4.50 | 39.42 | ▇▁▁▁▁ |
V30 | 0 | 1 | -2.83 | 1.62 | -7.81 | -3.25 | -2.78 | -1.62 | 1.28 | ▂▁▇▇▁ |
V31 | 0 | 1 | -0.43 | 0.22 | -1.37 | -0.53 | -0.38 | -0.28 | 0.00 | ▁▁▂▇▃ |
V32 | 0 | 1 | 1.82 | 0.64 | 0.00 | 1.22 | 1.95 | 2.17 | 2.98 | ▁▅▂▇▂ |
V33 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.87 | ▇▁▁▁▁ |
V34 | 0 | 1 | 1.82 | 0.64 | 0.00 | 1.21 | 1.95 | 2.17 | 2.98 | ▁▅▂▇▂ |
V35 | 0 | 1 | 43.09 | 95.17 | 3.00 | 12.00 | 23.00 | 39.00 | 1695.00 | ▇▁▁▁▁ |
V36 | 0 | 1 | 2432.69 | 5219.38 | 360.00 | 720.00 | 1350.00 | 2160.00 | 95310.00 | ▇▁▁▁▁ |
V37 | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.02 | ▅▁▇▁▁ |
V38 | 0 | 1 | 31.24 | 31.58 | 5.05 | 13.45 | 23.63 | 37.76 | 441.23 | ▇▁▁▁▁ |
V39 | 0 | 1 | 91.19 | 21.98 | 64.00 | 78.00 | 82.00 | 99.00 | 143.00 | ▆▇▃▁▃ |
V40 | 0 | 1 | 60.55 | 13.84 | 39.00 | 50.00 | 55.00 | 67.00 | 86.00 | ▂▇▅▂▃ |
V41 | 0 | 1 | 933.93 | 1001.68 | 0.00 | 450.00 | 685.42 | 1053.42 | 11949.33 | ▇▁▁▁▁ |
V42 | 0 | 1 | 427.57 | 715.39 | 0.00 | 180.00 | 270.00 | 460.98 | 11500.00 | ▇▁▁▁▁ |
V43 | 0 | 1 | 255.44 | 534.31 | 0.00 | 90.80 | 161.65 | 265.51 | 9593.48 | ▇▁▁▁▁ |
V44 | 0 | 1 | 106.11 | 135.62 | 0.00 | 50.12 | 73.85 | 125.81 | 1748.13 | ▇▁▁▁▁ |
V45 | 0 | 1 | 5.01 | 5.03 | 0.00 | 2.37 | 3.85 | 6.32 | 76.63 | ▇▁▁▁▁ |
V46 | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
V47 | 0 | 1 | 7985.72 | 6854.50 | 2051.50 | 3760.57 | 5509.43 | 9521.93 | 55128.46 | ▇▁▁▁▁ |
V48 | 0 | 1 | 61.69 | 10.41 | 35.95 | 65.72 | 65.93 | 66.13 | 66.45 | ▂▁▁▁▇ |
V49 | 0 | 1 | 8.12 | 2.91 | 5.81 | 6.34 | 7.22 | 7.84 | 15.44 | ▇▂▁▁▂ |
After inspecting the distribution in more detail, we notice the following:
- The target variable oilspill is highly imbalanced: there are 896 observations without an oil spill and only 41 with one
- The feature V23 is constant (always 0), so we can remove it
- The feature V33 has only 4 non-zero values, so we can drop it as well, as we do not expect a machine learning algorithm to learn much from it
- The features are not on a common scale. Because we will only use tree-based learners in this exercise, this is not a problem
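Observations like the constant column can also be found programmatically; a small sketch, assuming the `oil` data.table prepared above:

```r
# Count distinct values per column to find constant or near-constant features
n_distinct = sapply(oil, function(x) length(unique(x)))
names(n_distinct)[n_distinct == 1]  # constant columns such as V23
sort(n_distinct)[1:5]               # near-constant candidates such as V33
```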
```r
oil$V23 = NULL
oil$V33 = NULL
```
1 Benchmarking standard algorithms
We will start by comparing two standard ML algorithms - a classification tree and a random forest - without taking the imbalanced class distribution into account. Inspecting some standard measures will reveal problems that will be addressed in the subsequent exercises.
Start by creating a classification task with "oilspill"
as the target variable and all other variables as features (except the ones we removed earlier). Set “yes” as the positive class and stratify with respect to the target variable to ensure the same class distribution in each fold.
Then, compare a classification tree with a random forest with respect to their accuracy, FPR, and TPR. As a validation strategy we use stratified 3-fold cross-validation, repeated 5 times. We use only 3 folds because we have very few positive labels, and we repeat the cross-validation because the dataset is small.
Inspect the results and answer whether accuracy is a good metric for this problem.
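Before running the benchmark, it is worth noting what a trivial majority-class classifier would already achieve; using the class counts from the summary above:

```r
# Accuracy of always predicting "no" (the majority class)
n_no  = 896
n_yes = 41
n_no / (n_no + n_yes)  # about 0.956, while detecting zero spills
```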
Recap: Stratification
Stratification consists of dividing the population into subsets (called strata), within each of which an independent sample is selected. When setting the column role "stratum", resamplings that are applied to the task will automatically take the stratum into account.

For binary classification problems, the column role "positive" is important, because metrics like the TPR and FPR can only be interpreted when knowing which class is defined as the positive class.
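The effect of the "stratum" role can be illustrated with a base-R sketch (not mlr3 code) that assigns fold ids separately within each class, so every fold keeps the overall class proportions:

```r
set.seed(1)
y = factor(rep(c("no", "yes"), times = c(896, 41)))

# Assign fold ids 1..3 within each class separately
fold = integer(length(y))
for (cl in levels(y)) {
  idx = which(y == cl)
  fold[idx] = sample(rep_len(1:3, length(idx)))
}
table(y, fold)  # roughly 299 "no" and 13-14 "yes" per fold
```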
Hint 1:
The function as_task_classif() can help to create a classification task from the dataset. You can set the positive class and the stratification role by changing task$positive and task$col_roles$stratum after creating the task. Then, create the benchmark design using benchmark_grid() and execute it using benchmark(). To calculate the performance metrics, use the method $aggregate(), which takes a list of measures that can be constructed using msrs().
Hint 2:
```r
library(mlr3verse)

task = as_task_classif(...)
task$col_roles$stratum = ...
task$positive = ...

learners = lrns(...)
design = benchmark_grid(...)
bmr = benchmark(...)
bmr$aggregate(...)
```
2 Selecting a suitable performance metric
When selecting a suitable performance metric, we have to take the properties of the task into consideration. The detection of a spill requires mobilizing an expensive response, while missing an event is equally expensive, causing damage to the environment. Therefore, both class labels are important.
For that reason we want to select a measure that is insensitive to changes in the class distribution in the test data. Can you modify the definition of the accuracy so that it does not depend on the distribution of the target variable? Compare this new metric with the standard accuracy used in the previous exercise.
Hint 1:
The accuracy can be defined as \[ACC = P(\hat{Y} = 1 | Y = 1) \times P(Y = 1) + P(\hat{Y} = 0 | Y = 0) \times P(Y = 0)\] Note that \(TPR = P(\hat{Y} = 1 | Y = 1)\) is the true positive rate and \(TNR = P(\hat{Y} = 0 | Y = 0)\) the true negative rate, where TNR = 1 - FPR (false positive rate). Hence, the accuracy can be viewed as a weighted average of TPR and TNR, where \(P(Y = 1)\) (and \(P(Y = 0) = 1 - P(Y = 1)\)) are used as the corresponding weights. The new metric should not depend on the class distribution \(P(Y = 1)\) (and \(P(Y = 0) = 1 - P(Y = 1)\)), i.e., we can equally weight the TPR and TNR to obtain a metric that does not take into account the class distribution. The resulting metric is known as the balanced accuracy: \(BACC = 0.5 \cdot TPR + 0.5 \cdot TNR = 0.5 \cdot TPR + 0.5 \cdot (1-FPR)\)
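As a quick numeric illustration (the confusion-matrix counts below are made up), accuracy and balanced accuracy can disagree sharply on imbalanced data:

```r
# Hypothetical test-set counts: many negatives, few positives
tn = 290; fp = 10; fn = 8; tp = 2

acc  = (tp + tn) / (tp + tn + fp + fn)
tpr  = tp / (tp + fn)  # true positive rate
tnr  = tn / (tn + fp)  # true negative rate
bacc = 0.5 * tpr + 0.5 * tnr

round(c(accuracy = acc, balanced_accuracy = bacc), 3)
# accuracy is about 0.94, but balanced accuracy only about 0.58
```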
Hint 2:
Read msr("classif.bacc")$help().
3 Upsampling the minority class
Although we have selected a performance metric (balanced accuracy) that is insensitive to the class distribution in the test data, we have considerably fewer positive than negative observations in the training data. This will make the random forest focus on the latter (no spill).
To address that, create a machine learning pipeline that first upsamples the minority class by a factor of two and then fits a random forest. Add it to the benchmark result and compare the models with respect to the balanced accuracy.
Recap: Upsampling
Upsampling is a procedure where additional minority-class observations (duplicates or synthetically generated points) are injected into the dataset.
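In its simplest form (random oversampling, as opposed to synthetic methods like SMOTE), upsampling just duplicates randomly drawn minority rows; a toy base-R sketch:

```r
set.seed(1)
dat = data.frame(x = rnorm(10), y = rep(c("no", "yes"), times = c(8, 2)))

# Duplicate randomly drawn minority rows to double their count (ratio = 2)
minority = which(dat$y == "yes")
extra    = sample(minority, size = length(minority), replace = TRUE)
dat_up   = rbind(dat, dat[extra, ])

table(dat_up$y)  # "yes" doubled from 2 to 4, "no" unchanged
```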
Hint 1:
Use po("classbalancing") and combine it with the learner. PipeOps can be chained to a graph using %>>%. You can convert a graph to a learner using as_learner(). Remember to use the instantiated resampling from the previous benchmark experiment. A new resample result can be added to a benchmark result using c(...).
Hint 2:
```r
graph = po(
  "classbalancing",
  ratio = ...,
  reference = ...,
  adjust = ...
) %>>% lrn(...)

resampling = design$resampling[[1L]]

learner_balanced = as_learner(...)

rr = resample(...)

bmr = c(...)

autoplot(bmr, measure = ...)
```
4 Additional downsampling of the majority class
Repeat the previous experiment, but this time not only upsample the minority class, but also downsample the majority class to the same count.
Recap: Downsampling
Downsampling is a mechanism that reduces the number of training samples in the majority class.
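The corresponding base-R sketch for random downsampling keeps only a subset of the majority rows (here down to the minority count):

```r
set.seed(1)
dat = data.frame(x = rnorm(10), y = rep(c("no", "yes"), times = c(8, 2)))

# Keep only as many randomly chosen majority rows as there are minority rows
keep_no  = sample(which(dat$y == "no"), size = sum(dat$y == "yes"))
dat_down = dat[c(keep_no, which(dat$y == "yes")), ]

table(dat_down$y)  # both classes now have 2 observations
```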
Hint 1:
Adjust the pipeline from the previous exercise. Have a look at what the adjust parameter of po("classbalancing") can be set to.
Hint 2:
```r
graph = po(
  "classbalancing",
  ratio = ...,
  reference = ...,
  adjust = ...
) %>>%
  ...

learner_balanced_all = as_learner(graph)
# Set an id so that you can distinguish the pipelines when plotting the results
learner_balanced_all$id = ...
...
```
5 Instance-specific weights
Add another logistic regression learner to the benchmark that uses instance-specific weights, assigning each observation in the minority class double the weight of one in the majority class. Further, add a simple (unweighted) logistic regression learner to assess the additional performance difference due to the instance-specific weights.
Hint 1:
Use a graph that contains po("classweights") to specify the weights before the model training.
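Outside of the pipeline, the same weighting idea can be sketched with a weighted logistic regression in base R (toy data; the 2:1 weights match the scheme described above):

```r
set.seed(1)
dat = data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(90, 10)))
)

# Minority ("yes") observations get twice the weight of majority ones
w = ifelse(dat$y == "yes", 2, 1)

fit = glm(y ~ x, data = dat, family = binomial(), weights = w)
summary(fit)$coefficients
```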
Bonus exercise: Tuning the sampling rate
In the previous exercise we set the ratio to 2. See if tuning this value improves the result. Construct an auto_tuner with a 3-fold inner CV and a grid search over a suitable range for the ratio.
Hint 1:
```r
graph = po("classbalancing", reference = "minor", adjust = "all") %>>%
  lrn("classif.ranger")
learner_balanced_tuned = as_learner(graph)

search_space = ps(classbalancing.ratio = ...)

balanced_autotuner = auto_tuner(
  tuner = tnr(..., resolution = ...),
  learner = ...,
  resampling = rsmp(...),
  measure = msr(...),
  search_space = search_space
)

rr = resample(task, balanced_autotuner, resampling)

bmr = c(bmr, rr)

autoplot(bmr, measure = msr("classif.bacc"))
```
Summary
In this exercise we addressed the problem of imbalanced class distributions in classification. We saw that standard metrics like accuracy can be misleading for such problems and learned how accuracy can be modified into the balanced accuracy. We then learned how to change the training distribution, using up- and downsampling, to improve the results. Finally, we got a better understanding of the importance of stratification when there are only very few positive labels.

A similar use case can be found here: https://mlr-org.com/gallery/2020-03-30-imbalanced-data/