Imputation

Learn the basics of imputation (i.e. filling in missing data) with mlr3pipelines.


Goal

Our goal for this exercise sheet is to learn the basics of imputation within the mlr3 universe, specifically mlr3pipelines. Imputation is the process of filling in missing data in a data set using statistical methods like the mean, median, or mode, or using predictive models.

Required packages

We will use mlr3verse for machine learning, mlr3tuning for tuning, and mlr3oml for data access from OpenML:

library(mlr3verse)
library(mlr3tuning)
library(mlr3oml)
set.seed(12345)

Data: Miami house prices

We will use house price data on 13,932 single-family homes sold in Miami in 2016. The target variable is "SALE_PRC".

Let’s load the data and remove an unwanted column:

miami = as.data.frame(odt(id = 43093)$data[,-c(3)])
miami[1:16] = lapply(miami[1:16], as.numeric)
miami[,c(14,16)] = lapply(miami[,c(14,16)], as.factor)

Further, we artificially generate missing data entries for three features:

indices = which(miami$age > 50)

for (i in c("OCEAN_DIST", "TOT_LVG_AREA", "structure_quality")) {
  sample_indices <- sample(indices, 2000, replace = FALSE)
  miami[sample_indices, i] <- NA
}
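As a quick sanity check (a sketch assuming `miami` was prepared as above), we can count the missing values in the affected columns:

```r
# Each of the three manipulated columns should now contain 2000 NAs
colSums(is.na(miami[, c("OCEAN_DIST", "TOT_LVG_AREA", "structure_quality")]))
```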

1 Create simple imputation PipeOps

Imputation can be executed via standard pipeline workflows using PipeOp objects. You can get an overview of the relevant options with ?PipeOpImpute, the abstract base class for feature imputation. Create a PipeOp that imputes numerical features by randomly sampling from the non-missing values, and another PipeOp that imputes factor/categorical (including ordinal) features by out-of-range imputation. The latter introduces a new level ".MISSING" for missing values.

Hint 1:

You can set up a PipeOp with the po() function and use the affect_columns argument to specify the columns to which the preprocessing should be applied (see ?PipeOpImpute for how to use the affect_columns argument). There is a shortcut for imputation by randomly sampling from the non-missing values, imputesample (see ?PipeOpImputeSample), and one for out-of-range imputation, imputeoor (see ?PipeOpImputeOOR).

Hint 2:
impute_numeric = po("...", affect_columns = selector_type("..."))
impute_factor = po("...", affect_columns = ...(c("factor", "ordered")))
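One possible way to fill in this template (a sketch, using the shortcut keys imputesample and imputeoor named in Hint 1):

```r
library(mlr3verse)

# Impute numeric features by sampling from the non-missing values
impute_numeric = po("imputesample", affect_columns = selector_type("numeric"))

# Impute factor and ordered features out of range (new level ".MISSING")
impute_factor = po("imputeoor", affect_columns = selector_type(c("factor", "ordered")))
```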

2 Create and plot a graph

Combine both imputation PipeOps with a random forest learning algorithm into a Graph. Then, plot the graph.

Hint 1:

Create a random forest learner using lrn(). You can concatenate different pre-processing steps and a learner using the %>>% operator.

Hint 2:

You can plot a graph using the corresponding R6 method of the graph object.
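A sketch of such a graph, assuming the imputation PipeOps from the previous exercise and regr.ranger as the random forest learner:

```r
library(mlr3verse)

impute_numeric = po("imputesample", affect_columns = selector_type("numeric"))
impute_factor = po("imputeoor", affect_columns = selector_type(c("factor", "ordered")))

# Chain the imputation steps and a random forest into a Graph
graph = impute_numeric %>>% impute_factor %>>% lrn("regr.ranger")

# Plot the graph structure with the corresponding R6 method
graph$plot()
```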

Simple Imputation

As an alternative to a pipeline that includes a learner, we can set up a simpler pipeline that only imputes missing data and apply it to a data set. To do so, first define a pipeline with only the imputation steps from above, create a task for the miami data, and use the $train() method to impute the missing entries. Then, inspect the imputed data set with ...[[1]]$head().
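A possible sketch of this imputation-only pipeline (assuming `miami` as prepared above; the object names are placeholders):

```r
library(mlr3verse)

# Pipeline containing only the two imputation steps
graph_impute = po("imputesample", affect_columns = selector_type("numeric")) %>>%
  po("imputeoor", affect_columns = selector_type(c("factor", "ordered")))

# Regression task with SALE_PRC as target
task = as_task_regr(miami, target = "SALE_PRC")

# $train() returns a list; its first element holds the imputed task
imputed = graph_impute$train(task)
imputed[[1]]$head()
```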

Assessing Performance

Use 3-fold cross-validation to estimate the error of the first pipeline (the one that contains a random forest learner) stored in the graph.

Hint 1:

Specifically, you need three things:

  1. Create a Resampling object using rsmp() and instantiate the train-test splits on the task.
  2. Pass this object, the task, and the graph learner specified above to the resample() function.
  3. Measure the performance with $aggregate().
Hint 2:
resampling = rsmp("cv", ...)
resampling$instantiate(...)
rr = resample(task = ..., learner = ..., resampling = ...)
rr$...()
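Filled in, this could look as follows (a sketch assuming the graph from Exercise 2 and the task for the miami data; regr.mse is one reasonable error measure for this regression task):

```r
library(mlr3verse)

# Wrap the imputation + random forest graph as a learner
glrn = as_learner(graph)

# 3-fold CV, instantiated on the task
resampling = rsmp("cv", folds = 3)
resampling$instantiate(task)

rr = resample(task = task, learner = glrn, resampling = resampling)
rr$aggregate(msr("regr.mse"))
```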

3 Model-based imputation

We can use a learner to impute missing values, which works by learning a model that treats the feature to-be-imputed as target and the other features as covariates. This has to be done separately for each feature that we impute. Obviously, the performance of learner-based imputation can depend on the type of learner used. Set up two distinct pipelines, modifying the pipeline from the previous exercise. Now, for numeric features, use learner-based imputation, using a linear model for the first and a decision tree for the second pipeline.

Hint 1:

You can learn about the mechanics of using learners for imputation in ?mlr_pipeops_imputelearner.

Hint 2:

As the documentation states, if a learner used for imputation is itself supposed to train on features containing missing data, it needs to be able to handle missing data natively. Otherwise, it needs its own imputation, which requires a more complicated pipeline. In this case, use histogram-based imputation within the learner-based imputation. Similarly, if categorical features are to be imputed, they must be imputed before the numeric features.
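The two pipelines could be sketched as follows (assuming regr.lm and regr.rpart as the imputation learners and regr.ranger as the final learner):

```r
library(mlr3verse)

# Categorical features are imputed first (out of range), so the
# imputation learners below can train on complete covariates
impute_factor = po("imputeoor", affect_columns = selector_type(c("factor", "ordered")))

# A linear model cannot handle missing covariates natively, so it gets
# histogram-based imputation inside the imputation learner
impute_lm = po("imputelearner",
  learner = po("imputehist") %>>% lrn("regr.lm"),
  affect_columns = selector_type("numeric"))

# rpart handles missing values natively (surrogate splits)
impute_tree = po("imputelearner",
  learner = lrn("regr.rpart"),
  affect_columns = selector_type("numeric"))

graph_lm = impute_factor %>>% impute_lm %>>% lrn("regr.ranger")
graph_tree = impute_factor %>>% impute_tree %>>% lrn("regr.ranger")
```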

Assessing Performance

As before, use 3-fold cross-validation to compare the error of the two pipelines to identify which learner seems to work best for imputation for this data set.

4 Branches in pipelines

Pipelines can become very complex. Within a pipeline, we might want to know which imputation method works best. An elegant way to find out is to treat the imputation method as just another hyperparameter that we tune alongside the other hyperparameters of the pipeline. This can be done using branching. Set up a graph that contains the following elements:

  1. A branch with two different imputation methods: a) histogram-based and b) learner-based using a decision tree.
  2. A random forest fit on the (fully imputed) data.

Hint 1:

You can read more about branching in ?mlr_pipeops_branch.
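One possible branched graph (a sketch; the branch labels and PipeOp configuration are placeholders, and factors are imputed out of range before branching so both paths work on complete categorical features):

```r
library(mlr3verse)

graph_branch =
  po("imputeoor", affect_columns = selector_type(c("factor", "ordered"))) %>>%
  # branch into two alternative imputation paths for numeric features
  po("branch", options = c("hist", "learner")) %>>%
  gunion(list(
    po("imputehist"),
    po("imputelearner", learner = lrn("regr.rpart"))
  )) %>>%
  po("unbranch", options = c("hist", "learner")) %>>%
  lrn("regr.ranger")
```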

Define a search space

We want to tune a number of hyperparameters in the pipeline:

  1. The mtry parameter of the random forest, between 2 and 8.
  2. The imputation method, as represented by our graph.
  3. The maxdepth parameter of the decision tree-based imputation, between 1 and 30.

Hint 1:

Remember that a graph can be treated as any other learner, and therefore, its parameter set can be accessed correspondingly. This means you can find the relevant parameter names in the correct field of the graph object.

Hint 2:

A parameter space can be defined using the ps() sugar function.
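A sketch of such a search space (the parameter IDs below are assumptions based on default PipeOp ids; check the $param_set$ids() of your graph learner for the exact names):

```r
library(mlr3verse)

search_space = ps(
  # mtry of the random forest
  regr.ranger.mtry = p_int(2, 8),
  # which imputation branch to take
  branch.selection = p_fct(c("hist", "learner")),
  # maxdepth of the tree-based imputation; only relevant on the "learner" branch
  imputelearner.maxdepth = p_int(1, 30, depends = branch.selection == "learner")
)
```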

Tuning the pipeline

Now, tune the pipeline using an AutoTuner with 3-fold CV and random search. You can terminate after 10 evaluations to reduce run time. Then, display the optimal hyperparameter set as chosen by the tuner based on the mean squared error.

Hint 1:
# AutoTuner
glrn_tuned = AutoTuner$new(...)
# Train
...
# Optimal HP set
...
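Filled in, the AutoTuner could look like this (a sketch; `graph_branch` and `search_space` stand for the branched graph and search space constructed in the previous steps, and regr.mse is the assumed measure):

```r
library(mlr3verse)
library(mlr3tuning)

glrn = as_learner(graph_branch)

# AutoTuner: 3-fold CV, random search, stop after 10 evaluations
glrn_tuned = AutoTuner$new(
  learner = glrn,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.mse"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 10),
  tuner = tnr("random_search")
)

# Train
glrn_tuned$train(task)

# Optimal HP set
glrn_tuned$tuning_result
```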