library(mlr3verse)
library(mlr3tuning)
library(mlr3oml)
set.seed(12345)
Goal
Our goal for this exercise sheet is to learn the basics of imputation within the mlr3 universe, specifically mlr3pipelines. Imputation is the process of filling in missing data in a data set using statistical methods (e.g., mean, median, or mode) or predictive models.
Required packages
We will use mlr3 for machine learning and mlr3oml for data access from OpenML:
Data: Miami house prices
We will use house price data on 13,932 single-family homes sold in Miami in 2016. The target variable is "SALE_PRC".
Let's load the data and remove an unwanted column:
miami = as.data.frame(odt(id = 43093)$data[, -c(3)])
miami[, 1:16] = lapply(miami[1:16], as.numeric)
miami[, c(14, 16)] = lapply(miami[, c(14, 16)], as.factor)
Further, we artificially generate missing data entries for three features:
indices = which(miami$age > 50)
for (i in c("OCEAN_DIST", "TOT_LVG_AREA", "structure_quality")) {
  sample_indices <- sample(indices, 2000, replace = FALSE)
  miami[sample_indices, i] <- NA
}
1 Create simple imputation PipeOps
Imputation can be executed via standard pipeline workflows using PipeOp objects. You can get an overview of the relevant options with ?PipeOpImpute, the abstract base class for feature imputation. Create a PipeOp that imputes numerical features by randomly sampling feature values from the non-missing values, and another PipeOp that imputes factor/categorical (including ordinal) features by out-of-range imputation. The latter introduces a new level ".MISSING" for missings.
Hint 1:
You can set up a PipeOp with the po() function and use the affect_columns argument to address the columns to which the preprocessing should be applied (see also ?PipeOpImpute for how to use the affect_columns argument). There is a shortcut for imputation by randomly sampling feature values from the non-missing values, imputesample (see also ?PipeOpImputeSample), and one for out-of-range imputation, imputeoor (see also ?PipeOpImputeOOR).
Hint 2:
impute_numeric = po("...", affect_columns = selector_type("..."))
impute_factor = po("...", affect_columns = ...(c("factor", "ordered")))
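Putting the hints together, one possible completion looks like this (a sketch; the selector types for the numeric columns are an assumption based on the data preparation above):

```r
# Sample-based imputation for numeric features, out-of-range
# imputation for factor and ordered features.
impute_numeric = po("imputesample",
  affect_columns = selector_type(c("integer", "numeric")))
impute_factor = po("imputeoor",
  affect_columns = selector_type(c("factor", "ordered")))
```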
2 Create and plot a graph
Combine both imputation PipeOps with a random forest learning algorithm into a Graph. Then, plot the graph.
Hint 1:
Create a random forest learner using lrn(). You can concatenate different pre-processing steps and a learner using the %>>% operator.
Hint 2:
You can plot a graph using the corresponding R6 method of the graph object.
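Following the hints, the graph could be built and plotted as follows (a sketch; it repeats the two imputation PipeOps from exercise 1 and assumes the regr.ranger random forest learner from the ranger package):

```r
impute_numeric = po("imputesample",
  affect_columns = selector_type(c("integer", "numeric")))
impute_factor = po("imputeoor",
  affect_columns = selector_type(c("factor", "ordered")))
# Concatenate imputation steps and learner into a Graph:
graph = impute_numeric %>>% impute_factor %>>% lrn("regr.ranger")
graph$plot()
```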
Simple Imputation
As an alternative to a pipeline that includes a learner, we can set up a simpler pipeline that only creates imputations for missing data and apply it to a data set. For this, first define a pipeline with only the imputation steps from above, create a task for the miami data, and use the $train() method to impute the missing rows. Then, inspect the imputed data set with ...[[1]]$head().
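A sketch of such an imputation-only pipeline (assuming the PipeOps from exercise 1 and "SALE_PRC" as the regression target):

```r
imputer = po("imputesample",
    affect_columns = selector_type(c("integer", "numeric"))) %>>%
  po("imputeoor", affect_columns = selector_type(c("factor", "ordered")))
task = as_task_regr(miami, target = "SALE_PRC")
# Graph$train() returns a list of outputs; the first element is the imputed task:
imputed = imputer$train(task)
imputed[[1]]$head()
```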
Assessing Performance
Use 3-fold cross-validation to estimate the error of the first pipeline (the one that contains a random forest learner) stored in the graph.
Hint 1:
Specifically, you need three things:
- A Resampling object created with rsmp(); instantiate the train-test splits on the task.
- Use this object together with the task and the graph learner specified above as input to the resample() function.
- Measure the performance with $aggregate().
Hint 2:
resampling = rsmp("cv", ...)
resampling$instantiate(...)
rr = resample(task = ..., learner = ..., resampling = ...)
rr$...()
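Filled in, the resampling could look as follows (a sketch; graph is assumed to be the pipeline with the random forest from exercise 2, and task the regression task on the miami data):

```r
glrn = as_learner(graph)  # wrap the graph as a learner
resampling = rsmp("cv", folds = 3)
resampling$instantiate(task)
rr = resample(task = task, learner = glrn, resampling = resampling)
rr$aggregate(msr("regr.mse"))
```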
3 Model-based imputation
We can use a learner to impute missing values: it learns a model that treats the feature to be imputed as the target and the other features as covariates. This has to be done separately for each feature we impute. Obviously, the performance of learner-based imputation can depend on the type of learner used. Set up two distinct pipelines by modifying the pipeline from the previous exercise: for numeric features, use learner-based imputation, with a linear model in the first pipeline and a decision tree in the second.
Hint 1:
You can learn about the mechanics of using learners for imputation in ?mlr_pipeops_imputelearner
.
Hint 2:
As the documentation states, if a learner used for imputation is itself supposed to train on features containing missing data, it needs to be able to handle missing data natively. Otherwise, it needs its own imputation, requiring a more complicated pipeline. In this case, use histogram-based imputation within the learner-based imputation. Similarly, if categorical features are to be imputed, they need to be imputed before the numeric features.
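One way to realize the two pipelines (a sketch; the factor imputation is placed first, and the linear model is wrapped with histogram-based imputation because regr.lm cannot handle missing values, while regr.rpart can):

```r
impute_factor = po("imputeoor",
  affect_columns = selector_type(c("factor", "ordered")))
# The linear model needs its own (histogram-based) imputation:
impute_lm = po("imputelearner",
  po("imputehist") %>>% lrn("regr.lm"),
  affect_columns = selector_type(c("integer", "numeric")))
# Decision trees handle missing values natively:
impute_rpart = po("imputelearner", lrn("regr.rpart"),
  affect_columns = selector_type(c("integer", "numeric")))
graph_lm = impute_factor %>>% impute_lm %>>% lrn("regr.ranger")
graph_rpart = impute_factor %>>% impute_rpart %>>% lrn("regr.ranger")
```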
Assessing Performance
As before, use 3-fold cross-validation to compare the error of the two pipelines to identify which learner seems to work best for imputation for this data set.
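The comparison can be run, for example, as a benchmark (a sketch, assuming graph_lm and graph_rpart are the two pipelines and task is the regression task on the miami data):

```r
design = benchmark_grid(
  tasks = task,
  learners = list(as_learner(graph_lm), as_learner(graph_rpart)),
  resamplings = rsmp("cv", folds = 3))
bmr = benchmark(design)
bmr$aggregate(msr("regr.mse"))  # lower MSE = better imputation learner
```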
4 Branches in pipelines
Pipelines can become very complex. Within a pipeline, we could be interested in which imputation method works best. An elegant way to find out is to treat the imputation method as just another hyperparameter that we tune alongside the other hyperparameters of the pipeline. A way to do this is branching. Set up a graph that contains the following elements: 1. A branch with two different imputation methods: a) histogram-based and b) learner-based using a decision tree. 2. A random forest fit on the (fully imputed) data.
Hint 1:
You can read more about branching in ?mlr_pipeops_branch.
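A branched graph along these lines could be sketched as follows (the branch labels are arbitrary; the categorical features are imputed before the branch so that both paths only have to deal with numeric features):

```r
graph_branch = po("imputeoor",
    affect_columns = selector_type(c("factor", "ordered"))) %>>%
  po("branch", c("hist", "learner")) %>>%
  gunion(list(
    po("imputehist"),                          # path a) histogram-based
    po("imputelearner", lrn("regr.rpart"))     # path b) decision tree
  )) %>>%
  po("unbranch", c("hist", "learner")) %>>%
  lrn("regr.ranger")
```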
Define a search space
We want to tune a number of hyperparameters in the pipeline: 1) the mtry parameter of the random forest between 2 and 8, 2) the imputation method, as represented by our graph, and 3) the maxdepth parameter of the decision-tree-based imputation between 1 and 30.
Hint 1:
Remember that a graph can be treated as any other learner, and therefore, its parameter set can be accessed correspondingly. This means you can find the relevant parameter names in the correct field of the graph object.
Hint 2:
A parameter space can be defined using the ps() sugar function.
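Given a branched graph as above, the search space could be sketched like this (the parameter names are assumptions based on the default PipeOp and learner ids; check the param_set ids of your graph learner for the exact prefixes):

```r
# Inspect available names first, e.g.:
# as_learner(graph_branch)$param_set$ids()
search_space = ps(
  regr.ranger.mtry = p_int(2, 8),
  branch.selection = p_fct(c("hist", "learner")),
  imputelearner.maxdepth = p_int(1, 30,
    depends = branch.selection == "learner")  # only relevant on this path
)
```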
Tuning the pipeline
Now, tune the pipeline using an AutoTuner with 3-fold CV and random search. You can terminate after 10 evaluations to reduce run time. Then, display the optimal hyperparameter set as chosen by the tuner based on the mean squared error.
Hint 1:
# AutoTuner
glrn_tuned = AutoTuner$new(...)
# Train
...
# Optimal HP set
...
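Filled in, the tuning setup could look like this (a sketch; graph_branch, search_space, and task are assumed from the previous steps):

```r
glrn_tuned = AutoTuner$new(
  learner = as_learner(graph_branch),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.mse"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 10),
  tuner = tnr("random_search")
)
# Train
glrn_tuned$train(task)
# Optimal HP set
glrn_tuned$tuning_result
```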