Imputation Homework

Learn the basics of imputation (i.e. filling in missing data) with mlr3pipelines.

Goal

Our goal for this homework is to learn the basics of imputation within the mlr3 universe, specifically with mlr3pipelines. Imputation is the process of filling in missing values in a data set, using statistical methods such as the mean, median, or mode, or predictive models.

Required packages

We will use the mlr3 ecosystem (loaded via mlr3verse, with mlr3tuning for hyperparameter tuning) for machine learning and mlr3oml for data access from OpenML:

library(mlr3verse)
library(mlr3tuning)
library(mlr3oml)
set.seed(12345)

Data: Washington bike rentals

We will use bike sharing data covering 731 days in Washington, D.C., where the target variable is "rentals".

Let’s load the data and remove an unwanted column:

# download the data set from OpenML and drop the unwanted 10th column
bikes = as.data.frame(odt(id = 45103)$data)[, -10]
INFO  [13:06:23.427] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/45103`, authenticated: `FALSE`}
INFO  [13:06:24.003] Retrieving ARFF {url: `https://api.openml.org/data/v1/download/22112196/dailybike.arff`, authenticated: `FALSE`}
INFO  [13:06:24.276] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/45103`, authenticated: `FALSE`}
# ensure the target is numeric
bikes$rentals = as.numeric(as.character(bikes$rentals))

Next, we artificially introduce missing values into the feature temp:

# randomly select 300 rows and set their temp value to NA
rows = sample(nrow(bikes), 300, replace = FALSE)
bikes[rows, "temp"] = NA
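
A quick sanity check that the missing values were injected as intended:

# count the missing entries in temp (should be 300)
sum(is.na(bikes$temp))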

Compare different learners

In this exercise, we want to compare the performance of two learners used to impute the missing values in temp: a linear model (LM) and k-nearest neighbors (kNN), for which we want to tune the hyperparameter k. We benchmark the performance of a pipeline that connects each imputation method with a random forest learner.

Construct a pipeline graph

First, we need a pipeline that contains both imputation methods as alternatives, effectively treating the choice of method as a hyperparameter. The pipeline is then connected to the random forest learner. Define and plot the appropriate graph object.

Hint 1:

Expressing two competing imputation methods in a graph can be done with branching; see ?mlr_pipeops_branch for details.
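
A minimal sketch of how such a graph could look, assuming lrn("regr.lm") and lrn("regr.kknn") (requires the kknn package) as imputation learners and lrn("regr.ranger") as the random forest; the ids "imp_lm" and "imp_knn" are illustrative choices, not prescribed by the exercise:

# branch into two alternative imputation paths, then unify them again
graph = po("branch", options = c("lm", "knn")) %>>%
  gunion(list(
    po("imputelearner", learner = lrn("regr.lm"), id = "imp_lm"),    # LM-based imputation
    po("imputelearner", learner = lrn("regr.kknn"), id = "imp_knn")  # kNN-based imputation
  )) %>>%
  po("unbranch", options = c("lm", "knn")) %>>%
  lrn("regr.ranger")  # random forest as the final learner

graph$plot()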

Tunable HPs

We want to tune a number of hyperparameters in the pipeline: 1) the imputation method, as represented by the branch in the graph, and 2) the parameter k of the kNN-based imputation method, for values from 1 to 8. Define an appropriate search space.

Hint 1:

Remember that a graph can be treated like any other learner, and therefore its parameter set can be accessed in the same way. This means you can find the relevant parameter names in the corresponding field of the graph object.

Hint 2:

A search space can be defined using the ps() sugar function.
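
For illustration, a sketch of a possible search space, assuming the pipeop ids from the graph above ("branch", "imp_knn"); the exact parameter names can be verified via graph$param_set$ids():

search_space = ps(
  # which imputation branch to use
  branch.selection = p_fct(c("lm", "knn")),
  # k only matters when the kNN branch is selected
  imp_knn.k = p_int(1, 8, depends = branch.selection == "knn")
)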

Tune the pipeline and visualize results

Create a task for the bike rental data and tune a graph learner with grid search over the defined search space, using 4-fold CV repeated 3 times, MSE as the performance measure, and no terminator. Visualize and interpret the results.

Hint 1:
task = ...

instance = tune(...)

autoplot(...)
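
One possible way to fill in this skeleton, assuming the graph and search_space objects defined earlier; argument names follow current mlr3tuning (tune() with a tuner object), so consult ?mlr3tuning::tune for your installed version:

# regression task with "rentals" as target
task = as_task_regr(bikes, target = "rentals", id = "bikes")

# grid search needs no explicit terminator: it stops once the grid is exhausted
instance = tune(
  tuner = tnr("grid_search"),
  task = task,
  learner = as_learner(graph),
  resampling = rsmp("repeated_cv", folds = 4, repeats = 3),
  measures = msr("regr.mse"),
  search_space = search_space
)

# compare MSE across imputation methods and values of k
autoplot(instance)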