Imputation Homework

Learn the basics of imputation (i.e. filling in missing data) with mlr3pipelines.

Goal

Our goal for this homework is to learn the basics of imputation within the mlr3 universe, specifically with mlr3pipelines. Imputation is the process of filling in missing values in a data set, using statistical methods such as the mean, median, or mode, or predictive models.

Required packages

We will use the mlr3 ecosystem (loaded via mlr3verse, with mlr3tuning for hyperparameter tuning) for machine learning and mlr3oml for data access from OpenML:

library(mlr3verse)
library(mlr3tuning)
library(mlr3oml)
set.seed(12345)

Data: Washington bike rentals

We will use bike sharing data covering 731 days in Washington, D.C., where the target variable is "rentals".

Let’s load the data and remove an unwanted column:

# download the data set from OpenML and drop the unwanted 10th column
bikes = as.data.frame(odt(id = 45103)$data)[, -10]
INFO  [13:06:23.427] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/45103`, authenticated: `FALSE`}
INFO  [13:06:24.003] Retrieving ARFF {url: `https://api.openml.org/data/v1/download/22112196/dailybike.arff`, authenticated: `FALSE`}
INFO  [13:06:24.276] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/45103`, authenticated: `FALSE`}
# ensure the target is numeric
bikes$rentals = as.numeric(as.character(bikes$rentals))

Next, we artificially introduce missing values into the feature temp:

# randomly select 300 rows and set their temp value to NA
rows = sample(nrow(bikes), 300, replace = FALSE)
bikes[rows, "temp"] = NA
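
A quick sanity check that the missing values were injected as intended:

# count the missing entries in temp (should be 300)
sum(is.na(bikes$temp))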

Compare different learners

In this exercise, we want to compare the performance of two learners used to impute the missing values in temp: a linear model (LM) and k-nearest neighbors (kNN), for which we want to tune the hyperparameter k. We benchmark the performance of a pipeline that connects each imputation method with a random forest learner.

Construct a pipeline graph

First, we need a pipeline that contains both imputation methods as alternatives, effectively treating the choice of method as a hyperparameter. The pipeline is then connected to the random forest learner. Define and plot the appropriate graph object.

Hint 1:

Expressing two competing imputation methods in a graph can be done with branching; see ?mlr_pipeops_branch for details.
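
A minimal sketch of how such a graph could look, assuming lrn("regr.lm") and lrn("regr.kknn") (requires the kknn package) as imputation learners and lrn("regr.ranger") as the random forest; the ids "imp_lm" and "imp_knn" are illustrative choices, not prescribed by the exercise:

# branch into two alternative imputation paths, then unify them again
graph = po("branch", options = c("lm", "knn")) %>>%
  gunion(list(
    po("imputelearner", learner = lrn("regr.lm"), id = "imp_lm"),    # LM-based imputation
    po("imputelearner", learner = lrn("regr.kknn"), id = "imp_knn")  # kNN-based imputation
  )) %>>%
  po("unbranch", options = c("lm", "knn")) %>>%
  lrn("regr.ranger")  # random forest as the final learner

graph$plot()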

Tunable HPs

We want to tune a number of hyperparameters in the pipeline: 1) the imputation method, as represented by the branch in the graph, and 2) the parameter k of the kNN-based imputation method, for values from 1 to 8. Define an appropriate search space.

Hint 1:

Remember that a graph can be treated like any other learner, and therefore its parameter set can be accessed in the same way. This means you can find the relevant parameter names in the corresponding field of the graph object.

Hint 2:

A search space can be defined using the ps() sugar function.
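
For illustration, a sketch of a possible search space, assuming the pipeop ids from the graph above ("branch", "imp_knn"); the exact parameter names can be verified via graph$param_set$ids():

search_space = ps(
  # which imputation branch to use
  branch.selection = p_fct(c("lm", "knn")),
  # k only matters when the kNN branch is selected
  imp_knn.k = p_int(1, 8, depends = branch.selection == "knn")
)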

Tune the pipeline and visualize results

Create a task for the bike rental data and tune a graph learner with grid search over the defined search space, using 4-fold CV repeated 3 times, MSE as the performance measure, and no terminator. Visualize and interpret the results.

Hint 1:
task = ...

instance = tune(...)

autoplot(...)
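
One possible way to fill in this skeleton, assuming the graph and search_space objects defined earlier; argument names follow current mlr3tuning (tune() with a tuner object), so consult ?mlr3tuning::tune for your installed version:

# regression task with "rentals" as target
task = as_task_regr(bikes, target = "rentals", id = "bikes")

# grid search needs no explicit terminator: it stops once the grid is exhausted
instance = tune(
  tuner = tnr("grid_search"),
  task = task,
  learner = as_learner(graph),
  resampling = rsmp("repeated_cv", folds = 4, repeats = 3),
  measures = msr("regr.mse"),
  search_space = search_space
)

# compare MSE across imputation methods and values of k
autoplot(instance)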