library(mlr3verse)
library(mlr3tuning)
library(mlr3oml)
set.seed(12345)
Goal
Our goal for this homework is to learn the basics of imputation within the mlr3 universe, specifically mlr3pipelines. Imputation is the process of filling in missing values in a data set using statistical methods such as the mean, median, or mode, or using predictive models.
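As a minimal illustration of the idea (on a small hypothetical data frame, not the bike data), mean imputation with a single PipeOp could look like this:

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical toy data with one missing value in x
d = data.frame(x = c(1, 2, NA, 5), y = c(0, 1, 0, 1))
task = as_task_regr(d, target = "y")

# PipeOpImputeMean replaces NAs with the column mean learned from the data
imputed = po("imputemean")$train(list(task))[[1]]
imputed$data()$x  # the NA is replaced by mean(c(1, 2, 5))
```

Model-based imputation, as used below, follows the same pattern but fits a learner to predict the missing feature from the other features.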
Required packages
We will use mlr3 for machine learning and mlr3oml for data access from OpenML:
Data: Washington bike rentals
We will use bike sharing data for 731 days in Washington, D.C., where the target variable is "rentals".
Let’s load the data and remove an unwanted column:
bikes = as.data.frame(odt(id = 45103)$data)[, -c(10)]
bikes$rentals = as.numeric(as.character(bikes$rentals))
Further, we artificially generate missing data entries for the feature temp:
rows <- sample(nrow(bikes), 300, replace = FALSE)
bikes[rows, "temp"] <- NA
Compare different learners
In this exercise, we want to compare the performance of two learners that are used to impute the missing values in temp: a linear model (LM) and a k-nearest neighbors model (kNN), for which we want to tune the hyperparameter k. We benchmark the performance of a pipeline that connects each imputation method with a random forest learner.
Construct a pipeline graph
First, we need a pipeline that contains both imputation methods as alternatives, effectively treating them as a hyperparameter. This is then connected to the random forest learner. Define and plot the appropriate graph object.
Hint 1:
Expressing two competing imputation methods in a graph can be done with branching; see ?mlr_pipeops_branch for details.
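One possible sketch of such a graph, assuming model-based imputation via po("imputelearner") and the ids "impute_lm" and "impute_knn" (these names are my choice, not prescribed by the exercise):

```r
library(mlr3verse)

# Branch between two model-based imputation PipeOps,
# then merge the paths again and attach a random forest
graph = po("branch", options = c("lm", "knn")) %>>%
  gunion(list(
    po("imputelearner", lrn("regr.lm"), id = "impute_lm"),
    po("imputelearner", lrn("regr.kknn"), id = "impute_knn")
  )) %>>%
  po("unbranch", options = c("lm", "knn")) %>>%
  lrn("regr.ranger")

graph$plot()
```

The branch/unbranch pair ensures that only one imputation path is active at a time, which is what lets us treat the choice of method as a tunable hyperparameter.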
Tunable HPs
We want to tune two hyperparameters in the pipeline: 1) the imputation method, as represented by the branch in the graph, and 2) the k parameter of the kNN-based imputation method, over values from 1 to 8. Define an appropriate search space.
Hint 1:
Remember that a graph can be treated as any other learner, and therefore, its parameter set can be accessed correspondingly. This means you can find the relevant parameter names in the correct field of the graph object.
Hint 2:
A parameter space can be defined using the ps()
sugar function.
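A sketch of such a search space. The exact parameter ids depend on the ids chosen in your graph (inspect graph$param_set$ids() to find them); here I assume a branch PipeOp with id "branch" and a kNN imputation PipeOp with id "impute_knn" wrapping regr.kknn, as in the earlier sketch:

```r
library(paradox)

search_space = ps(
  # which imputation branch to take
  branch.selection = p_fct(c("lm", "knn")),
  # k of the kNN imputer, only relevant when the kNN branch is selected
  impute_knn.regr.kknn.k = p_int(lower = 1, upper = 8,
    depends = branch.selection == "knn")
)
```

The `depends` clause is optional but avoids evaluating k values for configurations where the linear model branch is active.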
Tune the pipeline and visualize results
Create a task for the bike rental data and tune a graph learner with grid search over the defined search space, using 4-fold CV repeated 3 times, MSE as performance measure and no terminator. Visualize and interpret the results.
Hint 1:
task = ...
instance = tune(...)
autoplot(...)
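One way to fill in this skeleton, assuming the graph and search_space objects from the previous exercises. With grid search, tuning terminates on its own once the grid is exhausted, so no explicit terminator is needed:

```r
library(mlr3verse)
library(mlr3tuning)

# Regression task on the bike data with target "rentals"
task = as_task_regr(bikes, target = "rentals", id = "bikes")

instance = tune(
  tuner = tnr("grid_search"),
  task = task,
  learner = as_learner(graph),
  resampling = rsmp("repeated_cv", folds = 4, repeats = 3),
  measures = msr("regr.mse"),
  search_space = search_space
)

autoplot(instance)
```

When interpreting the plot, compare the MSE across the two branches and across values of k; differences between the imputation methods are often small relative to the resampling variance, so look at the spread, not just the best point.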