library("mlr3verse")
library("data.table")
task = tsk("german_credit")
Goal
Learn how to rank features of a supervised task by their importance / strength of relationship with the target variable using a feature filter method.
German Credit Dataset
We create the task as for the resampling exercise: The German Credit Data set.
Exercises
Within the mlr3 ecosystem, feature filters are implemented in the mlr3filters package and are typically used in combination with mlr3pipelines, so that the whole preprocessing step can be included in a pipeline. In exercises 1 to 3, we apply feature filtering to preprocess the data of a task without using a pipeline. In exercise 4, we will set up a pipeline that combines a learner with feature filtering as a preprocessing step.
Exercise 1: Find a suitable Feature Filter
Make yourself familiar with the mlr3filters package (link). Which Filters are applicable to all feature types of the task we created above?
Hint:
Some filters are only applicable to either classification or regression, or to either numeric or categorical features. Therefore, we are looking for a Filter that is applicable to our classification task and that can be computed for integer and factor features (as these are the feature types present in the task, see task$feature_types).
The website linked above includes a table that provides detailed information for each Filter.
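One way to answer this programmatically is to subset the filter dictionary by the task type and feature types of our task. This is a sketch; it assumes the `as.data.table(mlr_filters)` overview contains list columns named `task_types` and `feature_types`, as in current mlr3filters releases — check `names(filters)` if your version differs.

```r
library("mlr3verse")
library("data.table")

task = tsk("german_credit")
task$feature_types  # the task has integer and factor features

# Overview of all registered filters (column names assumed as in
# current mlr3filters releases)
filters = as.data.table(mlr_filters)

# Keep filters that support classification tasks and can handle
# both integer and factor features
filters[
  sapply(task_types, function(x) "classif" %in% x) &
    sapply(feature_types, function(x) all(c("integer", "factor") %in% x)),
  key
]
```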
Exercise 2: Information Gain Filter
We now want to use the information_gain filter, which requires the FSelectorRcpp package to be installed. This filter quantifies the gain in information by considering the following difference: H(Target) + H(Feature) - H(Target, Feature)
Here, H(X) is the Shannon entropy of variable X and H(X, Y) is the joint Shannon entropy of the variables X and Y.
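To see what this difference measures, the information gain can be computed by hand on two toy factors in base R (a sketch; the natural logarithm is used here, and the choice of base only rescales the scores):

```r
# Information gain IG = H(Target) + H(Feature) - H(Target, Feature),
# computed by hand on two toy factors
entropy = function(x) {
  p = as.numeric(table(x)) / length(x)
  p = p[p > 0]  # 0 * log(0) is treated as 0
  -sum(p * log(p))
}

target  = factor(c("good", "good", "bad", "bad"))
feature = factor(c("a", "a", "b", "b"))  # perfectly predicts the target
joint   = interaction(target, feature)   # joint distribution of both

ig = entropy(target) + entropy(feature) - entropy(joint)
ig  # equals log(2): the feature removes all uncertainty about the target
```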
Create an information gain filter and compute the information gain for each feature.
Visualize the score for each feature and decide how many and which features to include.
Hint 1:
Use flt("information_gain") to create an information_gain filter and calculate the filter scores of the features. See ?mlr_filters_information_gain (or equivalently flt("information_gain")$help()) for more details on how to use a filter. If installing FSelectorRcpp does not work, you can use, e.g., flt("importance", learner = lrn("classif.rpart")), which uses the feature importance of a classif.rpart decision tree to rank the features.
For visualization, you can, for example, create a scree plot (similar to the one used in principal component analysis) that shows the filter score of each feature on the y-axis and the features on the x-axis.
Using a rule of thumb, e.g., the "elbow rule", you can determine the number of features to include.
Hint 2:
library(mlr3filters)
library(mlr3viz)
library(FSelectorRcpp)
filter = flt(...)
filter$calculate(...)
filter
autoplot(...)
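One possible way to fill in the blanks (a sketch, assuming FSelectorRcpp is installed; the autoplot() method for Filter objects is provided by mlr3viz):

```r
library(mlr3verse)     # attaches mlr3filters and mlr3viz
library(FSelectorRcpp)

task = tsk("german_credit")

# Compute the information gain score of every feature
filter = flt("information_gain")
filter$calculate(task)
filter                 # scores in decreasing order
as.data.table(filter)  # the same scores as a table

# Scree-plot-like visualization: features on the x-axis, scores on the
# y-axis; look for an "elbow" to decide how many features to keep
autoplot(filter)
```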
Exercise 3: Create and Apply a PipeOpFilter to a Task
Since the k-NN learner suffers from the curse of dimensionality, we want to set up a preprocessing PipeOp that subsets our set of features to the 5 most important ones according to the information gain filter (see flt("information_gain")$help()). In general, you can see a list of other possible filters by looking at the dictionary: as.data.table(mlr_filters). You can construct a PipeOp object with the po() function from the mlr3pipelines package. See mlr_pipeops$keys() for possible choices. Create a PipeOp that filters the features of the german_credit task and creates a new task containing only the 5 most important ones according to the information gain filter.
Hint 1:
- The filter can be created by flt("information_gain") (see also the help page flt("information_gain")$help()).
- In our case, we have to pass the "filter" key as the first argument of the po() function and the filter previously created with flt() to the filter argument of po() to construct a PipeOpFilter object that performs feature filtering (see also the code examples in the help page ?PipeOpFilter).
- The help page ?PipeOpFilter also reveals the parameters we can specify. For example, to select the 5 most important features, we can set filter.nfeat. This can be done using the param_vals argument of po() during construction, or by adding the parameter value to the param_set$values field of an already created PipeOpFilter object (see also the code examples in the help page).
- The created PipeOpFilter object can be applied to a Task object to create the filtered Task. To do so, we use the $train(input) method of the PipeOpFilter object and pass a list containing the task we want to filter.
Hint 2:
library(mlr3pipelines)
# Set the filter.nfeat parameter directly when constructing the PipeOp:
pofilter = po("...",
  filter = flt(...),
  ... = list(filter.nfeat = ...))

# Alternative (first create the filter PipeOp and then set the parameter):
pofilter = po("...", filter = flt(...))
pofilter$...$filter.nfeat = ...
pofilter

# Train the PipeOpFilter on the task
filtered_task = pofilter$train(input = list(...))
filtered_task
task
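A filled-in sketch of the first variant (assuming FSelectorRcpp is installed; $train() takes a list of inputs and returns a list of outputs, so the filtered task is extracted with [[1]]):

```r
library(mlr3verse)     # attaches mlr3pipelines and mlr3filters
library(FSelectorRcpp)

task = tsk("german_credit")

# PipeOpFilter that keeps the 5 features with the highest information gain
pofilter = po("filter",
  filter = flt("information_gain"),
  param_vals = list(filter.nfeat = 5))

# $train() expects a list of inputs and returns a list of outputs
filtered_task = pofilter$train(input = list(task))[[1]]
filtered_task$feature_names  # 5 features instead of the original 20
```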
Exercise 4: Combine PipeOpFilter with a Learner
Do the following tasks:
- Combine the PipeOpFilter from the previous exercise with a k-NN learner to create a so-called Graph (it can contain multiple preprocessing steps) using the %>>% operator.
- Convert the Graph to a GraphLearner so that it behaves like a new learner that first does feature filtering and then trains a model on the filtered data, and run the resample() function to estimate the performance of the GraphLearner with 5-fold cross-validation.
- Change the value of the filter.nfeat parameter (which was set to 5 in the previous exercise) and run resample() again.
Hint 1:
- Create a k-NN learner using lrn(). Remember that the shortcut for a k-NN classifier is "classif.kknn".
- You can concatenate different preprocessing steps and a learner using the %>>% operator.
- Use as_learner() to create a GraphLearner (see also the code examples in the help page ?GraphLearner).
Hint 2:
library(mlr3learners)
graph = ... %>>% lrn("...")
glrn = as_learner(...)
rr = resample(task = ..., learner = ..., resampling = ...)
rr$aggregate()

# Change `filter.nfeat` and run resampling again using the same train-test splits
...
rr2 = resample(task = ..., learner = ..., resampling = rr$resampling)
rr2$aggregate()
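A filled-in sketch of these steps (assuming FSelectorRcpp and kknn are installed; note that the PipeOp's id prefixes its parameters inside the GraphLearner, and here the id defaults to the filter key, so the parameter is named "information_gain.filter.nfeat" — check glrn$param_set if in doubt):

```r
library(mlr3verse)     # attaches mlr3pipelines, mlr3filters, mlr3learners
library(FSelectorRcpp)
library(kknn)          # backend of the "classif.kknn" learner

task = tsk("german_credit")

pofilter = po("filter",
  filter = flt("information_gain"),
  param_vals = list(filter.nfeat = 5))

# Graph: feature filtering followed by a k-NN learner
graph = pofilter %>>% lrn("classif.kknn")
glrn = as_learner(graph)

# Estimate performance with 5-fold cross-validation
rr = resample(task = task, learner = glrn, resampling = rsmp("cv", folds = 5))
rr$aggregate()

# Change filter.nfeat and resample again on the same train-test splits
glrn$param_set$values$information_gain.filter.nfeat = 10
rr2 = resample(task = task, learner = glrn, resampling = rr$resampling)
rr2$aggregate()
```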
Summary
We learned how to use feature filters to rank the features of a supervised task by their strength of relationship with the target variable and how to subset a task accordingly.
Ideally, feature filtering is directly incorporated into the learning procedure by making use of a pipeline so that performance estimation after feature filtering is not biased.