Feature Importance in Decision Trees

2025-02-28

In decision trees, feature importance quantifies how much each feature contributes to the model’s predictive performance. It is calculated from the reduction in a chosen impurity measure achieved by splits on that feature, accumulated across all nodes in the tree. The impurity measure is typically Gini impurity or entropy for classification tasks, and variance for regression tasks.

Calculating feature importance

Impurity Reduction: When a node is split, the impurity decreases. This reduction is the difference between the impurity of the parent node and the weighted sum of the impurities of its children, with each child weighted by its share of the parent’s samples.
Aggregating Impurity Reductions: For each feature, sum the impurity reductions over all nodes where the feature is used for splitting.
Normalization: Divide each feature’s total impurity reduction by the sum of all features’ totals, so that the importances sum to 1.

Given a node $N$, let $I(N)$ denote its impurity and $|N|$ the number of training samples that reach it. If $N_L$ and $N_R$ are its children, then the impurity reduction for $N$ is

$$ \Delta I(N) = I(N) - \left( \frac{|N_L|}{|N|} I(N_L) + \frac{|N_R|}{|N|} I(N_R) \right) $$
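As a quick numeric check of this formula, here is a minimal sketch that uses Gini impurity as $I$. The node contents (four samples each of classes "a" and "b") and the helper names `gini` and `impurity_reduction` are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right, impurity=gini):
    """Delta I(N): parent impurity minus the size-weighted child impurities."""
    n = len(parent)
    weighted_children = (len(left) / n) * impurity(left) \
                      + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted_children

# Hypothetical node: 4 samples of class "a" and 4 of class "b".
parent = ["a"] * 4 + ["b"] * 4
left = ["a"] * 3 + ["b"] * 1   # mostly "a"
right = ["a"] * 1 + ["b"] * 3  # mostly "b"

delta = impurity_reduction(parent, left, right)
print(delta)  # 0.125: parent Gini 0.5, each child Gini 0.375
```

A perfect split of the same parent (all "a" on one side, all "b" on the other) would give the maximum reduction of 0.5, since both children would then have zero impurity.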

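The importance of feature $f$, $FI(f)$, is calculated as

$$ FI(f) = \frac{1}{Z} \sum_{N \text{ uses } f} \frac{|N|}{|T|} \Delta I(N) $$

where $|T|$ is the total number of training samples. Each node’s reduction is weighted by the fraction of samples that reach it; otherwise a split deep in the tree that affects only a few samples would count as much as the root split. The normalizer is

$$ Z = \sum_f \sum_{N \text{ uses } f} \frac{|N|}{|T|} \Delta I(N) $$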
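The aggregation and normalization steps can be sketched on a toy tree. The node list, feature names, and $\Delta I$ values below are invented for illustration. The sketch follows the common convention (used, for example, by scikit-learn's impurity-based importances) of weighting each node's reduction by the fraction of training samples that reach it.

```python
from collections import defaultdict

# Hypothetical internal nodes of a fitted tree:
# (splitting feature, samples reaching the node, impurity reduction Delta I).
nodes = [
    ("age", 100, 0.20),     # root split
    ("income", 60, 0.10),   # left subtree
    ("age", 40, 0.05),      # right subtree
]

def feature_importances(nodes, n_total):
    """Sum sample-weighted impurity reductions per feature, then normalize."""
    raw = defaultdict(float)
    for feature, n_samples, delta in nodes:
        raw[feature] += (n_samples / n_total) * delta
    z = sum(raw.values())
    if z == 0:
        # No split reduced impurity; leave the importances at zero.
        return dict(raw)
    return {feature: value / z for feature, value in raw.items()}

fi = feature_importances(nodes, n_total=100)
for feature, value in sorted(fi.items()):
    print(f"{feature}: {value:.3f}")
```

Here "age" accumulates 1.0 × 0.20 + 0.4 × 0.05 = 0.22 and "income" 0.6 × 0.10 = 0.06, so after dividing by Z = 0.28 the importances are roughly 0.786 and 0.214.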

Considerations

Features with many unique values (e.g., continuous variables or high-cardinality categoricals) may appear more important because they offer more candidate split points, so some split is likely to reduce impurity on the training data by chance, even if the feature is not truly more informative.
To address this bias, consider using permutation feature importance, which measures how much model performance drops when a feature’s values are randomly shuffled, ideally on held-out data.
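Permutation importance can be illustrated with a small stdlib-only sketch. The `predict` function, the data, and the helper names are hypothetical, and accuracy is just one possible performance metric. In practice a library routine such as scikit-learn's `sklearn.inspection.permutation_importance` would normally be used instead.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the target."""
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model: thresholds feature 0 and ignores feature 1 entirely.
def predict(row):
    return int(row[0] > 0.5)

X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
y = [predict(row) for row in X]

imp = permutation_importance(predict, X, y)
print(imp)  # feature 1 scores exactly 0.0 because the model never reads it
```

Because the score is based on the change in predictive performance rather than on how often or how finely a feature can be split, a feature the model does not rely on scores near zero regardless of how many unique values it has.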
