where \(\overline{y}_{R_t}\) is the mean of the outcome values \(y_i\) in partition \(R_t\)
Algorithm I
How are the partitions \(R_t\) determined?
Beginning from the root node, each split is determined by the variable and split value that minimize the RSS (regression problem) or the impurity (classification problem); see the sketch after this list
End up with the most complex tree explaining every data point
Pruning the tree
Pre-pruning: additional splits are created until the cost of additional depth becomes too high
Post-pruning: complex tree is created and cut down to a size where cost is met
Post-pruning is preferred because stopping too early risks missing a valuable split further down the tree
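A minimal sketch (not the rpart implementation) of this greedy search for a single numeric predictor; the names best_split, x, and y are made up for illustration:

# Greedy search for the best split of one numeric predictor:
# try every candidate threshold and keep the one with the lowest total RSS
best_split <- function(x, y) {
  candidates <- head(sort(unique(x)), -1)   # drop the maximum so both sides stay non-empty
  rss <- sapply(candidates, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(rss)], rss = min(rss))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5, 2, 0) + rnorm(100, sd = 0.3)
best_split(x, y)   # should find a split value close to 0.5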
Algorithm II
How is the tuning parameter \(\lambda\) determined?
Employing cross-validation, we choose the parameter value which minimizes some prediction error in the test/validation data (more on that later)
The exact implementation of the algorithm differs across software packages
Different packages may use different impurity/information measures (e.g., entropy) and different cost functions/tuning parameters
e.g., rpart uses post-pruning, where the complexity parameter cp regulates the minimum gain in \(R^2\) or the minimum decrease in Gini impurity required to create another node (a minimal example follows below)
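A minimal sketch of this workflow in rpart; the data frame df and outcome y are placeholders:

library(rpart)

# Grow a deliberately complex tree by setting cp very low ...
fit <- rpart(y ~ ., data = df, method = "anova",
             control = rpart.control(cp = 1e-4))

# ... inspect the cross-validated error for each cp value ...
printcp(fit)

# ... and prune back to the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)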
Growing the Tree
Advantages and Drawbacks
Trees have many advantages
Easy to understand and very flexible
No theoretical assumptions needed
Computationally cheap
Automatic feature selection
No limitation on the number of features
Do not necessarily discard observations
Drawbacks include
No immediate (causal) interpretation of the decisions
Algorithm is “greedy”
The curse of dimensionality
The Curse of Dimensionality
Volume of the support space increases exponentially with dimensions
Data points become dispersed in a high-dimensional space
Splits become very volatile
To mitigate this problem, random forests are used in practice
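A small simulation (not from the slides) illustrating the dispersion: the average distance between uniformly drawn points grows quickly with the number of dimensions.

set.seed(42)

# Average pairwise distance of 100 points drawn uniformly from the
# unit hypercube, for increasing dimension d
avg_dist <- sapply(c(1, 2, 10, 100, 1000), function(d) {
  x <- matrix(runif(100 * d), nrow = 100)
  mean(dist(x))
})
round(avg_dist, 2)   # distances grow roughly with sqrt(d): points drift apart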
A Perfect Summary I
Performance Measurement
Connecting to statistical theory
Root Mean Squared Error
For a regression problem, RMSE is typically used: \[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}} \]
Or alternatively the (adjusted) \(R^2\): \[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \]
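Both measures are straightforward to compute by hand; a minimal sketch with made-up vectors y and y_hat:

# Toy example: observed values y and model predictions y_hat
y     <- c(3.0, 4.5, 6.1, 8.2, 10.3)
y_hat <- c(2.8, 4.9, 5.8, 8.6,  9.9)

rmse <- sqrt(mean((y - y_hat)^2))
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
c(RMSE = rmse, R2 = r2)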
Bias–Variance Trade-Off
The mean squared error \(MSE\) of a prediction can be written as: \[ MSE = E\big[(y - \hat{y})^2\big] = \text{Bias}(\hat{y})^2 + \text{Var}(\hat{y}) + \sigma^2 \] where \(\sigma^2\) is the irreducible random error of the data-generating process
Let’s assume the Austrian population is getting tested for the Coronavirus
1% of the population is indeed infected, meaning \(P(C) = 0.01\)
A test is 98% accurate, in the sense that \(P(\text{Test}_C \mid C) = P(\text{Test}_{\neg C} \mid \neg C) = 0.98\)
What is the probability that I have the Coronavirus given that I tested positive, i.e., \(P(C \mid \text{Test}_C)\)?
Which value from the confusion matrix did we calculate here (and which ones were given)?
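Plugging the given numbers into Bayes' theorem: \[ P(C \mid \text{Test}_C) = \frac{P(\text{Test}_C \mid C)\,P(C)}{P(\text{Test}_C \mid C)\,P(C) + P(\text{Test}_C \mid \neg C)\,P(\neg C)} = \frac{0.98 \cdot 0.01}{0.98 \cdot 0.01 + 0.02 \cdot 0.99} \approx 0.33 \] So despite the seemingly accurate test, the probability is only about one in three. In confusion-matrix terms this is the precision (positive predictive value), computed from the given sensitivity, specificity, and base rate.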
Judging Performance in Classification Problems
All the indicators in the confusion matrix can be relevant
Accuracy depends on base rate
accuracy of 96% in sample with 95% positives → poor performance
accuracy of 80% in sample with 50% positives → good performance
Need for adjustment of values relative to a “naive prediction”
Concepts can be generalized to classification problems with more than two categories
Adjusted Performance Measures
Cohen’s kappa: \[ \kappa = \frac{Acc_{mod} - Acc_{0}}{1 - Acc_{0}} \] where \(Acc_{mod}\) is the accuracy of our model and \(Acc_{0}\) is the expected random accuracy
For any sensible model, it holds that \(0 < \kappa < 1\)
If \(\kappa < 0\), our model would be worse than guessing at random
\(\kappa\) tells you “how far you are away from predicting perfectly compared to a naive prediction”
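A minimal sketch in R, computing \(\kappa\) from a 2×2 confusion matrix with made-up counts:

# Confusion matrix: rows = predicted class, columns = actual class (made-up counts)
cm <- matrix(c(40, 10,
               15, 35), nrow = 2, byrow = TRUE)

n       <- sum(cm)
acc_mod <- sum(diag(cm)) / n                     # observed accuracy of the model
acc_0   <- sum(rowSums(cm) * colSums(cm)) / n^2  # expected accuracy of random guessing
(acc_mod - acc_0) / (1 - acc_0)                  # Cohen's kappa, here 0.5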
ROC- and PR-Curve
There is a trade-off between these measures (e.g., sensitivity vs. specificity), which we can exploit by setting the cut-off point for the predicted scores
Receiver Operating Characteristic (ROC) curve
Plot of sensitivity against 1-specificity
Helps identify which cut-off point to choose
Precision-Recall (PR) curve
Plot of precision against sensitivity (recall)
Might be preferred over ROC when categories are very imbalanced
Receiver Operating Characteristic
The trade-off through the setting of the cut-off point can be visualized in a ROC curve
Receiver Operating Characteristic
ROC criterion
There is a point on the curve where the losses in sensitivity and specificity are equal
This value can be used to determine an “optimal” cut-off point
Sometimes high sensitivity is preferred over specificity (e.g., medical tests), or vice versa
Area Under the Curve (AUC)
The area underneath the ROC curve can be used as another performance measure
Different models achieve different sensitivity and specificity at the same cut-off point
The model with the highest AUC is best at holding the trade-off low
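A minimal sketch of these ideas using the pROC package (not mentioned in the slides); labels and scores are simulated toy data:

library(pROC)

set.seed(1)
labels <- rbinom(200, 1, 0.5)          # toy true classes
scores <- 0.5 * labels + runif(200)    # toy predicted scores

roc_obj <- roc(response = labels, predictor = scores)
plot(roc_obj)                          # sensitivity vs. 1 - specificity
auc(roc_obj)                           # area under the ROC curve

# One common choice of "optimal" cut-off (Youden's index)
coords(roc_obj, x = "best", ret = c("threshold", "sensitivity", "specificity"))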
The Infamous “AMS-Algorithmus” I
The AMAS has been criticized for its lack of transparency (among other things)
One of the few publicly available documents states that the algorithm, which contrary to public belief was never put into regular operation, is based on a logit model
Or rather, two logit models:
One predicting your short-term chance of labor market integration (assignment to group A if chances are high)
One predicting your long-term labor market integration (assignment to group C if chances are low), given a set of demographic variables and your labor market history
The assignment to groups A, B, and C determines if you are eligible for certain types of subsidies
The Infamous “AMS-Algorithmus” II
The documentation correctly states that the cut-off point can be chosen in a way that balances sensitivity and specificity (ignoring the document's strange definitions of sensitivity and specificity)
The Infamous “AMS-Algorithmus” III
What is shown here (and what isn’t)?
The Infamous “AMS-Algorithmus” IV
The cut-off points were set (manually) at 25% for group C and 66% for group A
By setting the cut-off points low for group C and high for group A, they achieved high precision in those two classes
The precision in group B is not shown, nor are other measures that would help us judge the predictive performance
Cross Validation
The real magic behind supervised machine learning
The Problem of Overfitting
Models can achieve an extremely good in-sample fit through the use of many variables and high tree depth
Danger of underestimating the random error \(\sigma^2\) of the data-generating process
Results in a model with low in-sample prediction error but high out-of-sample error
The Solution to Overfitting
Split your data into a training data set and a test data set
The model is estimated using the training data, and performance measures are calculated using only the test data
Repeat the process for different parameter values (e.g., for the penalty term \(\lambda\)) and choose the value which optimizes some performance measure over the test data
This serves two purposes
To obtain a performance measure that is independent of the sample on which the model was fit
To choose hyperparameters such as \(\lambda\) (regularization)
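A hedged sketch of a caret call that would produce output of the kind shown below; the data frame df, the outcome status, and the cp grid are placeholders, not the actual data behind the slides:

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

tree_caret <- train(status ~ ., data = df,
                    method    = "rpart",            # CART via rpart
                    metric    = "ROC",              # pick cp maximizing cross-validated AUC
                    tuneGrid  = data.frame(cp = c(5e-4, 1e-3, 5e-3, 5e-2)),
                    trControl = ctrl)

tree_caret   # prints the resampling summary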
CART
4830 samples
3 predictor
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 4346, 4348, 4347, 4347, 4347, 4347, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
5e-04 0.7173981 0.6741891 0.6543656
1e-03 0.7169620 0.7072193 0.6465661
5e-03 0.6996839 0.6867633 0.6737720
5e-02 0.6612506 0.7439858 0.5740977
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 5e-04.
Extracting the Model
library(rpart.plot)

# Extract the final rpart tree from the caret object and plot it
tree <- tree_caret$finalModel
rpart.plot(tree, box.palette = "RdBu", nn = FALSE, type = 2)