where \(\overline{y}_{R_t}\) is the mean of the outcome values in partition \(R_t\)
Algorithm I
How are the partitions \(R_t\) determined?
Starting from the root node, each split is chosen as the variable and cut-off value that minimize the RSS (regression problems) or an impurity measure (classification problems)
Left unchecked, this ends in the most complex possible tree, explaining every data point perfectly
Pruning the tree
Pre-pruning: additional splits are created only until the cost of further depth becomes too high
Post-pruning: a maximally complex tree is grown first and then cut back to a size where the cost criterion is met
Post-pruning is preferred because stopping too early risks missing a valuable split further down the tree
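As a rough illustration of the split search, here is a minimal base-R sketch (the helper `best_split` is ours, not from any package; numeric predictors only) that scans every variable and cut-off value and returns the split minimizing the RSS:

```r
# Minimal sketch: exhaustive search for the single split (variable, value)
# that minimizes the residual sum of squares, as in one step of CART.
best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[[j]])) {
      left  <- y[X[[j]] <= s]
      right <- y[X[[j]] >  s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) {
        best <- list(var = names(X)[j], value = s, rss = rss)
      }
    }
  }
  best
}
```

The full algorithm then applies the same search recursively within each resulting partition.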
Algorithm II
How is the tuning parameter \(\lambda\) determined?
Employing cross-validation, we choose the parameter value which minimizes some prediction error in the test/validation data (more on that later)
Exact implementation of the algorithm differs
Different packages may use different impurity/information measures (e.g., entropy) and different cost functions/tuning parameters
e.g., rpart uses post-pruning, where the complexity parameter cp sets the minimum gain in \(R^2\) (regression) or the minimum decrease in Gini impurity (classification) required to create another split
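A minimal sketch of post-pruning with rpart, using the built-in iris data as a stand-in for our application:

```r
library(rpart)

# Grow a deliberately complex tree by setting cp very low ...
fit <- rpart(Species ~ ., data = iris, cp = 0.001)
printcp(fit)   # table of cp values with cross-validated error

# ... then cut it back: keep only splits whose gain exceeds cp = 0.01
pruned <- prune(fit, cp = 0.01)
```

`prune()` removes every split that does not meet the stated cp threshold, so the pruned tree is a subtree of the fully grown one.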
Growing the Tree
Advantages and Drawbacks
Trees have many advantages
Easy to understand and very flexible
No theoretical assumptions needed
Computationally cheap
Automatic feature selection
No limitation in number of features
Do not necessarily discard observations
Drawbacks include
No immediate (causal) interpretation of the decision rules
Algorithm is “greedy”
The curse of dimensionality
The Curse of Dimensionality
Volume of the support space increases exponentially with dimensions
Data points become dispersed in a high-dimensional space
Splits become very volatile
To mitigate the problem, random forests are used in practice
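A quick base-R sketch of the dispersion: the average distance from each of 100 uniform points to its nearest neighbour grows sharply with the dimension (the helper `avg_nn_dist` is ours, for illustration only):

```r
set.seed(1)

# average distance from each of n uniform points to its nearest neighbour
avg_nn_dist <- function(p, n = 100) {
  x <- matrix(runif(n * p), nrow = n)
  d <- as.matrix(dist(x))
  diag(d) <- Inf                # exclude each point's zero distance to itself
  mean(apply(d, 1, min))
}

sapply(c(2, 10, 100), avg_nn_dist)  # distances grow with the dimension p
```

With the same number of points, neighbourhoods that were dense in two dimensions become nearly empty in one hundred, which is exactly what makes the splits volatile.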
Synthetic Unemployment Data
synthetic_unemployment_data <- read_parquet("data/synthetic_unemployment_data.parquet")

set.seed(123)
data <- synthetic_unemployment_data |>
  mutate(
    train_index = sample(c("train", "test"),
                         nrow(synthetic_unemployment_data),
                         replace = TRUE, prob = c(0.75, 0.25))
  )

train <- data |> filter(train_index == "train")
test  <- data |> filter(train_index == "test")
[Data preview omitted: a glimpse of seven training observations across 63 variables.]
The variables comprise: the targets target_high and target_low; personal characteristics (region_1, promised_employment, benefits, benefits_amount, social_security, sex, age, education, family_situation, nationality, disability, nace1, job_sector, asylum, children, age_youngest_child, migrational_background); labor market status indicators (employment, employment_unsubsidized, unemployment, out_of_labor_force) at horizons of 1 and 3 months, 6 months, 1 year, and 2 years; days in unsubsidized employment, unemployment, and insured out-of-labor-force spells over the last 2, 5, and 10 years; days_to_last_job and income_last_job; subsidy receipt (employment, qualification, and support subsidies over 1 and 4 years); PES contacts and job mediations (last 6 months and 2 years); regional indicators (regional_unemployment, regional_long_time_joblessness, regional_seasonal_unemployment, regional_promise_employment, regional_gdp, regional_job_openings); and the train_index created above.
Fitting Trees with RPART
tree <- rpart(
  target_low ~ days_unemployment_2j + age + days_to_last_job,
  data = train |> select(-train_index, -target_high),
  cp = 0.007
)
tree
Let’s assume the Austrian population is getting tested for the Coronavirus
1% of the population is indeed infected, meaning \(P(C) = 0.01\)
A test is 98% accurate, in the sense that \(P(\text{Test}_C \mid C) = P(\text{Test}_{\neg C} \mid \neg C) = 0.98\)
What is the probability that I have the Coronavirus given that I tested positive, \(P(C \mid \text{Test}_C)\)?
Which value from the confusion matrix did we calculate here (and which ones were given)?
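The calculation follows directly from Bayes' theorem; in R:

```r
# P(C | Test_C) = P(Test_C | C) P(C) / P(Test_C), where
# P(Test_C)    = P(Test_C | C) P(C) + P(Test_C | not C) P(not C)
p_c              <- 0.01        # prevalence, P(C)
p_pos_given_c    <- 0.98        # sensitivity, P(Test_C | C)
p_pos_given_notc <- 1 - 0.98    # false positive rate, 1 - specificity

p_pos <- p_pos_given_c * p_c + p_pos_given_notc * (1 - p_c)
p_c_given_pos <- p_pos_given_c * p_c / p_pos
p_c_given_pos   # roughly 0.33: even after a positive test, infection is unlikely
```

In confusion-matrix terms, this is the positive predictive value (precision), computed from the given sensitivity and specificity together with the base rate.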
Judging Performance in Classification Problems
All the indicators in the confusion matrix can be relevant
Accuracy depends on base rate
accuracy of 96% in sample with 95% positives → poor performance
accuracy of 80% in sample with 50% positives → good performance
Need for adjustment of values relative to a “naive prediction”
Concepts can be generalized to classification problems with more than two categories
Adjusted Performance Measures
Cohen’s kappa: \[ \kappa = \frac{Acc_{mod} - Acc_{0}}{1 - Acc_{0}} \] where \(Acc_{mod}\) is the accuracy of our model and \(Acc_{0}\) is the expected random accuracy
For any sensible model, it holds that \(0 < \kappa < 1\)
If \(\kappa < 0\), our model would be worse than guessing at random
\(\kappa\) tells you “how far you are away from predicting perfectly compared to a naive prediction”
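A small sketch computing \(\kappa\) from a 2×2 confusion matrix (the counts are hypothetical, chosen only for illustration):

```r
# hypothetical confusion matrix: rows = predicted, columns = actual
conf <- matrix(c(40,  5,
                 10, 45),
               nrow = 2, byrow = TRUE,
               dimnames = list(pred = c("pos", "neg"),
                               true = c("pos", "neg")))

n <- sum(conf)
acc_mod <- sum(diag(conf)) / n                       # observed accuracy
acc_0   <- sum(rowSums(conf) * colSums(conf)) / n^2  # expected random accuracy
kappa_val <- (acc_mod - acc_0) / (1 - acc_0)
kappa_val   # 0.7 for these counts
```

Here the expected random accuracy follows from the marginal distributions of predictions and true labels, which is exactly the "naive prediction" baseline above.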
The AMAS has been criticized for intransparency (among other things)
One of the few publicly available documents states that a logit model was used for this algorithm (which, contrary to public belief, was never put into operation)
Or rather, two logit models:
One predicting your short-term chance of labor market integration (assignment to group A if chances are high)
One predicting your long-term labor market integration (assignment to group C if chances are low), given a set of demographic variables and your labor market history
The assignment to groups A, B, and C determines if you are eligible for certain types of subsidies
The Infamous “AMS-Algorithmus” II
In the documentation, it is correctly stated that the cut-off point can be chosen to balance sensitivity and specificity (setting aside the document's unusual definitions of the two)
The Infamous “AMS-Algorithmus” III
What is shown here (and what isn’t)?
The Infamous “AMS-Algorithmus” IV
The cut-off points were set (manually) at 25% for group C and 66% for group A
By setting the cut-off points low for group C and high for group A, they achieved high precision in those two classes
The precision in group B is not shown, nor are other measures that would help us judge the predictive performance
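A heavily simplified sketch of how such cut-offs could map the two predicted probabilities to groups (the exact rule, the boundary handling, and the order of the checks are our assumptions, not taken from the AMAS documentation):

```r
# p_short: predicted short-term integration probability (group A if high)
# p_long:  predicted long-term integration probability  (group C if low)
# cut-offs 0.66 and 0.25 as reported; strict inequalities are an assumption
assign_group <- function(p_short, p_long) {
  ifelse(p_short > 0.66, "A",
         ifelse(p_long < 0.25, "C", "B"))
}

assign_group(p_short = c(0.70, 0.50, 0.40),
             p_long  = c(0.80, 0.60, 0.10))   # "A" "B" "C"
```

Everyone not captured by either extreme cut-off lands in group B, which is why the unreported precision for that group matters.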
Cross Validation
The real magic behind supervised machine learning
The Problem of Overfitting
Models can achieve an extremely good in-sample fit by using many variables and high tree depth
Danger of underestimating the random error \(\sigma^2\) of the data-generating process
Results in a model with low in-sample prediction error but high out-of-sample error
The Solution to Overfitting
Split your data into a training data set and a test data set
The model is estimated using the training data, and performance measures are calculated using only the test data
Repeat the process for different parameter values (e.g., for the penalty term \(\lambda\)) and choose the value which optimizes some performance measure over the test data
This serves two purposes
As a performance measure independent of the sample where the model was fit
To choose hyperparameters such as \(\lambda\) (regularization)
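A minimal sketch of this tuning loop: 5-fold cross-validation with rpart's cp taking the role of \(\lambda\), and the built-in mtcars data as a stand-in for our application:

```r
library(rpart)
set.seed(42)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold assignment
cps <- c(0.001, 0.01, 0.1)                            # candidate parameter values

cv_mse <- sapply(cps, function(cp) {
  mean(sapply(1:k, function(i) {
    fit  <- rpart(mpg ~ ., data = mtcars[folds != i, ], cp = cp)
    pred <- predict(fit, mtcars[folds == i, ])
    mean((mtcars$mpg[folds == i] - pred)^2)  # error on the held-out fold
  }))
})

cps[which.min(cv_mse)]  # parameter value minimizing cross-validated error
```

Each candidate value is judged only on folds the model never saw during fitting, which is what protects the choice against overfitting.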
CART
4830 samples
3 predictor
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 4346, 4348, 4347, 4347, 4347, 4347, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
5e-04 0.7173981 0.6741891 0.6543656
1e-03 0.7169620 0.7072193 0.6465661
5e-03 0.6996839 0.6867633 0.6737720
5e-02 0.6612506 0.7439858 0.5740977
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 5e-04.
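The output above is consistent with a caret call along these lines (a sketch reconstructed from the printed summary; the actual call from the course materials is not shown here):

```r
library(caret)

# repeated 10-fold cross-validation, ROC (AUC) as the selection metric
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

tree_caret <- train(target_low ~ days_unemployment_2j + age + days_to_last_job,
                    data = train, method = "rpart", metric = "ROC",
                    tuneGrid = data.frame(cp = c(5e-04, 1e-03, 5e-03, 5e-02)),
                    trControl = ctrl)
```

`twoClassSummary` is what produces the ROC/Sens/Spec columns, and `metric = "ROC"` matches the note that the largest ROC value selected the final cp.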
Extracting the Model
tree <- tree_caret$finalModel
rpart.plot(tree, box.palette = "RdBu", nn = FALSE, type = 2)