Supervised Machine Learning

An Introduction using Decision Tree Learning

Introduction

What the fuss is about

What is Machine Learning?

  • Pattern recognition: abstraction of patterns and regularities from data
  • Predictive modelling: optimized prediction of outcome given a set of inputs
  • Broad range of methods and models
  • Methods are often a “black box” compared to estimating causal effects
  • Often used synonymously with artificial intelligence (AI)

Branches of Machine Learning

  • Supervised learning:
    • Data set already contains values for outcomes
    • Parameters are optimized in a training data set in such a way that the predictions are as “accurate” as possible in a test data set
    • e.g., decision trees, linear models with regularization term, neural networks, etc.
  • Unsupervised learning:
    • There is no predetermined classification or measure of some outcome variable
    • Rather, the algorithm comes up with a systematization of the input data (the data “speaks to us”)
    • e.g., principal component analysis, clustering algorithms, etc.
  • Reinforcement learning:
    • The algorithm is rewarded for ending up in a “good spot” and learns to choose the best sequence of actions (the best path through the game)
    • e.g., robotic movement, gaming, etc.

Supervised Learning in a Nutshell

  • An outcome is to be predicted
    • Continuous outcome → regression problem
    • Discrete outcome → classification problem
  • Outcome for a set of input variables (features) is known (labelled/annotated)
  • Data are split into training and test sets
    • Model is fit to the training data
    • Performance is evaluated on the test data
    • Parameters are chosen to optimize performance
    • Model can then be used to predict in new data

Uses of Machine Learning

  • Spam filter
  • Recommender systems
  • Optical character recognition
  • Natural language processing
  • Data imputation
  • Prediction models in science
  • Statistical profiling

Classification and Regression Trees

A simple and powerful algorithm to understand

CART

  • CART (Classification and Regression Trees) methodology by Leo Breiman et al. (1984)
  • Algorithmically: Use of binary decision rules for dealing with classification and regression problems
  • Geometrically: Partitioning of the support space spanned by a set of predictors
  • Statistically: Assigning an outcome by fitting a very simple function (a piecewise constant) over partitions
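
To make the “piecewise constant” point concrete, here is a minimal base-R sketch on hypothetical data (all names and values are illustrative):

set.seed(1)
x <- runif(100)                                     # a single predictor
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.2)   # step function plus noise

split <- 0.5                                        # candidate split point
y_hat <- ifelse(
  x < split,
  mean(y[x < split]),    # constant prediction on the left partition
  mean(y[x >= split])    # constant prediction on the right partition
)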

Tree Graph and Partitioning of the Support Space

[Figure: tree graph and the corresponding partitioning of the support space]

Tree Graph and Regression Line

[Figure: tree graph and the corresponding regression line]

Classification Problem

  • For a discrete response, the underlying minimization problem can be written as:

\[ \min_{R_t} \left\{ \sum_{t=1}^{T} w_t \cdot G_{R_t} + C(\lambda, T) \right\} \]

    • \(R_t\) is the partition on the support corresponding to the \(t\)-th terminal node
    • \(T\) is the number of terminal nodes of the tree
    • \(w_t\) is the fraction of observations in the \(t\)-th terminal node
    • \(G_{R_t}\) is the Gini impurity for partition \(R_t\) / the \(t\)-th terminal node
    • \(C(\lambda, T)\) is some cost function, sometimes called a regularization term
    • (\(\lambda\) is the tuning parameter that decides the complexity of the model by adding cost)

Example of Gini Impurity

  • The Gini impurity is a measure of information given by \[ G = 1 - \sum_{j=1}^{J} p_j^2 \]
  • Where \(p_j\) is the fraction of class \(j\)
  • Looking at the first node from the classification tree from before, we obtain \[ G = 1 - 0.49^2 - 0.51^2 = 0.4998 \]
  • By splitting at the first node we get a weighted impurity of \[ G = 0.56*(1 - 0.37^2 - 0.63^2) + 0.44*(1 - 0.66^2 - 0.34^2) = 0.4585 \]
  • In the terminal nodes we end up with a weighted Gini impurity of: \[ G = 0.47*(1 - 0.32^2 - 0.68^2) + 0.06*(1 - 0.58^2 - 0.42^2) + ... = 0.4484 \]
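
These values are easy to verify; a quick base-R check of the numbers above:

gini <- function(p) 1 - sum(p^2)   # Gini impurity for a vector of class shares

gini(c(0.49, 0.51))                                        # root node: 0.4998
0.56 * gini(c(0.37, 0.63)) + 0.44 * gini(c(0.66, 0.34))    # first split: ~0.4585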

Regression Problem

  • For a continuous response, the underlying minimization problem can be written as

    \[ \min_{R_t} \left\{ \; \sum_{t=1}^{T} \sum_{x_i \in R_t} (y_i - \overline{y}_{R_t})^2 \; + \; {C}(\lambda,T) \right\} \]

where \(\overline{y}_{R_t}\) is the mean of the outcome values \(y_i\) in partition \(R_t\)

Algorithm I

  • How are the partitions \(R_t\) determined?
    • Beginning from the first node, each split is determined by the variable and value which minimizes the RSS (regression problem) or impurity (classification problem)
    • Left unchecked, this ends in the most complex tree, which fits every single data point
  • Pruning the tree
    • Pre-pruning: splitting stops as soon as the cost of additional depth outweighs the gain
    • Post-pruning: complex tree is created and cut down to a size where cost is met
    • Post-pruning is preferred because stopping too early bears the danger of missing some split down the line

Algorithm II

  • How is the tuning parameter \(\lambda\) determined?
    • Employing cross-validation, we choose the parameter value which minimizes some prediction error in the test/validation data (more on that later)
  • Exact implementation of the algorithm differs
    • Different packages may use different impurity/information measures (e.g., entropy) and different cost functions/tuning parameters
    • e.g., rpart uses post-pruning where the complexity parameter cp regulates the minimum gain in \(R^2\) or minimum decrease of Gini impurity to create another node
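
With rpart directly, the internal cross-validation results can be inspected and used for post-pruning; a minimal sketch, assuming a fitted rpart object called fitted_tree:

printcp(fitted_tree)                      # cp table with cross-validated error (xerror)
plotcp(fitted_tree)                       # plot cross-validated error against cp
pruned <- prune(fitted_tree, cp = 0.01)   # post-prune at a chosen cp value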

Growing the Tree

Advantages and Drawbacks

  • Trees have many advantages
    • Easy to understand and very flexible
    • No theoretical assumptions needed
    • Computationally cheap
    • Automatic feature selection
    • No limitation in number of features
    • Do not necessarily discard observations
  • Drawbacks include
    • No immediate (causal) interpretation of decision
    • Algorithm is “greedy”
    • The curse of dimensionality

The Curse of Dimensionality

  • Volume of the support space increases exponentially with dimensions
    • Data points become dispersed in a high-dimensional space
    • Splits become very volatile
  • To mitigate the problem, random forest is used in practice
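
As a sketch of that mitigation, using the randomForest package and the train data constructed in the next section (the formula and ntree are illustrative, and target_low is assumed to be a factor):

library(randomForest)   # bagging over trees with random feature subsets

rf <- randomForest(
  target_low ~ days_unemployment_2j + age + days_to_last_job,
  data = train,
  ntree = 500,          # number of trees in the ensemble
  na.action = na.omit   # drop incomplete rows for this sketch
)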

Synthetic Unemployment Data

library(arrow)   # read_parquet()
library(dplyr)   # mutate(), filter(), select()

synthetic_unemployment_data <- read_parquet("data/synthetic_unemployment_data.parquet")

set.seed(123)   # make the random train/test assignment reproducible

# Randomly assign each observation to the training (75%) or test (25%) set
data <- synthetic_unemployment_data |>
  mutate(
    train_index = sample(
      c("train", "test"),
      nrow(synthetic_unemployment_data),
      replace = TRUE,
      prob = c(0.75, 0.25)
    )
  )

train <- data |>
  filter(train_index == "train")

test <- data |>
  filter(train_index == "test")

[Table: first rows of the synthetic unemployment data — roughly 60 variables per observation, including the outcomes (target_high, target_low), region, demographics (sex, age, education, family situation, nationality, disability, migrational background), benefits received, industry (nace1) and job sector, detailed employment, unemployment, and out-of-labor-force histories over horizons from 1 month to 10 years, previous subsidies and PES contacts, regional labor market indicators, and the train/test assignment (train_index)]

Fitting Trees with RPART

library(rpart)        # CART implementation
library(rpart.plot)   # plotting fitted trees

tree <- rpart(
  target_low ~ days_unemployment_2j + age + days_to_last_job,
  data = train |> select(-train_index, -target_high),
  cp = 0.007   # complexity parameter controlling tree size
  )

tree
n= 4830 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 4830 2356 unsuccessful (0.4877847 0.5122153)  
   2) days_to_last_job< 409.5 3183 1270 successful (0.6010053 0.3989947)  
     4) age< 53.5 2804 1017 successful (0.6373039 0.3626961)  
       8) age>=18.5 2563  878 successful (0.6574327 0.3425673)  
        16) days_to_last_job< 116.5 1873  581 successful (0.6898025 0.3101975) *
        17) days_to_last_job>=116.5 690  297 successful (0.5695652 0.4304348)  
          34) days_unemployment_2j< 515.5 567  215 successful (0.6208113 0.3791887) *
          35) days_unemployment_2j>=515.5 123   41 unsuccessful (0.3333333 0.6666667) *
       9) age< 18.5 241  102 unsuccessful (0.4232365 0.5767635) *
     5) age>=53.5 379  126 unsuccessful (0.3324538 0.6675462) *
   3) days_to_last_job>=409.5 1647  443 unsuccessful (0.2689739 0.7310261) *
rpart.plot(tree, box.palette = "RdBu", nn = FALSE, type = 1)

A Perfect Summary I

Performance Measurement

Connecting to statistical theory

Root Mean Squared Error

  • For a regression problem, RMSE is typically used: \[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}} \]
  • Or alternatively the (adjusted) \(R^2\): \[ R^2 = 1 - \frac{\sum_{i=1}^{n}({y_i}-\hat{y_i})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \]
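
Both are one-liners in base R; a minimal sketch, assuming vectors y (observed) and y_hat (predicted):

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}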

Bias–Variance Trade-Off

  • The mean squared error \(MSE\) of a prediction can be written as:

\[ MSE(y, \hat{f}(x)) = \mathbb{E}\left[(y - \hat{f}(x))^2\right] \]

  • Where \(y = f(x) + \epsilon\) is the true data generating process, with \(\mathbb{E}[\epsilon] = 0\) and \(\mathrm{Var}(\epsilon) = \sigma^2\)
  • And \(\hat{f}(x)\) is the function predicting \(y\) given predictors \(x\)

\[ MSE(y, \hat{f}(x)) = \underbrace{\left(\mathbb{E}\left[f(x) - \hat{f}(x)\right]\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\mathbb{E}\left[(y - f(x))^2\right]}_{\text{irreducible error}} \]

\[ = \underbrace{\mathrm{bias}^2 + \mathrm{variance}}_{\text{reducible error}} + \sigma^2 \]

Trade-Off

  • It is possible to specify \(\hat{f}(x)\) in a way that introduces a bias but reduces the variance, so that the \(MSE\) decreases overall
  • Notice that this is well in line with what we know from the theory of the linear model
  • To exploit this trade-off in a linear model, we can add a regularization term

Ridge Regression and LASSO

  • If we introduce the term \(\lambda \cdot \|\beta\|_2^2\) into the well-known minimization task of a linear model, this is called ridge regression

    \[\widehat{\beta}_{RIDGE} = \arg\min_\beta \left\{ \|y - X\beta\|_2^2 + \lambda \cdot \|\beta\|_2^2 \right\}\]

  • If we use the penalty \(\lambda \cdot \|\beta\|_1\) instead, we get the LASSO estimator
  • As a result, the coefficients shrink towards zero (and in many cases the \(MSE\) decreases)

Properties of Shrinkage Estimators

  • This can also be used as a method for variable selection
    • Coefficients of variables with low predictive power shrink close to zero (\(\widehat{\beta}_{RIDGE}\)) or to zero (\(\widehat{\beta}_{LASSO}\))
  • Introduces downward bias into the coefficients, \(\mathbb{E}\left[\widehat{\beta}_{RIDGE}\right] < \beta\)
  • For large \(k\) and fixed \(n\) we often observe \(MSE(y, \hat{f}_{RIDGE}(x)) < MSE(y, \hat{f}_{OLS}(x))\)
  • For sample size \(n \to \infty\) and a fixed number of coefficients \(k\), \(\widehat{\beta}_{RIDGE} \to \widehat{\beta}_{OLS}\)
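
A minimal sketch with the glmnet package, assuming a numeric predictor matrix X and a response vector y (names are illustrative):

library(glmnet)   # penalized linear models

cv_ridge <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge; alpha = 1: LASSO
coef(cv_ridge, s = "lambda.min")         # coefficients at the CV-optimal lambda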

Coefficient Shrinkage

[Interactive figure: coefficient paths shrinking as the regularization term \(\lambda\) increases]

Confusion Matrix

  • For a classification problem, we can look at a confusion matrix
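
As a reminder, the generic two-class layout and the standard measures derived from it:

                        Reference: positive     Reference: negative
  Predicted positive    true positive (TP)      false positive (FP)
  Predicted negative    false negative (FN)     true negative (TN)

  • Sensitivity (recall) \(= \frac{TP}{TP+FN}\), specificity \(= \frac{TN}{TN+FP}\)
  • Precision (positive predictive value) \(= \frac{TP}{TP+FP}\), accuracy \(= \frac{TP+TN}{n}\)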

Base-Rate Fallacy

  • Beware of the base rate fallacy:
    • Let’s assume the Austrian population is getting tested for the Coronavirus
    • 1% of the population is indeed infected, meaning \(P(C) = 0.01\)
    • A test is 98% accurate, in the sense that \(P(\text{Test}_C \mid C) = P(\text{Test}_{\neg C} \mid \neg C) = 0.98\)
  • What is the probability that I have the Coronavirus given that I tested positive, \(P(C \mid \text{Test}_C)\)?
  • Which value from the confusion matrix did we calculate here (and which ones were given)?
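
Applying Bayes’ rule to the numbers above gives the answer:

\[ P(C \mid \text{Test}_C) = \frac{P(\text{Test}_C \mid C)\,P(C)}{P(\text{Test}_C \mid C)\,P(C) + P(\text{Test}_C \mid \neg C)\,P(\neg C)} = \frac{0.98 \cdot 0.01}{0.98 \cdot 0.01 + 0.02 \cdot 0.99} \approx 0.33 \]

So a positive result implies only about a one-in-three chance of infection; in confusion matrix terms this is the positive predictive value (precision), computed here from the given sensitivity, specificity, and prevalence.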

Judging Performance in Classification Problems

  • All the indicators in the confusion matrix can be relevant
  • Accuracy depends on base rate
    • accuracy of 96% in sample with 95% positives → poor performance
    • accuracy of 80% in sample with 50% positives → good performance
  • Need for adjustment of values relative to a “naive prediction”
  • Concepts can be generalized to classification problems with more than two categories

Adjusted Performance Measures

  • Cohen’s kappa: \[ \kappa = \frac{Acc_{mod} - Acc_{0}}{1 - Acc_{0}} \] where \(Acc_{mod}\) is the accuracy of our model and \(Acc_{0}\) is the expected random accuracy
  • For any sensible model, it holds that \(0 < \kappa < 1\)

  • If \(\kappa < 0\), our model would be worse than guessing at random

  • \(\kappa\) tells you “how far you are away from predicting perfectly compared to a naive prediction”

Confusion Matrix in R

library(caret)   # confusionMatrix()

test$prediction_tree <- predict(
  tree,
  newdata = test,
  type = "class"   # return predicted classes rather than probabilities
  )

confusion <- confusionMatrix(
  data = test$prediction_tree,   # predicted classes
  reference = test$target_low,   # true classes
  positive = "successful",
  mode = "sens_spec"
  )
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          468          298
  unsuccessful        260          544
                                          
               Accuracy : 0.6446          
                 95% CI : (0.6203, 0.6683)
    No Information Rate : 0.5363          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.2879          
                                          
 Mcnemar's Test P-Value : 0.1173          
                                          
            Sensitivity : 0.6429          
            Specificity : 0.6461          
         Pos Pred Value : 0.6110          
         Neg Pred Value : 0.6766          
             Prevalence : 0.4637          
         Detection Rate : 0.2981          
   Detection Prevalence : 0.4879          
      Balanced Accuracy : 0.6445          
                                          
       'Positive' Class : successful      
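
As a worked check of the reported Kappa: of the n = 1570 test observations, 766 (48.8%) are predicted “successful” while 728 (46.4%) actually are, so the expected random accuracy is \(Acc_0 = 0.488 \cdot 0.464 + 0.512 \cdot 0.536 \approx 0.501\), and

\[ \kappa = \frac{0.6446 - 0.501}{1 - 0.501} \approx 0.288, \]

matching the output above.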
                                          

ROC- and PR-Curve

  • Trade-off between the measures, which we can exploit by setting the cut-off point for predicted scores
  • Receiver Operating Characteristic (ROC) curve
    • Plot of sensitivity against 1-specificity
    • Sometimes high sensitivity is preferred over specificity (e.g., medical tests), or vice versa
  • Precision-Recall (PR) curve
    • Plot of precision against sensitivity (recall)
    • Might be preferred over ROC when categories are very imbalanced

Receiver Operating Characteristic

  • The trade-off through the setting of the cut-off point can be visualized in a ROC curve

Receiver Operating Characteristic

  • ROC criterion
    • There is a point on the curve where the losses in sensitivity and specificity are equal
    • This value can be used to determine an optimal cut-off point
  • Area Under the Curve (AUC)
    • The area underneath the ROC curve can be used as another performance measure
    • Different models achieve different sensitivity and specificity at the same cut-off point
    • The model with the highest AUC is best at holding the trade-off low

Cut-Off in R

# Column 1 of the probability matrix holds the predicted P(successful)
test$score_tree <- predict(
  tree,
  newdata = test,
  type = "prob"
  )[, 1]

# Classify as "successful" already at a score above 0.3 instead of 0.5
test <- test |>
  mutate(prediction_tree = as.factor(ifelse(
    score_tree > 0.3,
    "successful",
    "unsuccessful"
    )))

confusion <- confusionMatrix(
  data = test$prediction_tree, 
  reference = test$target_low,
  positive = "successful", 
  mode = "sens_spec"
  )
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          565          466
  unsuccessful        163          376
                                          
               Accuracy : 0.5994          
                 95% CI : (0.5746, 0.6237)
    No Information Rate : 0.5363          
    P-Value [Acc > NIR] : 2.776e-07       
                                          
                  Kappa : 0.2166          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7761          
            Specificity : 0.4466          
         Pos Pred Value : 0.5480          
         Neg Pred Value : 0.6976          
             Prevalence : 0.4637          
         Detection Rate : 0.3599          
   Detection Prevalence : 0.6567          
      Balanced Accuracy : 0.6113          
                                          
       'Positive' Class : successful      
                                          

ROC- and PR-Curve in R

library(precrec)   # evalmod() computes ROC and PR curves

# Column 2 holds P(unsuccessful); a uniform random score serves as a baseline
test$prediction_tree_scores <- predict(tree, test, type = "prob")[, 2]
test$prediction_random <- runif(n = nrow(test))

precrec_obj <- evalmod(
  scores = cbind(test$prediction_tree_scores, test$prediction_random),
  labels = cbind(test$target_low, test$target_low),
  modnames = c("classification tree", "random"),
  ties_method = "first"
  )
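
The resulting object can then be plotted with autoplot(precrec_obj) (precrec ships autoplot methods built on ggplot2), which draws the ROC and PR curves for both models.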

Interacting with the Cut-Off

The Infamous “AMS-Algorithmus” I

  • The AMAS has been criticized for its lack of transparency (among other things)
  • One of the few publicly available documents describes the use of a logit model (which, contrary to public belief, was never implemented) for this algorithm
  • Or rather, two logit models:
    • One predicting your short-term chance of labor market integration (assignment to group A if chances are high)
    • One predicting your long-term chance of labor market integration (assignment to group C if chances are low), given a set of demographic variables and your labor market history
  • The assignment to groups A, B, and C determines whether you are eligible for certain types of subsidies

The Infamous “AMS-Algorithmus” II

  • In the documentation, it is correctly stated that the cut-off point can be chosen in a way that balances sensitivity and specificity (ignoring the strange definitions of sensitivity and specificity used there)

The Infamous “AMS-Algorithmus” III

  • What is shown here (and what isn’t)?

The Infamous “AMS-Algorithmus” IV

  • The cut-off points were set (manually) at 25% for group C and 66% for group A
  • By setting the cut-off points low for group C and high for group A, they achieved high precision in those two classes
  • The precision in group B is not shown, nor are other measures that would help us judge the predictive performance

Cross Validation

The real magic behind supervised machine learning

The Problem of Overfitting

  • Models can achieve an extremely good in-sample fit through the use of many variables and high depth
  • Danger of underestimating the random error \(\sigma^2\) of the data-generating process
  • Results in a model with low in-sample prediction error but high out-of-sample error

The Solution to Overfitting

  • Split your data into a training data set and a test data set
  • The model is estimated using the training data, and performance measures are calculated using only the test data
  • Repeat the process for different parameter values (e.g., for the penalty term \(\lambda\)) and choose the value which optimizes some performance measure over the test data
  • This serves two purposes
    • As a performance measure independent of the sample where the model was fit
    • To choose hyperparameters such as \(\lambda\) (regularization)

Out of Sample Error

  • Prediction will always look better in the training data than out of sample
  • The bias–variance trade-off plays out over model complexity: as complexity grows, in-sample error keeps falling while out-of-sample error eventually rises again

Types of Cross Validation

  • Simple hold-out
  • k-fold cross-validation
  • LOOCV (leave-one-out cross-validation)
  • LGOCV (leave-group-out cross-validation)
  • OOB (out-of-bag samples)
  • Time series-specific cross-validation (e.g., day-forward chaining)
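
A minimal base-R sketch of the k-fold scheme (fit_model() and prediction_error() are hypothetical stand-ins for any model and performance measure):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment

cv_error <- sapply(1:k, function(i) {
  train_i <- data[folds != i, ]       # train on the other k-1 folds
  test_i  <- data[folds == i, ]       # hold out the i-th fold
  model   <- fit_model(train_i)       # hypothetical fitting function
  prediction_error(model, test_i)     # hypothetical performance measure
})

mean(cv_error)   # cross-validated estimate of the prediction error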

k-Fold Cross Validation

Properties of Cross Validation

  • Training and testing environment should closely reflect the prediction problem to get an accurate expectation of prediction error
  • Simple hold-out sample less suitable for optimization
  • k-fold cross-validation gives downward-biased estimate of performance because it only utilizes the fraction \(\frac{k-1}{k}\) for training
  • LOOCV gives approximately unbiased estimate of performance but has a higher variance
  • LGOCV can be meaningful if the prediction problem tries to infer information from one group to another (e.g., countries, groups of people, etc.)
  • Time-specific methods exclude future information in training

A Perfect Summary II

CV in R - Caret

  • caret: classification and regression training
  • trControl takes care of the cross-validation process
    • method here specifies type of CV
    • summaryFunction specifies computation of performance measures
  • tuneGrid chooses the parameters to try for the model
control <- trainControl(
  method = "repeatedcv",   # repeated k-fold cross-validation
  number = 10,             # k = 10 folds
  repeats = 10,            # repeat the full CV ten times
  savePredictions = TRUE,
  classProbs = TRUE,       # class probabilities are needed for ROC summaries
  summaryFunction = twoClassSummary
  )
tuning_grid <- expand.grid(
  cp = c(
    0.0005,
    0.001,
    0.005,
    0.05
    )
  )

Training the Model

  • metric chooses what performance measure you want to optimize
  • method specifies the model, which can be implemented in some other package
  • Function will automatically choose the parameters which work best in the specified training and test process
tree_caret <- train(
  target_low ~ days_unemployment_2j + age + days_to_last_job,
  data = train |> select(-train_index, -target_high),
  method = "rpart",        # fit a CART via rpart
  trControl = control,     # cross-validation setup from above
  tuneGrid = tuning_grid,  # cp values to evaluate
  metric = "ROC",          # optimize the area under the ROC curve
  na.action = na.pass
  )
CART 

4830 samples
   3 predictor
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 4346, 4348, 4347, 4347, 4347, 4347, ... 
Resampling results across tuning parameters:

  cp     ROC        Sens       Spec     
  5e-04  0.7173981  0.6741891  0.6543656
  1e-03  0.7169620  0.7072193  0.6465661
  5e-03  0.6996839  0.6867633  0.6737720
  5e-02  0.6612506  0.7439858  0.5740977

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 5e-04.

Extracting the Model

tree <- tree_caret$finalModel
rpart.plot(tree, box.palette = "RdBu", nn = FALSE, type = 2)

Predicting in Test Data

test$prediction_caret <- predict.train(
  tree_caret,
  newdata = test,
  type = "raw"   # return predicted classes
  )

confusion <- confusionMatrix(
  test$target_low,         # note: the true classes are passed as `data` here
  test$prediction_caret,   # ...and the predictions as `reference`; caret's
                           # convention is the reverse, so sensitivity/specificity
                           # and PPV/NPV are effectively swapped in the output
  positive = "successful",
  mode = "sens_spec"
  )
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          409          235
  unsuccessful        234          438
                                          
               Accuracy : 0.6436          
                 95% CI : (0.6171, 0.6695)
    No Information Rate : 0.5114          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.2869          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.6361          
            Specificity : 0.6508          
         Pos Pred Value : 0.6351          
         Neg Pred Value : 0.6518          
             Prevalence : 0.4886          
         Detection Rate : 0.3108          
   Detection Prevalence : 0.4894          
      Balanced Accuracy : 0.6434          
                                          
       'Positive' Class : successful      
                                          

Model Comparison

# P(unsuccessful) from the tuned model, to compare against the first tree
test$prediction_caret_scores <- predict.train(tree_caret, test, type = "prob")$unsuccessful

precrec_obj <- evalmod(
  scores = cbind(test$prediction_tree_scores, test$prediction_caret_scores),
  labels = cbind(test$target_low, test$target_low),
  modnames = c("classification tree", "classification tree (optimized)"),
  ties_method = "first"
  )