Model Performance Evaluation
Table of Contents
1. Performance
1.1. Confusion Matrix
1.2. Can we control the sensitivity and specificity tradeoff?
1.3. Cross Validation
1.4. Receiver Operator Characteristic (ROC) curve
1.5. Hyperparameter Tuning
1.6. Types of Cross Validation
1.7. Steps of ML Modeling
1. Performance

Continuing from section ?, let's say we want to measure how good the spam detection model is.

We can do so by choosing a cutoff point (decision point), say 0.5, and predicting spam when the predicted probability is > 0.5 and not-spam otherwise. Then we can calculate the accuracy metric, which is simply the number of correctly predicted examples divided by the total number of examples.

Note: Accuracy is not a good metric for the spam model because 93% of the data is not-spam (an imbalanced dataset). By just predicting 0 (not-spam) for all the examples, we'd already get 93% accuracy.

Note: One problem, particularly with imbalanced data, is that we often care more about the performance on the minority class, which in this case means predicting spam examples correctly.

There are two ways the model could misclassify a message:
* False Positive → predicting spam when it's actually not-spam.
* False Negative → predicting not-spam when it's actually spam.

The other two cases are the correct predictions:
* True Positive → predicting spam when it's actually spam.
* True Negative → predicting not-spam when it's actually not-spam.

1.1. Confusion Matrix

We can summarize all the above in something called a confusion matrix.
                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)
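
As a minimal sketch of how the four cells of the matrix are counted, the snippet below applies a cutoff to predicted probabilities and tallies TP, FP, FN and TN. The function names and the toy data (100 messages, 7 of them spam, mirroring the 93% not-spam split mentioned earlier) are made up for illustration. It also shows why accuracy alone is misleading here: a model that never predicts spam still scores 0.93.

```python
def confusion_counts(y_true, y_prob, cutoff=0.5):
    """Return (tp, fp, fn, tn), treating label 1 (spam) as the positive class."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > cutoff else 0  # spam only when probability > cutoff
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Correct predictions divided by the total number of examples."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical toy data: 7 spam messages (1) and 93 not-spam messages (0).
y_true = [1] * 7 + [0] * 93
# A "model" that assigns probability 0.0 to everything, so it never predicts spam.
always_not_spam = [0.0] * 100

counts = confusion_counts(y_true, always_not_spam)
print(counts)             # (0, 0, 7, 93)
print(accuracy(*counts))  # 0.93
```

All 7 spam messages end up as false negatives, which the accuracy figure hides but the confusion matrix makes visible.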

$\text{Sensitivity} = \frac{TP}{TP + FN}$ → the model's ability to correctly classify spam messages (the positive cases). Higher sensitivity → fewer False Negatives.

$\text{Specificity} = \frac{TN}{TN + FP}$ → the classifier's ability to correctly classify not-spam messages (the negative cases). Higher specificity → fewer False Positives.

Note: In the case of the spam detection model, we'd prefer higher specificity so that an important message isn't falsely classified as spam.

Note: In some other problems, such as cancer detection, we'd prefer higher sensitivity because we want as few false negatives as possible.

$\text{Precision} = \frac{TP}{TP + FP}$ → the fraction of predicted positives that are actually positive.

$F_1 = 2 \cdot \frac{\text{sensitivity} \times \text{precision}}{\text{sensitivity} + \text{precision}}$ → the harmonic mean of sensitivity and precision.

1.2. Can we control the sensitivity and specificity tradeoff?

Higher Sensitivity → fewer FN
Higher Specificity → fewer FP

We can change the tradeoff by changing the cutoff point. Lowering the cutoff labels more messages as spam, which raises sensitivity and lowers specificity; raising the cutoff does the opposite.

1.3. Cross Validation

To test the performance of our model, we usually split the data into three parts:
* Training set
* Validation set
* Test set

The validation set gives us the opportunity to tune our model without using the test set itself. We use the test set only for evaluating our model's performance on unseen examples.

1.4. Receiver Operator Characteristic (ROC) curve

The ROC curve plots sensitivity on one axis against 1 - specificity on the other axis. As we tune our model on the validation set, we can plot the sensitivity and specificity that each cutoff threshold produces.

The 45° line shows that for every positive example that we correctly classify, we also incorrectly classify a negative example, which is what random guessing would give. The goal for every model should be to always lie above, that is, be better than, the 45° line.

To obtain a good balance between specificity and sensitivity, we ought to pick a threshold that maximizes the distance from the 45° line.

In order to compare different models, we use the Area Under the Curve (AUC) of the ROC.
Whichever model has the higher AUC is the one we consider the better predictor overall.

1.5. Hyperparameter Tuning

Hyperparameters are settings that go along with the model but are not learned during training; they are chosen beforehand and tuned, typically on the validation set.

1.6. Types of Cross Validation

* Hold-out Validation → We assign a subset of examples to be our validation set.
* K-fold Validation → We split the data into k folds, train k different models, and use a different fold as the validation set each time.
* Leave-One-Out Validation → K-fold validation with k = n, where n is the number of examples. It is used mostly when we have a small amount of data.

1.7. Steps of ML Modeling

1. Problem
2. Hypothesis
3. Simple Heuristic
4. Measure Impact
5. More Complex Technique
6. Measure Impact
7. Tune Model
8. Replace Existing Technique
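
To make sections 1.2 and 1.4 concrete, here is a minimal sketch that sweeps the cutoff over a small set of predicted probabilities, collects the resulting (1 - specificity, sensitivity) points, and estimates the AUC with the trapezoidal rule. The function names and the eight toy examples are assumptions made for illustration. The sketch also assumes the data contains at least one positive and one negative example, and that all probabilities are non-negative.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Sensitivity and specificity when predicting positive for prob > cutoff."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for a, p in pairs if a == 1 and p > cutoff)
    fn = sum(1 for a, p in pairs if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in pairs if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in pairs if a == 0 and p > cutoff)
    # Assumes both classes are present, so neither denominator is zero.
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, y_prob):
    """Sorted (1 - specificity, sensitivity) points, one per candidate cutoff."""
    # A cutoff of -1.0 sits below every probability, so everything is predicted
    # positive; the largest probability as cutoff predicts nothing positive.
    cutoffs = [-1.0] + sorted(set(y_prob))
    points = set()
    for cutoff in cutoffs:
        sens, spec = sensitivity_specificity(y_true, y_prob, cutoff)
        points.add((1 - spec, sens))
    return sorted(points)


def auc(points):
    """Area under the piecewise-linear curve through the sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 4 positives and 4 negatives with made-up scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(sensitivity_specificity(y_true, y_prob, 0.5))  # (0.75, 0.75)
print(auc(roc_points(y_true, y_prob)))               # 0.8125
```

An AUC of 0.5 corresponds to the 45° line and 1.0 to a perfect ranking, so the 0.8125 here sits between the two. In practice a library routine would normally replace this hand-rolled version.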