
Model Performance Evaluation

Table of Contents

1. Performance
1.1. Confusion Matrix
1.2. Can we control the sensitivity and specificity tradeoff?
1.3. Cross Validation
1.4. Receiver Operator Characteristic (ROC) curve
1.5. Hyperparameter Tuning
1.6. Types of Cross Validation
1.7. Steps of ML Modeling

1. Performance

• Continuing from section ?, let's say we want to measure how good the spam detection model is.
  – We can do so by choosing a cutoff point (decision point), say 0.5, and predicting spam when the predicted probability is > 0.5 and not-spam otherwise.
  – Then we can calculate the accuracy metric, which is simply the number of correctly predicted examples divided by the total number of examples.
• Note: Accuracy is not a good metric for measuring the performance of the spam model because 93% of the data is not-spam → imbalanced dataset. By just predicting 0 (not-spam) for all the examples, we'd get 93% accuracy.
• Note: One problem, particularly with imbalanced data, is that we often care more about performance on the minority class, which in this case means predicting spam examples correctly.
  – There are two ways the model can make an incorrect prediction:
    * False Positive → predicting spam when it's actually not-spam.
    * False Negative → predicting not-spam when it's actually spam.
    * The other cases are called True Positive → predicting spam when it's actually spam, and True Negative → predicting not-spam when it's actually not-spam.

1.1. Confusion Matrix

• We can summarize all the above in something called a confusion matrix.

	Actual Positive	Actual Negative
Predicted Positive	True Positive (TP)	False Positive (FP)
Predicted Negative	False Negative (FN)	True Negative (TN)
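The four cells of the table can be counted directly from predicted probabilities and true labels. The sketch below is only an illustration: the labels and probabilities are made up, and the helper names `confusion_counts` and `accuracy` are hypothetical, not part of the original notes. It also shows why accuracy alone can mislead on imbalanced data.

```python
def confusion_counts(y_true, y_prob, cutoff=0.5):
    """Count (TP, FP, FN, TN), treating 1 (spam) as the positive class."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Correct predictions divided by the total number of examples."""
    return (tp + tn) / (tp + fp + fn + tn)


# Made-up data for illustration: 1 = spam, 0 = not-spam.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.8, 0.45, 0.2]

counts = confusion_counts(y_true, y_prob)
print(counts, accuracy(*counts))      # (2, 1, 1, 6) 0.8

# Baseline that always predicts not-spam (probability 0 for every example).
baseline = confusion_counts(y_true, [0.0] * len(y_true))
print(baseline, accuracy(*baseline))  # (0, 0, 3, 7) 0.7
```

On this toy data the always-not-spam baseline still scores 0.7 accuracy while catching no spam at all (TP = 0), which is the weakness of accuracy described above.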

• Sensitivity = $\frac{TP}{TP + FN}$ → the model's ability to correctly classify spam messages (or positive cases). Higher Sensitivity → fewer False Negatives.
• Specificity = $\frac{TN}{TN + FP}$ → the classifier's ability to correctly classify not-spam messages (or negative cases). Higher Specificity → fewer False Positives.
  – Note: In the case of the spam detection model, we'd prefer higher specificity so that an important message isn't falsely classified as spam.
  – Note: In some other problems such as cancer detection, we'd prefer higher sensitivity because we want as few false negatives as possible.
• Precision = $\frac{TP}{TP + FP}$ → the fraction of predicted positives that are actually positive.
• $F_{1}$ Score = $\frac{2 \cdot (sensitivity \times precision)}{sensitivity + precision}$ → the harmonic mean of sensitivity and precision.

1.2. Can we control the sensitivity and specificity tradeoff?

• Higher Sensitivity → fewer FN
• Higher Specificity → fewer FP
• We can change the tradeoff by changing the cutoff point. Lowering the cutoff labels more examples as spam, which raises sensitivity and lowers specificity; raising the cutoff does the opposite.

1.3. Cross Validation

• To test the performance of our model, we usually split the data into three parts:
  – Training set
  – Validation set
  – Test set
• The validation set gives us the opportunity to tune our model without using the test set itself.
• We use the test set only for evaluating our model's performance on unseen examples.

1.4. Receiver Operator Characteristic (ROC) curve

• The ROC curve plots sensitivity on one axis and 1 - specificity on the other axis.
• As we tune our model on the validation set, we can plot the sensitivity and specificity that each cutoff threshold produces.
  – The 45° line is where sensitivity equals 1 - specificity: any gain in correctly classified positives is matched by an equal increase in the rate of misclassified negatives, which is what random guessing produces.
  – Every useful model should lie above the 45° line.
  – To obtain a good balance between specificity and sensitivity, we ought to pick a threshold that maximizes the distance from the 45° line.
• To compare different models, we use the Area Under the Curve (AUC) of the ROC. The model with the higher AUC is generally the better predictor across cutoff thresholds.

1.5. Hyperparameter Tuning

• Hyperparameters are settings that go along with the model but are not learned during training; we set them ourselves and tune them using the validation set.

1.6. Types of Cross Validation

• Hold-out Validation → We assign a subset of examples to be our validation set.
• K-fold Validation → We split the data into $k$ folds and train $k$ different models, using a different fold as the validation set each time.
• Leave-One-Out Validation → k-fold validation with $k = n$, where $n$ is the number of examples → mostly used when we have a small amount of data.

1.7. Steps of ML Modeling

1. Problem
2. Hypothesis
3. Simple Heuristic
4. Measure Impact
5. More Complex Technique
6. Measure Impact
7. Tune Model
8. Replace Existing Technique
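As a recap of the "Measure Impact" and "Tune Model" steps, the sketch below sweeps the cutoff point, records one ROC point per cutoff, approximates the AUC with the trapezoidal rule, and picks the cutoff that lies furthest above the 45° line. Several things here are assumptions rather than part of the original notes: the data is made up, the helper names `rates`, `roc_points`, and `auc` are hypothetical, and "distance from the 45° line" is read as sensitivity minus (1 - specificity), a quantity often called Youden's J.

```python
def rates(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) at a cutoff; 1 is the positive class."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for a, p in pairs if a == 1 and p > cutoff)
    fn = sum(1 for a, p in pairs if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in pairs if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in pairs if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, y_prob, cutoffs):
    """One (1 - specificity, sensitivity, cutoff) tuple per cutoff, sorted by x."""
    points = []
    for c in cutoffs:
        sens, spec = rates(y_true, y_prob, c)
        points.append((1 - spec, sens, c))
    return sorted(points)


def auc(points):
    """Trapezoidal area under ROC points that are already sorted by x."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up validation data for illustration: 1 = spam, 0 = not-spam.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.8, 0.45, 0.2]

cutoffs = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
points = roc_points(y_true, y_prob, cutoffs)

# Cutoff whose ROC point lies furthest above the 45-degree line.
best_x, best_y, best_cutoff = max(points, key=lambda pt: pt[1] - pt[0])

print("AUC:", round(auc(points), 3))
print("best cutoff:", best_cutoff)
```

The cutoff grid is coarse on purpose; a finer grid, or one cutoff per distinct predicted probability, traces the curve more faithfully on real data. The cutoff should be chosen on the validation set, with the test set kept for the final check.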