Naive Bayes
Table of Contents
1. Naive Bayes
1.1. What is Laplace Smoothing?
1.2. How to Prepare the Data for Naive Bayes?
1. Naive Bayes

Let's say we want to identify spam messages. We can use Bayes' theorem to formulate the probability that a message is spam based on the appearance of certain words in the message:

$$P(\text{spam} \mid w) = \frac{P(\text{spam}) \, P(w \mid \text{spam})}{P(\text{spam}) \, P(w \mid \text{spam}) + P(\text{not spam}) \, P(w \mid \text{not spam})} \tag{1}$$

Here $w$ represents the vocabulary, $V = \{w_1, w_2, \dots, w_n\}$, i.e. just a list of words that our model recognizes. $P(\text{spam})$ is the probability of seeing a spam message regardless of the words it contains.

Note: $P(\text{spam})$ is called the prior, and $P(w \mid \text{spam})$ and $P(w \mid \text{not spam})$ are called likelihoods. The denominator is called the evidence, and $P(\text{spam} \mid w)$ is called the posterior.

Note: Since we're modeling the presence/absence of a particular word, this is called a Bernoulli model.

To calculate the likelihood of a spam message that contains one particular word, $w_i$, and none of the other words in the vocabulary, we use the chain rule of probability:

$$P(\{\neg w_1, \neg w_2, \dots, w_i, \dots, \neg w_n\} \mid \text{spam}) = P(\neg w_n \mid \text{spam}) \cdot P(\neg w_{n-1} \mid \text{spam}, \neg w_n) \cdot P(\neg w_{n-2} \mid \text{spam}, \{\neg w_n, \neg w_{n-1}\}) \cdots P(w_i \mid \text{spam}, \{\neg w_n, \neg w_{n-1}, \dots, \neg w_{i+1}\}) \cdots P(\neg w_1 \mid \text{spam}, \{\neg w_n, \neg w_{n-1}, \dots, w_i, \dots, \neg w_2\}) \tag{2}$$

where $\neg w$ indicates that word $w$ does not appear in the message.

If $n$ is large, the number of conditional probabilities to estimate becomes very high. So we make the simplifying assumption that words are independent of each other given the class. Equation (2) then becomes

$$P(\{\neg w_1, \neg w_2, \dots, w_i, \dots, \neg w_n\} \mid \text{spam}) = P(\neg w_n \mid \text{spam}) \cdot P(\neg w_{n-1} \mid \text{spam}) \cdots P(w_i \mid \text{spam}) \cdots P(\neg w_1 \mid \text{spam}) \tag{3}$$

Note: The simplifying assumption can disregard some useful information, since some words are more likely to appear together in a sentence, e.g. London and England. Because of this simplifying assumption, the model is called Naive Bayes.

Note: $P(w_k \mid \text{spam})$ is the probability of seeing word $w_k$ in a spam message, i.e.

$$P(w_k \mid \text{spam}) = \frac{\text{no. of spam messages with word } w_k}{\text{total no. of spam messages}}$$

and similarly for $P(\neg w_k \mid \text{spam})$ or $P(w_k \mid \text{not spam})$.

Note: In practice, each word, $w_k$, is represented by a one-hot encoded vector.

The Problem of Zero Probability: Since the probabilities of the individual words are multiplied together, if the probability of one (or more) of the words is 0, the entire product becomes 0.
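The zero-probability effect can be reproduced with a minimal Python sketch of the Bernoulli model. The three-word vocabulary, the likelihood values, the prior, and the function names `bernoulli_likelihood` and `posterior_spam` are all hypothetical, chosen only for illustration; the code assumes Python 3.8+ for `math.prod`.

```python
from math import prod

def bernoulli_likelihood(present, p_word):
    """P(message | class) under the naive independence assumption.

    present[k] is True if vocabulary word k occurs in the message;
    p_word[k] is P(w_k | class), so an absent word contributes 1 - p_word[k].
    """
    return prod(p if x else 1.0 - p for x, p in zip(present, p_word))

def posterior_spam(present, prior_spam, p_word_spam, p_word_ham):
    """P(spam | message) from Bayes' theorem with two classes."""
    num = prior_spam * bernoulli_likelihood(present, p_word_spam)
    den = num + (1.0 - prior_spam) * bernoulli_likelihood(present, p_word_ham)
    return num / den

# Hypothetical vocabulary: ["free", "meeting", "winner"]
p_word_spam = [0.8, 0.1, 0.0]   # "winner" never occurred in the spam training data
p_word_ham = [0.1, 0.6, 0.05]
message = [True, False, True]   # contains "free" and "winner"

print(posterior_spam(message, 0.4, p_word_spam, p_word_ham))  # → 0.0
```

The single factor of 0 for "winner" wipes out the spam likelihood, so the posterior is 0 no matter how strongly the other words point to spam.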
To solve this issue, Naive Bayes applies Laplace smoothing to every word in the vocabulary.

1.1. What is Laplace Smoothing?

Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naive Bayes. Using Laplace smoothing, we can represent $P(w_k \mid \text{spam})$ as

$$P(w_k \mid \text{spam}) = \frac{\text{no. of spam messages with word } w_k + \alpha}{N + 2\alpha}$$

where:
* $N$ is the total number of spam messages.
* $\alpha$ is the smoothing parameter.
* The factor 2 counts the two possible outcomes of a word in the Bernoulli model (present or absent). In the multinomial (word-count) variant of Naive Bayes, the denominator is $N + \alpha \cdot n$ instead, where $N$ is the total word count in spam messages and $n$ is the vocabulary size.

Notes:
* Using higher $\alpha$ values will push the likelihood towards a value of 0.5, i.e. the probability of a word becomes 0.5 for both spam and not-spam messages.
* This is not very useful. In practice, it's preferred to use $\alpha = 1$.

1.2. How to Prepare the Data for Naive Bayes?

Let's say we have the following message:

"Hey, good point here - This is interesting."

Here are the steps we take to prepare this sentence:
* Remove white space
* Remove punctuation
* Tokenize (create a list of words/tokens) → ["Hey", "good", "point", "here", "-", "This", "is", "interesting"]
* Remove stop words (i.e. words that don't add much information) → ["Hey", "good", "point", "-", "interesting"]
* Remove non-alphabetic words → ["Hey", "good", "point", "interesting"]
* Stem (i.e. remove the ending modifiers of words, leaving the stem of the word) → ["Hey", "good", "point", "interest"]

Notes on stemming:
* Lemmatization is a more careful form of stemming that ensures a proper lemma results from removing the word modifiers.
* The problem with lemmatization is that it is often more expensive than stemming. So, with large data, you may want to go with stemming over lemmatization.

We can then represent the words as binary vectors of 0s and 1s. This is called vectorization.
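The preparation steps, the binary vectorization, and the smoothed likelihood estimate can be tied together in a short Python sketch. Everything named here is an assumption made for illustration: the tiny `STOP_WORDS` set, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), the example vocabulary, and the second training vector. The smoothing uses the Bernoulli form, with $2\alpha$ in the denominator for the two outcomes (word present or absent).

```python
STOP_WORDS = {"here", "this", "is"}   # tiny illustrative list, not a real stop-word lexicon
SUFFIXES = ("ing", "ed", "s")         # crude stand-in for a real stemming algorithm

def stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(message):
    tokens = message.split()                                      # tokenize; extra white space is dropped
    tokens = [t.strip(".,!?;:\"'") for t in tokens]                # remove surrounding punctuation
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]   # remove stop words
    tokens = [t for t in tokens if t.isalpha()]                    # remove non-alphabetic tokens such as "-"
    return [stem(t) for t in tokens]                               # stemming

def vectorize(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the message, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def smoothed_likelihoods(vectors, alpha=1.0):
    """Estimate P(w_k | class) from the binary vectors of one class."""
    n_messages = len(vectors)
    counts = [sum(column) for column in zip(*vectors)]   # messages containing each word
    return [(c + alpha) / (n_messages + 2 * alpha) for c in counts]

vocabulary = ["Hey", "good", "point", "interest", "free"]   # hypothetical vocabulary
tokens = preprocess("Hey, good point here - This is interesting.")
vector = vectorize(tokens, vocabulary)
print(tokens)   # → ['Hey', 'good', 'point', 'interest']
print(vector)   # → [1, 1, 1, 1, 0]

# Two hypothetical messages of the same class; "free" never occurs in either.
print(smoothed_likelihoods([vector, [0, 1, 0, 0, 0]]))   # → [0.5, 0.75, 0.5, 0.5, 0.25]
```

With $\alpha = 1$, the word "free" gets a likelihood of 0.25 rather than 0, so it can no longer zero out the product. In a real pipeline, a library stemmer or lemmatizer and a curated stop-word list would replace the toy versions used here.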