1. Naive Bayes

• Let's say we want to identify spam messages.
• We can use Bayes' theorem to express the probability that a message is spam given the words that appear in it:

$$P(\text{spam} \mid w) = \frac{P(\text{spam})\,P(w \mid \text{spam})}{P(\text{spam})\,P(w \mid \text{spam}) + P(\text{not spam})\,P(w \mid \text{not spam})} \tag{1}$$

• w represents the words of the message, drawn from the vocabulary V = {w_1, w_2, …, w_n}, i.e. the list of words that our model recognizes.
• P(spam) is the probability of seeing a spam message regardless of the words it contains.
• Note: P(spam) and P(not spam) are called the priors, and P(w|spam) and P(w|not spam) are called the likelihoods. The denominator is called the evidence, and P(spam|w) is called the posterior.
• Note: Since we're modeling the presence/absence of each word, this is called a Bernoulli model.
• To calculate the likelihood of a spam message that contains a particular word w_i and none of the other words in the vocabulary, we use the chain rule of probability:

$$\begin{aligned}
P(\{\neg w_1, \neg w_2, \dots, w_i, \dots, \neg w_n\} \mid \text{spam}) ={}& P(\neg w_n \mid \text{spam}) \cdot P(\neg w_{n-1} \mid \text{spam}, \neg w_n) \cdot P(\neg w_{n-2} \mid \text{spam}, \{\neg w_n, \neg w_{n-1}\}) \cdots \\
& P(w_i \mid \text{spam}, \{\neg w_n, \neg w_{n-1}, \dots, \neg w_{i+1}\}) \cdots \\
& P(\neg w_1 \mid \text{spam}, \{\neg w_n, \neg w_{n-1}, \dots, w_i, \dots, \neg w_2\})
\end{aligned} \tag{2}$$

• where ¬w indicates that word w does not appear in the message.
• If n is large, the number of calculations becomes very high, because each factor is conditioned on all the words before it. So we use the simplifying assumption that words are independent of each other given the class. Equation (2) then becomes:

$$P(\{\neg w_1, \neg w_2, \dots, w_i, \dots, \neg w_n\} \mid \text{spam}) = P(\neg w_n \mid \text{spam}) \cdot P(\neg w_{n-1} \mid \text{spam}) \cdots P(w_i \mid \text{spam}) \cdots P(\neg w_1 \mid \text{spam}) \tag{3}$$

• Note: The simplifying assumption can discard useful information, since some words tend to appear together in a sentence, e.g. London and England.
– Due to this simplifying assumption, the model is called Naive Bayes.
• Note: P(w_k|spam) is the probability of seeing word w_k in a spam message, i.e.

$$P(w_k \mid \text{spam}) = \frac{\text{no. of spam messages with word } w_k}{\text{total no. of spam messages}}$$

and similarly for P(¬w_k|spam) or P(w_k|not spam).
• Note: In practice, each word w_k is represented by a one-hot encoded vector.
• The Problem of Zero Probability: Since the probabilities of the individual words are multiplied together, if the probability of one word (or more) is 0, the entire product becomes 0.
– To solve this issue, Naive Bayes applies Laplace smoothing to every word in the vocabulary.

1.1. What is Laplace Smoothing?

• Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naive Bayes. Using Laplace smoothing, we can represent P(w_k|spam) as:

$$P(w_k \mid \text{spam}) = \frac{\text{no. of spam messages with word } w_k + \alpha}{N + 2\alpha}$$

• where:
– N is the total number of spam messages.
– 2 is the number of outcomes per word in the Bernoulli model (present or absent). In the multinomial (word-count) variant, the counts are word occurrences and this factor is replaced by the vocabulary size n, giving N + 𝛼·n.
– 𝛼 is the smoothing parameter.
* Using higher 𝛼 values pushes the likelihood towards 0.5, i.e. the probability of a word becomes 0.5 for both spam and not spam messages.
* This is not so useful, because the word then no longer helps to separate the two classes. In practice, 𝛼 = 1 is preferred.

1.2. How to Prepare the Data for Naive Bayes?

• Let's say we have the following message:
– "Hey, good point here - This is interesting."
• Here are the steps we follow to prepare this sentence:
– Remove extra white space
– Remove punctuation (here the comma and the full stop; the standalone "-" is dropped later as a non-alphabetic token)
– Tokenize (create a list of words/tokens) → ["Hey", "good", "point", "here", "-", "This", "is", "interesting"]
– Remove stop words (i.e. words that don't add much information) → ["Hey", "good", "point", "-", "interesting"]
– Remove non-alphabetic words → ["Hey", "good", "point", "interesting"]
– Stemming (i.e. removing the ending modifiers of words, leaving the stem of the word) → ["Hey", "good", "point", "interest"]
* Lemmatization: A more careful form of stemming which ensures that a proper lemma (dictionary form) results from removing the word modifiers.
* The problem with lemmatization is that it is often more expensive than stemming. So, with large data, you may want to go with stemming over lemmatization.

Stemming              Lemmatization
Studying → Study      Studying → Study
Studies → Studi       Studies → Study

• We can represent the words in terms of binary vectors of 0s and 1s. This is called vectorization.
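
The preparation steps, the binary vectorization, and the smoothed Bernoulli model from this section can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the stop-word list and the four training messages are made up for the example, stemming is left out, words are lowercased before matching against the vocabulary, and the likelihood uses the Bernoulli form of Laplace smoothing, (count + 𝛼) / (N + 2𝛼), which is the form that tends to 0.5 as 𝛼 grows. Log-probabilities are summed instead of multiplying raw probabilities, a common choice to avoid numerical underflow with large vocabularies.

```python
import math
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"here", "this", "is", "a", "the", "to", "at"}

def prepare(message):
    """Tokenize, drop punctuation and non-alphabetic tokens, remove stop words."""
    tokens = re.findall(r"[A-Za-z]+", message)  # keeps alphabetic runs only
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def vectorize(message, vocab):
    """Binary vector: 1 if the vocabulary word is present in the message, else 0."""
    present = {t.lower() for t in prepare(message)}
    return [1 if w in present else 0 for w in vocab]

def train(messages, labels, alpha=1.0):
    """Estimate class priors and Laplace-smoothed Bernoulli likelihoods."""
    vocab = sorted({t.lower() for m in messages for t in prepare(m)})
    X = [vectorize(m, vocab) for m in messages]
    prior, likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [x for x, y in zip(X, labels) if y == c]
        n_c = len(rows)  # N: number of messages in class c
        prior[c] = n_c / len(X)
        # P(w_k | c) = (messages of class c containing w_k + alpha) / (N + 2 * alpha)
        likelihood[c] = [(sum(col) + alpha) / (n_c + 2 * alpha) for col in zip(*rows)]
    return vocab, prior, likelihood

def posterior(model, message):
    """Return P(class | message) for every class using Bayes' theorem."""
    vocab, prior, likelihood = model
    x = vectorize(message, vocab)
    log_joint = {}
    for c in prior:
        total = math.log(prior[c])
        for present, p in zip(x, likelihood[c]):
            # Naive assumption: one independent factor per word, P(w) or P(not w).
            total += math.log(p if present else 1.0 - p)
        log_joint[c] = total
    evidence = sum(math.exp(v) for v in log_joint.values())
    return {c: math.exp(v) / evidence for c, v in log_joint.items()}

# Made-up toy data, for illustration only.
messages = ["Win money now", "Free money offer", "Meeting at noon", "Lunch at noon tomorrow"]
labels = ["spam", "spam", "not spam", "not spam"]

model = train(messages, labels, alpha=1.0)
print(prepare("Hey, good point here - This is interesting."))
print(posterior(model, "free money"))
```

With 𝛼 = 1, a word that never occurs in spam (such as "lunch" in the toy data) gets a likelihood of 1/4 rather than 0, so a single unseen word can no longer zero out the whole product.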