Introduction

Entropy is used for a lot of things in data science. For example, entropy can be used to build classification trees. Entropy is also the basis of something called Mutual Information, which quantifies the relationship between two things. Entropy is the basis of Relative Entropy (also known as The Kullback-Leibler Distance) and cross entropy, which show up all over the place, including fancy dimension reduction algorithms like t-SNE and UMAP.

What these three things have in common is that they all use entropy, or something derived from it, to quantify similarities and differences.

In order to talk about Entropy, first we have to understand Surprise.

Imagine we have two types of chickens, orange and blue. Instead of just letting them randomly roam all over the screen, say, we organize them into three separate areas, A, B, and C.

Now, if you randomly pick up a chicken from A, because there are more orange chickens, it’s more likely to pick orange over blue, thus, it’d not be very surprising to pick an orange chicken. In contrast, it’d be relatively surprising if a blue chicken is picked up. Similar opposite situation for the area B. Area C has equal number of orange and blue chickens. Thus, regardless of what color of chicken we pick up, we’d be equally surprised.

Surprise is, in some way, inversely related to probability. In other words, when the probability is low, the surprise is high and visa versa.

Let’s see how to quantify surprise?

Because we know there is a type of inverse relationship between probability and surprise, it’s tempting to just use the inverse of probability to calculate surprise,

$\text{surprise}=\frac{1}{\small{\text{probability}}}$

There’s at least one problem with just using the inverse of of the probability to calculate surprise. Let’s examine this problem through an example fo surprise in flipping a coin.

Imagine we had a terrible coin that always come up heads, i.e. $\text{prob(heads) = 1}$ . That means coming heads is no surprise. Now, if we take the inverse of that, it’ll be also $1$ , whereas we expected surprise to be $0$ .
Now, if instead of inverse we use the log of inverse, then we’ll get $0$ for such cases.

Surprise is the log of the inverse of the probability.

$\text{surprise}=\log{(\frac{1}{\small{\text{probability}}})}$

Note: When calculating surprise for 2 outputs, it is customary to use the log base 2 for the calculations.

Note: For calculating the suprise of multiple events, we just need to add up their individual surprises. For instance, the surprise for getting 2 heads and 1 tails in 3 coin flips is the $\text{surprise(heads) + surprise(heads) + surprise(tails)}$ .

Now, let’s say we have coin with $\small{\textbf{prob(heads)=0.9 and prob(tails)=0.1}}$ .

	Heads	Tails
$\text{\small{probability:} p(x)}$	0.9	0.1
$\text{\small{surprise:}} \log_2 {\frac{1}{p}}$	0.15	3.32

How do we estimate the total surprise after flipping the coin 100 times?

We approximate how many times we will get heads by multiplying its probability by 100,
We estimate the total surprise from getting 100 heads by multiplying it by $0.15$ .

Likewise, we can get the estimate from getting 100 tails.

If we divide everything by the number of coin tosses, we get average amount of surprise per coin toss,

This average suprise is the entropy of the coin, i.e. the expected surprise every time we flip the coin.

Entropy is the expected value of surprise.

Note: Number of coin toss really doesn’t matter here, since it gets cancelled out.

$\text{Entropy} = \sum x P(X=x)$

Where,

$x \rightarrow \text{specific value for }\textbf{surprise} \rightarrow \log_2 \frac{1}{p(x)}$
$P(X=x) \rightarrow \text{the probability of observing that specific value for }\textbf{surprise} \rightarrow p(x)$

Therefore, the equation for entropy is

$\text{Entropy} = \sum \log(\frac{1}{p(x)}) p(x)$

Note: Unfortunately, even though this equation is made from two relatively easy to interpret terms, i.e. surprise $\times$ prob(surprise), this isn’t the standard form of the equation for the entropy. The well-known equation is just the simplified version of it,

$\text{Entropy} = -\sum p(x)\log(p(x))$

This is entropy equation that Claude Shannon first published in 1948.

Now, going back to our chicken example, if we calculate the entropy for A, B, and C, we’ll see that entropy is highest when we have the same number of both types of chickens, and as we increase the difference in the number of orange and blue chickens, we lower the entropy.