# n-gram, entropy and entropy rate

$n$-gram models find use in many areas of computer science, but are often only explained in the context of natural language processing (NLP). In this post, I will show the relation to information theory by explicitly calculating the entropy and entropy rate of unigrams, bigrams and general $n$-grams.

## Overview

$n$-grams split texts into contiguous sequences of $n$ words or letters and assign a probability to each sequence. When $n \geq 2$, we obtain Markov models of order $n-1$.

In NLP, $n$-grams are used to predict the next letter/word based on the previous $n-1$ letters/words. More formally, we can write

$$p_{X_i \mid X_{i-n+1}, \dots, X_{i-1}}(\mathbf{x} ; \boldsymbol{\theta}) = \prod_{j=1}^{k} \theta_j^{x_j}$$

where each $X_i$ indicates a letter, $\boldsymbol{\theta}$ is a fixed parameter vector and $\mathbf{x} \in \{0, 1\}^k$ is the one-hot encoding of the next letter. In practice, $\boldsymbol{\theta}$ is not known and has to be estimated. We will use here the maximum likelihood estimate, which is simply $\frac{\text{frequency of the letter}}{\text{total number of letters}}$.

Example: Let the text be “abac”. Given that the last letter was “a”, what is the probability that the next letter is “b”?

The parameter is $\boldsymbol{\theta}_a = \left(0, \frac{1}{2}, \frac{1}{2}\right)$ (over the alphabet $\{a, b, c\}$) and the input is $\mathbf{x} = (0, 1, 0)$, the one-hot encoding of “b”.

Then the probability is $p_{X_2 \mid X_{1}}(\mathbf{x} ; \boldsymbol{\theta}_a) = \left(\frac{1}{2}\right)^{1} \cdot \left(\frac{1}{2}\right)^0 = \frac{1}{2}$.

This formal notation based on the categorical distribution can get a bit cumbersome. From now on, I will write the input to the PMFs simply as letters e.g. $p_{X_2 \mid X_{1}}(b \mid a) = \frac{1}{2}$.
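To make the estimation concrete, here is a short sketch (my own, not from the original post) that computes the maximum likelihood estimate for the “abac” example; the helper name `bigram_mle` is hypothetical:

```python
from collections import Counter

def bigram_mle(text, prev):
    """Estimate p(next letter | prev) by counting letter pairs in the text."""
    followers = Counter(b for a, b in zip(text, text[1:]) if a == prev)
    total = sum(followers.values())
    return {letter: count / total for letter, count in followers.items()}

# In "abac", the letter "a" is followed once by "b" and once by "c"
theta_a = bigram_mle("abac", "a")
print(theta_a)  # p(b | a) = p(c | a) = 1/2
```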

## Unigram

### Unigram

Unigrams (also called $1$-grams or bag-of-words models) are a special case of $n$-grams where each sequence consists of a single word or letter. Unigrams depend only on the current letter.

A unigram does the following approximation:

$$p_{X_1, \dots, X_n}(x_1, \dots, x_n) \approx \prod_{i=1}^{n} p_{X_i}(x_i)$$

Hence, it is assumed that $X_1, \dots, X_n$ are all i.i.d. random variables.

Example: umulmundumulmum$

| Letter | Frequency | Probability |
| ------ | --------- | ----------- |
| u      | 6         | 6/16        |
| m      | 5         | 5/16        |
| l      | 2         | 2/16        |
| n      | 1         | 1/16        |
| d      | 1         | 1/16        |
| $      | 1         | 1/16        |

In the example, we calculated the probability by dividing the frequency of a single letter by the length of the text.
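This frequency count is easy to reproduce; a minimal sketch in Python (my own, not the post's code) using `collections.Counter`:

```python
from collections import Counter

text = "umulmundumulmum$"
counts = Counter(text)

# Maximum likelihood estimate: frequency of a letter / length of the text
probabilities = {letter: count / len(text) for letter, count in counts.items()}
print(probabilities["u"], probabilities["m"])  # 0.375 0.3125
```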

The unigrams can be used to calculate the probability of certain letters occurring, e.g. $p_{X_1, X_2}(u, m) \approx p_{X_1}(u) \cdot p_{X_2}(m) = \frac{6}{16} \cdot \frac{5}{16} \approx 0.1172$.

Note that i.i.d implies stationarity.

### Entropy

The most common definition of entropy is the following:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

For example, the entropy of the word “umulmundumulmum$” is:

$$H(X) = -\left(\frac{6}{16} \log_2 \frac{6}{16} + \frac{5}{16} \log_2 \frac{5}{16} + \frac{2}{16} \log_2 \frac{2}{16} + 3 \cdot \frac{1}{16} \log_2 \frac{1}{16}\right) \approx 2.1800$$

In data compression, there exists another definition called $0$-th order empirical entropy:

$$H_0(T) = \sum_{c \in \Sigma} \frac{n_c}{n} \log_2 \frac{n}{n_c}$$

where $T$ is the text of length $n$, $\Sigma$ is its alphabet and $n_c$ is the number of occurrences of the letter $c$ in $T$. Empirical entropy makes no direct probabilistic assumptions and requires a text-like structure (i.e. where frequency estimates make sense). $H(X)$ is in comparison more general and one needs to specify the probability distribution.

In our example, the variables for $H_0(T)$ are $n = 16$, $n_u = 6$, $n_m = 5$, $n_l = 2$ and $n_n = n_d = n_{\$} = 1$. Both definitions give the same result.

### Entropy rate

For a unigram, the entropy rate is defined as follows:

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n) = H(X_1)$$

Since $X_1, \dots, X_n$ are i.i.d. random variables, it suffices to calculate the entropy of a single $X_i$. The same is true for our example: we can simply calculate one of the random variables, i.e. $H(X_{16}) = \cdots = H(X_1) \approx 2.1800$.

## Bigram

### Bigram

In this section, we will look at bigrams, which are $n$-grams with $n = 2$. Bigrams depend on the previous letter.

A bigram does the following approximation:

$$p_{X_1, \dots, X_n}(x_1, \dots, x_n) \approx p_{X_1}(x_1) \prod_{i=2}^{n} p_{X_i \mid X_{i-1}}(x_i \mid x_{i-1})$$

Hence, bigrams satisfy the Markov property. We previously assumed i.i.d. random variables. Now, we need stationarity (also known as time-homogeneity in the context of Markov chains):

$$p_{X_{i+1} \mid X_i}(b \mid a) = p_{X_2 \mid X_1}(b \mid a) \quad \text{for all } i$$

Example: umulmundumulmum$

| Letters | Frequency | Conditional probability |
| ------- | --------- | ----------------------- |
| mu      | 4         | 4/5                     |
| um      | 3         | 3/6                     |
| ul      | 2         | 2/6                     |
| lm      | 2         | 2/2                     |
| un      | 1         | 1/6                     |
| nd      | 1         | 1/1                     |
| du      | 1         | 1/1                     |
| m$      | 1         | 1/5                     |

In the example, the frequency was calculated by generating all contiguous sequences of two letters in the text. All sequences with the same initial letter in the table (e.g. the “m” of “mu”) form a probability distribution.

Let us calculate the probability of some letters:

$$p_{X_1, X_2}(u, m) \approx p_{X_1}(u) \cdot p_{X_2 \mid X_1}(m \mid u) = \frac{6}{16} \cdot \frac{3}{6} = \frac{3}{16}$$

Compare this with the i.i.d. assumption from the unigram. By using bigrams, we can take into account the context in which a letter occurs. As a result, unigrams and bigrams assign different probabilities.

Since we satisfy the Markov property, it is easy to visualize the bigrams. Let us draw the Markov chain of the underlying stochastic process.

When a Markov chain is time-homogeneous, irreducible and aperiodic, there exists a unique stationary distribution $\boldsymbol{\pi}$. The chain is aperiodic, because its cycle lengths satisfy $\gcd(2, 3) = 1$. However, it is not yet irreducible, since “$” has no outgoing transition. By adding a transition from “$” to “u”, we can also make it irreducible.
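The original post renders the chain with a small script that is not reproduced here; as a plain-Python sketch of my own (no plotting), the edges of the chain can be derived directly from the text:

```python
from collections import Counter

text = "umulmundumulmum$"

pair_counts = Counter(zip(text, text[1:]))   # contiguous letter pairs
context_totals = Counter(text[:-1])          # how often each letter has a successor

# Conditional probabilities p(b | a), i.e. the edges of the Markov chain
transitions = {}
for (a, b), count in pair_counts.items():
    transitions.setdefault(a, {})[b] = count / context_totals[a]

for a in sorted(transitions):
    for b, p in sorted(transitions[a].items()):
        print(f"{a} -> {b}: {p:.4f}")
```

Note that “$” does not appear as a source state, which is exactly the missing irreducibility discussed above.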

### Entropy

The $1$-st order entropy can be defined as follows:

$$H(X_2 \mid X_1) = -\sum_{a \in \Sigma} p_{X_1}(a) \sum_{b \in \Sigma} p_{X_2 \mid X_1}(b \mid a) \log_2 p_{X_2 \mid X_1}(b \mid a)$$

We can apply this formula on our example:

$$H(X_2 \mid X_1) \approx \frac{6}{16} \cdot 1.4591 + \frac{5}{16} \cdot 0.7219 \approx 0.7728$$

(The letters “l”, “n” and “d” have deterministic successors and contribute $0$; “$” has no successor.)
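As a numerical cross-check (a sketch of my own, not the post's code), the same value can be computed directly from the counts:

```python
from collections import Counter
from math import log2

text = "umulmundumulmum$"
n = len(text)

unigrams = Counter(text)                 # p(a) = count(a) / n
pairs = Counter(zip(text, text[1:]))     # counts of contiguous letter pairs

h = 0.0
for a, count_a in unigrams.items():
    follower_counts = [c for (x, _), c in pairs.items() if x == a]
    total = sum(follower_counts)
    if total == 0:  # "$" is the last letter and has no successor
        continue
    # Entropy of the successor distribution of "a", weighted by p(a)
    h_a = -sum((c / total) * log2(c / total) for c in follower_counts)
    h += (count_a / n) * h_a

print(round(h, 4))  # 0.7728
```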

In data compression, the following definition is more common:

$$H_1(T) = \frac{1}{n} \sum_{w \in \Sigma} |T_w| \, H_0(T_w)$$

where $T_w$ is the concatenation of the letters that immediately follow the occurrences of $w$ in $T$.

For example, when $w = u$: the letters following the occurrences of “u” are $T_u = \text{“mlnmlm”}$, so $|T_u| = 6$ and

$$H_0(T_u) = \frac{3}{6} \log_2 \frac{6}{3} + \frac{2}{6} \log_2 \frac{6}{2} + \frac{1}{6} \log_2 \frac{6}{1} \approx 1.4591$$

Again both definitions give the same result.

### Entropy rate

The entropy rate of a bigram is:

$$H(\mathcal{X}) = H(X_2 \mid X_1) = -\sum_{i, j} \pi_i P_{ij} \log_2 P_{ij}$$

where $\mathbf{P}$ is the transition matrix of the time-homogeneous Markov chain and $\boldsymbol{\pi}$ is the corresponding stationary distribution.

Since the stochastic process is a stationary Markov chain, the entropy rate is again just a conditional entropy. In the example, $H(X_2 \mid X_{1}) \approx 0.7728$.

To verify this result, let us add the transition $(\$, u, 1)$ to our Markov chain. Now, $\boldsymbol{\pi}$ can be calculated:
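The original snippet is not shown; a NumPy sketch along those lines (the state ordering $(u, m, l, n, d, \$)$ is my choice) could look like this:

```python
import numpy as np

states = ["u", "m", "l", "n", "d", "$"]

# Transition matrix P, including the added transition "$" -> "u";
# row i contains the probabilities of leaving states[i]
P = np.array([
    [0.0, 3/6, 2/6, 1/6, 0.0, 0.0],  # u
    [4/5, 0.0, 0.0, 0.0, 0.0, 1/5],  # m
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # l
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # n
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # d
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # $
])

# Left eigenvectors of P are the right eigenvectors of P.T
eigenvalues, eigenvectors = np.linalg.eig(P.T)

# The largest eigenvalue of a stochastic matrix is 1; its eigenvector is pi
pi = np.real(eigenvectors[:, np.argmax(eigenvalues.real)])
pi /= pi.sum()  # normalize so the entries sum to 1

print(dict(zip(states, np.round(pi, 4))))
```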

The code calculates the left eigenvectors and chooses the one with the highest eigenvalue as $\boldsymbol{\pi}$. This highest eigenvalue will be $1,$ because the matrix $\mathbf{P}$ is stochastic.

We obtain $\boldsymbol{\pi} = (0.375, 0.3125, 0.125, 0.0625, 0.0625, 0.0625)$ for the states $(u, m, l, n, d, \$)$.

Note that $0.375 = \frac{6}{16}$ and $0.3125 = \frac{5}{16}$. Compare these values with our previous entropy calculation. Hence, we found the same result.

## n-gram

### n-gram

So far, we have only considered $n$-grams with $n = 1$ or $n = 2$, but $n$ can be any number:

$$p_{X_1, \dots, X_m}(x_1, \dots, x_m) \approx \prod_{i=1}^{m} p_{X_i \mid X_{i-n+1}, \dots, X_{i-1}}(x_i \mid x_{i-n+1}, \dots, x_{i-1})$$

For example, we can condition on $X_1, \dots, X_{15}$ and predict the $16$th letter:

$$p_{X_{16} \mid X_1, \dots, X_{15}}(\$ \mid \text{umulmundumulmum}) = \frac{1}{1} = 1$$

The choice of $n$ depends on many factors, but in general low values like $n = 2$, $n = 3$ or $n = 4$ are more popular. When we have a vocabulary of $k$ unique words, then there are $k^n$ possibilities to combine these words. Hence, higher values of $n$ require bigger corpora.
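To see the $k^n$ growth concretely, here is a tiny sketch (my own) that enumerates all possible $n$-grams for a vocabulary of $k = 3$ letters:

```python
from itertools import product

vocabulary = ["u", "m", "l"]  # k = 3 unique letters
for n in (1, 2, 3, 4):
    ngrams = list(product(vocabulary, repeat=n))
    print(n, len(ngrams))  # len(ngrams) == 3**n
```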

### Entropy

To calculate the $n$-th order entropy, one simply conditions on $n-1$ random variables:

$$H(X_n \mid X_1, \dots, X_{n-1}) = -\sum_{x_1, \dots, x_n \in \Sigma} p(x_1, \dots, x_n) \log_2 p(x_n \mid x_1, \dots, x_{n-1})$$

The definition for data compression is similar to the previous one, with contexts $w$ of length $k$:

$$H_k(T) = \frac{1}{n} \sum_{w \in \Sigma^k} |T_w| \, H_0(T_w)$$

Instead of manually calculating the $n$-th order entropy for all $n$, we can also use a suffix tree (see this article).

The following code constructs a compressed suffix tree by using the library sdsl-lite and then calculates the entropy for $0 \leq n \leq 10$:

Compile the code with

```shell
g++ -std=c++11 -O3 -DNDEBUG -I ~/include -L ~/lib entropy.cpp -o entropy -lsdsl -ldivsufsort -ldivsufsort64
```

We obtain the following result:

We can see that conditioning reduces entropy, thus $H_{k+1}(T) \leq H_k(T)$.
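The suffix-tree program is the efficient route for large texts; for a small text like our running example, $H_k(T)$ can also be cross-checked with a naive Python sketch (my own) that counts contexts directly:

```python
from collections import Counter
from math import log2

def h0(s):
    """0-th order empirical entropy of the string s."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values()) if n else 0.0

def hk(text, k):
    """k-th order empirical entropy: group the letters by their k-letter context."""
    followers = {}
    for i in range(len(text) - k):
        followers.setdefault(text[i:i + k], []).append(text[i + k])
    return sum(len(f) * h0("".join(f)) for f in followers.values()) / len(text)

text = "umulmundumulmum$"
for k in range(5):
    print(k, round(hk(text, k), 4))
```

On this text, the values decrease with $k$, consistent with $H_{k+1}(T) \leq H_k(T)$.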

### Entropy rate

The entropy rate is:

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n) = \lim_{n \to \infty} H(X_n \mid X_1, \dots, X_{n-1})$$

The equality follows from stationarity.

When $n$ is finite, calculating the entropy rate does not make a lot of sense: we saw before that for i.i.d. random variables and for a fixed $n$, it is just a (conditional) entropy.

However, when the stochastic process is not stationary, or when we take $n$-grams with $n \to \infty$, we can get different results (as long as the limit exists).

