n-gram models find use in many areas of computer science, but they are often explained only in the context of natural language processing (NLP). In this post, I will show their relation to information theory by explicitly calculating the entropy and entropy rate of unigrams, bigrams and general $n$-grams.
$n$-grams split texts into contiguous sequences of $n$ words or letters and assign a probability to each sequence. When $n > 1$, we obtain Markov models of order $n - 1$.
In NLP, $n$-grams are used to predict the next letter/word based on the previous $n - 1$ letters/words. More formally, we can write

$$P(X_i = x_i \mid X_{i-1} = x_{i-1}, \dots, X_{i-n+1} = x_{i-n+1}) = \operatorname{Cat}(x_i; \theta)$$

where each $x_i$ indicates a letter, $n$ is a fixed parameter and $\theta$ parametrizes the categorical distribution. In practice, $\theta$ is not known and has to be estimated. We will use here the maximum likelihood estimate, which is simply the ratio of frequencies

$$\hat{\theta} = \frac{f(x_{i-n+1} \dots x_i)}{f(x_{i-n+1} \dots x_{i-1})}$$

where $f(\cdot)$ counts how often a sequence occurs in the text.
Example: Let the text be “abac”. Given that the last letter was “a”, what is the probability that the next letter is “b”?
The parameter is $n = 2$ and the input is $x_{i-1} = \text{“a”}$.
Then the probability is $P(X_i = \text{“b”} \mid X_{i-1} = \text{“a”}) = \frac{f(\text{“ab”})}{f(\text{“a”})} = \frac{1}{2}$.
This formal notation based on the categorical distribution can get a bit cumbersome. From now on, I will write the input to the PMFs simply as letters, e.g. $P(\text{“b”} \mid \text{“a”}) = \frac{1}{2}$.
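This frequency-ratio estimate is easy to sketch in a few lines of Python (a minimal sketch of mine; the function name `mle_prob` is not from any library):

```python
from collections import Counter

def mle_prob(text, context, letter):
    """Estimate P(letter | context) by the ratio of n-gram frequencies,
    where n = len(context) + 1."""
    n = len(context) + 1
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return ngrams[context + letter] / prefixes[context]

# The "abac" example: P("b" | "a") = f("ab") / f("a") = 1/2
print(mle_prob("abac", "a", "b"))  # 0.5
```

For the text “abac”, `mle_prob("abac", "a", "b")` returns `0.5`, matching the calculation above.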
Unigrams (also called $1$-grams or bag-of-words models) are a special case of $n$-grams where each sequence consists of a single word or letter. Unigrams depend only on the current letter.
A unigram makes the following approximation:

$$P(x_i \mid x_1, \dots, x_{i-1}) \approx P(x_i)$$

Hence, it is assumed that $X_1, \dots, X_N$ are all i.i.d. random variables.
In the example, we would calculate the probability by dividing the frequency of a single letter by the length of the text. The unigrams can be used to calculate the probability of certain letters occurring:

$$P(\text{“a”}) = \frac{f(\text{“a”})}{N} = \frac{2}{4}, \qquad P(\text{“b”}) = \frac{f(\text{“b”})}{N} = \frac{1}{4}$$
Note that i.i.d. implies stationarity.
The most common definition of entropy is the following:

$$H(X) = -\sum_{x \in \Sigma} P(x) \log_2 P(x)$$
For example, the entropy of the word “umulmundumulmum$” is:

$$H(X) = -\left(\frac{6}{16}\log_2\frac{6}{16} + \frac{5}{16}\log_2\frac{5}{16} + \frac{2}{16}\log_2\frac{2}{16} + 3 \cdot \frac{1}{16}\log_2\frac{1}{16}\right) \approx 2.18 \text{ bits}$$

since “u” occurs 6 times, “m” 5 times, “l” twice and “n”, “d”, “$” once each in the 16 letters.
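This calculation can be reproduced with a few lines of Python (my own sketch, estimating $P$ from the letter frequencies):

```python
from collections import Counter
from math import log2

def entropy(text):
    """H(X) = -sum over letters x of P(x) * log2(P(x)),
    with P estimated from the letter frequencies."""
    n = len(text)
    return -sum(c / n * log2(c / n) for c in Counter(text).values())

print(round(entropy("umulmundumulmum$"), 4))  # 2.18
```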
In data compression, there exists another definition called $k$-th order empirical entropy. For a unigram, the relevant case is $k = 0$:

$$H_0(T) = \sum_{c \in \Sigma} \frac{n_c}{n} \log_2 \frac{n}{n_c}$$

where $n$ is the length of the text $T$ and $n_c$ is the number of occurrences of the letter $c$ in $T$.
Empirical entropy makes no direct probabilistic assumptions and requires a text-like structure (i.e. one where frequency estimates make sense). $H(X)$ is in comparison more general, but one needs to specify the probability distribution.
In our example, the variables for $H_0(T)$ are $n = 16$, $n_{\text{u}} = 6$, $n_{\text{m}} = 5$, $n_{\text{l}} = 2$ and $n_{\text{n}} = n_{\text{d}} = n_{\text{\$}} = 1$.
Both definitions give the same result.
For a unigram, the entropy rate is defined as follows:

$$H(\mathcal{X}) = \lim_{N \to \infty} \frac{1}{N} H(X_1, \dots, X_N)$$

Since $X_1, \dots, X_N$ are i.i.d. random variables, it suffices to calculate the entropy of a single $X_i$:

$$H(\mathcal{X}) = H(X_1)$$
The same is true for our example. We can simply calculate the entropy of one of the random variables, i.e. $H(\mathcal{X}) = H(X_1) \approx 2.18$ bits.
In this section, we will look at bigrams, which are $n$-grams with $n = 2$. Bigrams depend on the previous letter.
A bigram makes the following approximation:

$$P(x_i \mid x_1, \dots, x_{i-1}) \approx P(x_i \mid x_{i-1})$$

Hence, bigrams satisfy the Markov property. We previously assumed i.i.d. random variables. Now, we need stationarity (also known as time-homogeneity in the context of Markov chains):

$$P(X_i = a \mid X_{i-1} = b) = P(X_{i+k} = a \mid X_{i+k-1} = b) \quad \text{for all } k$$
In the example, the frequencies were calculated by generating all contiguous sequences of two letters in the text:

| Bigram | um | mu | ul | lm | un | nd | du | m$ |
|--------|----|----|----|----|----|----|----|----|
| Frequency | 3 | 4 | 2 | 2 | 1 | 1 | 1 | 1 |

All sequences with the same initial letter in the table (e.g. “m” of “mu”) form a probability distribution.
Let us calculate the probabilities of some letters:

$$P(\text{“u”} \mid \text{“m”}) = \frac{f(\text{“mu”})}{f(\text{“m”})} = \frac{4}{5}, \qquad P(\text{“\$”} \mid \text{“m”}) = \frac{f(\text{“m\$”})}{f(\text{“m”})} = \frac{1}{5}$$
Compare this with the i.i.d. assumption from the unigram, where $P(\text{“u”}) = \frac{6}{16}$ regardless of context. By using bigrams, we can take into account the context in which a letter occurs. As a result, unigrams and bigrams assign different probabilities.
Since we satisfy the Markov property, it is easy to visualize the bigrams. Let us draw the Markov chain of the underlying stochastic process.
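A short Python sketch of mine that derives the transition probabilities from the bigram counts and emits Graphviz DOT (any graph tool would do here):

```python
from collections import Counter

text = "umulmundumulmum$"

# Count bigram frequencies and, per state, the number of outgoing transitions.
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
outgoing = Counter(text[:-1])

# Emit Graphviz DOT: one edge per transition, labeled with its probability.
edges = [f'  "{a}" -> "{b}" [label="{c}/{outgoing[a]}"];'
         for (a, b), c in sorted(bigrams.items())]
dot = "digraph markov {\n" + "\n".join(edges) + "\n}"
print(dot)
```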
The code will output the following graph:
When a Markov chain is time-homogeneous, irreducible and aperiodic, there exists a unique stationary distribution $\mu$. The chain in the graph is aperiodic because it contains cycles of coprime lengths, e.g. “u” → “m” → “u” of length 2 and “u” → “l” → “m” → “u” of length 3, so $\gcd(2, 3) = 1$. However, it is not yet irreducible. By adding a transition from “$” to “u”, we can also make it irreducible.
The first-order entropy (a conditional entropy) can be defined as follows:

$$H(X_2 \mid X_1) = -\sum_{x_1 \in \Sigma} P(x_1) \sum_{x_2 \in \Sigma} P(x_2 \mid x_1) \log_2 P(x_2 \mid x_1)$$

We can apply this formula on our example, taking $P(x_1)$ from the letter frequencies. Only “u” and “m” have more than one possible successor, so all other letters contribute zero:

$$H(X_2 \mid X_1) = -\frac{6}{16}\left(\frac{3}{6}\log_2\frac{3}{6} + \frac{2}{6}\log_2\frac{2}{6} + \frac{1}{6}\log_2\frac{1}{6}\right) - \frac{5}{16}\left(\frac{4}{5}\log_2\frac{4}{5} + \frac{1}{5}\log_2\frac{1}{5}\right) \approx 0.77 \text{ bits}$$
In data compression, the following definition is more common:

$$H_1(T) = \frac{1}{n} \sum_{c \in \Sigma} |T_c| \, H_0(T_c)$$

where $T_c$ is the concatenation of the letters that immediately follow the occurrences of $c$ in $T$. For example, when $c = \text{“m”}$, we have $T_{\text{m}} = \text{“uuuu\$”}$ and

$$H_0(T_{\text{m}}) = \frac{4}{5}\log_2\frac{5}{4} + \frac{1}{5}\log_2 5 \approx 0.72$$

Summing over all letters gives $H_1(T) = \frac{1}{16}\left(6 \cdot H_0(T_{\text{u}}) + 5 \cdot H_0(T_{\text{m}})\right) \approx 0.77$ bits.
Again both definitions give the same result.
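As a numerical check, $H_1(T)$ can be computed directly (a Python sketch of mine):

```python
from collections import Counter, defaultdict
from math import log2

text = "umulmundumulmum$"

# Collect the successors of each letter: contexts["m"] = "uuuu$", etc.
contexts = defaultdict(str)
for a, b in zip(text, text[1:]):
    contexts[a] += b

def h0(s):
    """Zeroth order empirical entropy of a string."""
    return -sum(c / len(s) * log2(c / len(s)) for c in Counter(s).values())

# First order empirical entropy: length-weighted average of H0 over contexts.
h1 = sum(len(s) * h0(s) for s in contexts.values()) / len(text)
print(round(h1, 4))  # 0.7728
```

This prints `0.7728`, the same ≈0.77 bits as the conditional entropy.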
The entropy rate of a bigram is:

$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log_2 P_{ij}$$

where $P$ is the transition matrix of the time-homogeneous Markov chain and $\mu$ is the corresponding stationary distribution.
Since the stochastic process is a Markov chain, the entropy rate is again just the (conditional) entropy. In the example, $H(\mathcal{X}) \approx 0.77$ bits.
To verify this result, let us add the transition “$” → “u” to our Markov chain. Now, $\mu$ can be calculated:
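A sketch with numpy (the state order and variable names are my choice; the added “$” → “u” transition is made explicit):

```python
import numpy as np

# States in a fixed order; P holds the bigram transition probabilities.
states = ["u", "m", "l", "n", "d", "$"]
P = np.array([
    [0,   3/6, 2/6, 1/6, 0,   0  ],  # u -> m, l, n
    [4/5, 0,   0,   0,   0,   1/5],  # m -> u, $
    [0,   1,   0,   0,   0,   0  ],  # l -> m
    [0,   0,   0,   0,   1,   0  ],  # n -> d
    [1,   0,   0,   0,   0,   0  ],  # d -> u
    [1,   0,   0,   0,   0,   0  ],  # $ -> u (the added transition)
])

# Left eigenvectors of P are right eigenvectors of P.T. The eigenvalue
# with the largest real part is 1, since P is a stochastic matrix.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()  # normalize so the entries sum to 1
print(dict(zip(states, mu.round(4))))

# Entropy rate: H = -sum_ij mu_i * P_ij * log2(P_ij) over nonzero P_ij.
H = -sum(mu[i] * P[i, j] * np.log2(P[i, j])
         for i in range(len(states)) for j in range(len(states)) if P[i, j] > 0)
print(round(H, 4))  # 0.7728
```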
The code calculates the left eigenvectors and chooses the one with the highest eigenvalue as $\mu$. This highest eigenvalue will be $1$ because the matrix is stochastic.
Note that $\mu_{\text{u}} = \frac{6}{16}$ and $\mu_{\text{m}} = \frac{5}{16}$: the stationary distribution matches the letter frequencies we used in the previous entropy calculation. Hence, we found the same result.
Before, we only considered the cases of $n$-grams with $n = 1$ or $n = 2$, but $n$ can be any number:

$$P(x_i \mid x_1, \dots, x_{i-1}) \approx P(x_i \mid x_{i-n+1}, \dots, x_{i-1})$$

For example, with $n = 3$ we can condition on $x_{i-2}$ and $x_{i-1}$ and predict the $i$-th letter:

$$P(x_i \mid x_{i-2}, x_{i-1})$$
The choice of $n$ depends on many factors, but in general low values like $n = 1$, $2$ or $3$ are more popular. When we have a vocabulary of $|V|$ unique words, there are $|V|^n$ possible combinations of these words. Hence, higher values of $n$ require bigger corpora.
To calculate the $k$-th order entropy, one simply conditions on $k$ random variables:

$$H(X_{k+1} \mid X_1, \dots, X_k) = -\sum_{x_1, \dots, x_{k+1}} P(x_1, \dots, x_{k+1}) \log_2 P(x_{k+1} \mid x_1, \dots, x_k)$$

The definition for data compression is similar to the previous one:

$$H_k(T) = \frac{1}{n} \sum_{w \in \Sigma^k} |T_w| \, H_0(T_w)$$

where $T_w$ is now the concatenation of the letters that follow the occurrences of the length-$k$ context $w$ in $T$.
Instead of manually calculating the $k$-th order entropy for all $k$, we can also use a suffix tree (see this article).
The following code constructs a compressed suffix tree using the library sdsl-lite and then calculates the entropy for increasing $k$:
Compile the code with
g++ -std=c++11 -O3 -DNDEBUG -I ~/include -L ~/lib entropy.cpp -o entropy -lsdsl -ldivsufsort -ldivsufsort64
We obtain as result:
We can see that conditioning reduces entropy, thus $H_0(T) \ge H_1(T) \ge H_2(T) \ge \dots$
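As a sanity check, the $k$-th order empirical entropies of the example text can also be computed directly in plain Python (my own sketch, independent of the sdsl-lite code):

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zeroth order empirical entropy of a string s."""
    return -sum(c / len(s) * log2(c / len(s)) for c in Counter(s).values())

def hk(text, k):
    """k-th order empirical entropy H_k(T) = (1/n) * sum_w |T_w| * H0(T_w)."""
    contexts = defaultdict(str)
    for i in range(len(text) - k):
        contexts[text[i:i + k]] += text[i + k]  # letter following context w
    return sum(len(s) * h0(s) for s in contexts.values()) / len(text)

text = "umulmundumulmum$"
for k in range(5):
    print(f"H_{k} = {hk(text, k):.4f}")
```

For “umulmundumulmum$” this yields $H_0 \approx 2.18$, $H_1 \approx 0.77$, $H_2 \approx 0.55$ and $H_3 = H_4 = 0.125$, confirming the monotone decrease.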
The entropy rate is:

$$H(\mathcal{X}) = \lim_{N \to \infty} \frac{1}{N} H(X_1, \dots, X_N) = \lim_{N \to \infty} H(X_N \mid X_{N-1}, \dots, X_1)$$

The second equality follows from stationarity.
When $n$ is finite, calculating the entropy rate does not add much: we saw before that for i.i.d. random variables and for stationary Markov chains it reduces to the (conditional) entropy.
However, when the stochastic process is not stationary (e.g. when the transition probabilities change over time), or when we take $n$-grams with $n \to \infty$, we can get different results (as long as the limit exists).
T. M. Cover and J. A. Thomas, *Elements of Information Theory*, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.