# My personal blog

Machine learning, computer vision, languages

# Operations on contextual word embeddings

16 Jun 2021

A variety of operations can be performed on (contextual) word vectors. In this blog post, I will implement some common operations using PyTorch and Python.

First, we need to create the word embeddings. This is done in the section “introduction” using the transformers library. The following section “Operations” then deals with the topic of this blog post.

## Introduction

Word embeddings project high dimensional word vectors onto a low dimensional space. More mathematically, we could describe a word embedding as an injective function $$f : \{0, 1\}^n \to \mathbb{R}^k$$ where $$n \gg k$$. For example, GloVe (with Wikipedia 2014 + Gigaword 5) uses $$n = 4 \cdot 10^{5}$$ and $$k = 300$$.

There are mainly two types of word embeddings:

1. Regular word embeddings like SGNS (word2vec) or GloVe look at each word individually without considering the context. It was shown by [1] that this particular case can be viewed as factorizing a word-context matrix.

2. Sentence embeddings like BERT require the whole text as input. The same two words $$v, w \in \{0, 1\}^n$$ at different positions are mapped to different $$f(v), f(w) \in \mathbb{R}^k$$.

Since there are already good introductions to regular embeddings, we will only consider the case of contextual word embeddings here.

There are many transformer models but most of them use $$k = 768$$ for the output space. This is too big for doing quick tests. For this reason, we choose a more compact model called google/bert_uncased_L-12_H-128_A-2 with only $$k = 128$$ [2].

In addition to a transformer model, embeddings also require a large number of sentences to extract individual words. As a place holder, I came up with a few sentences and put them in a list sentences. This variable should be filled with real sentences.

Finally, we can look at some code. We want to do three things: encode each sentence, remove the padding “[PAD]” and assign a sentence ID to each token.

The output of the code is an $$m \times 128$$ embedding matrix $$X$$ that contains all $$m$$ words from each sentence. Each sentence in $$X$$ starts with the word [CLS] and ends with [SEP]. It is also possible to remove these two placeholder characters to save memory. Note that the variable sentences contains only three debug sentences, I am using a corpus of about 3000 sentences to get good results. Then $$m \gg 128$$ (more words than embedding dimensions).

Having created the embeddings $$X$$, we can look at some operations.

## Operations

The outputs tokens, sent_ids, X of the previous section are needed for this section. I am going to present the following operations:

• Dot product: compute the similarity between two words
• Linear combination: decompose a word into its subcomponents
• Isotropy: postprocess the embeddings to make the words more “spread out” in space

### Dot product

The first operation is defined as follows:

$a^Tb = ||a|| ||b|| \cos\theta \iff \frac{a^Tb}{||a|| ||b||} = \cos\theta$

This operation is also called cosine similarity. It measures the similarity between two words e.g. a = “cold” and b = “warm”.

The following code measures the similarity between the word “cold” (in a particular sentence) and all other words in the embedding matrix $$X$$ (all sentences).

An output for a text dataset with around 700000 tokens is as follows:

• Source sentence: “It was alone in the cold world of frost and snow.”
• Closest token to “cold”: “cold” (similarity 0.92475)
• Found sentence: “It is a very cold country, and very rocky; and there are a great many small islands all around it.”

Since words tend to occur more than once, the highest similarity is the word “cold” of another sentence.

For the next test, I removed all occurrences of “cold” and used another sentence.

• Source sentence: “Examples of the former case are giant molecular clouds, the coldest, densest phase of interstellar gas, which can form by the cooling and condensation of more diffuse gas.”
• Closest token to “cold”: “warm” (similarity 0.8814)
• Found sentence: “Paleontologists used to believe that dinosaurs lived only in the warmest parts of the world.”

### Linear combination

Writing a vector in terms of other vectors is called a linear combination or a superposition. We want to decompose a word into its subcomponents. More formally, we can write

$w = a_{1}v_{1} + a_2v_{2}+a_{3}v_{3}+\cdots +a_mv_{m}\,,$

where $$w, v_1, \dots, v_m \in \mathbb{R}^{128}$$ are words and $$a_1, \dots, a_m \in \mathbb{R}$$. For example: $$\text{winter} = 0.7\cdot\text{autumn} + 0.3\cdot\text{summer}$$. Note that the factors $$a_1, \dots, a_m$$ do not have to sum to $$1$$ (except if one enforces it).

The first step for finding the components is to apply the transpose operation on the embedding matrix $$X$$. Then the original $$m \times 128$$ matrix turns into a $$128 \times m$$ matrix. The objective is to multiply this $$X^T$$ matrix with an $$m \times 1$$ vector $$x$$ where most entries are zero (sparse):

$X^Tx = y$

The equation expresses the above $$0.7\cdot\text{autumn} + 0.3\cdot\text{summer} = \text{winter}$$ relationship in matrix form.

We are dealing here with an underdetermined system of equations because we have fewer equations than unknowns. Such a system tends to have an infinite number of solutions, that is, there is more than one way to decompose a word. For the same reason, linear regression will not work here. Instead the usual approach to find $$x$$ (or in other words $$a_1, \dots, a_m$$) is “sparse dictionary learning” [3, 4]. The paper [4] uses the FISTA algorithm for this purpose. This basically amounts to using regression with $$L_1$$ regularization which is also known as LASSO.

We simplify matters by only decomposing a single word into its subcomponents and apply regular LASSO regression.

The results are as follows:

\begin{aligned} \text{cold} &= 0.66\cdot\text{warm} + 0.23\cdot\text{heavy} + 0.11\cdot\text{hot}\\ \text{cold} &= 0.60\cdot\text{hot} + 0.30\cdot\text{cool} + 0.1\cdot\text{heavy}\\ \text{cold} &= 0.84\cdot\text{dry} + 0.16\cdot\text{warm}\\ \cdots \end{aligned}

Depending on the sentence, the same word “cold” has different components. In this example, “cold” consists of 2 or 3 components.

Vector addition and subtraction can correspond linguistically to word analogies. The typical example for an analogy is $$\text{king} - \text{man} + \text{woman} = \text{queen}$$ (see word2vec).

The implementation is quite simple. The following code finds the relevant words in the embedding matrix $$X$$ and then computes the dot product to get the word with the highest similarity:

The top-5 results to $$\text{king} - \text{man} + \text{woman}$$ are: queen, king, king, king, king. Another way of writing the above code as equation is:

$\arg\max_{v_i} \left(v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}\right)^Tv_i = v_\text{queen}$

We found that $$i = \text{queen}$$ for the top-1 result. However, it depends also on the sentence because “king” is geometrically not far from “queen”.

### Isotropy

While the last sections were focused on describing the relationship between words in the vector space, this section is about operations on the space itself.

When words are clustered together, it is harder to distinguish them one from another. Intuitively, words should be spread out across the whole space [5, 6, 7, 8, 9]. The desired property is called “isotropy”. A common theme in recent papers is to postprocess the word embeddings so that they are more isotropic.

The postprocessing methods only require changing the embedding matrix itself. Some important papers are the following:

• [5] applied principal component analysis (PCA) on the word embeddings and showed that some components have a stronger influence than others. Removing the leading components gave better results on common datasets. An implementation can be found here.
• [8] improved the previous results by softly filtering out components of the word vectors (no complete removal). An implementation can be found here.
• [9] proposed removing the variance from the embeddings (whitening). An implementation can be found here.

## Conclusion

The operations I introduced in this post mainly focus on things you do with vector spaces: calculate the angle (dot product), combine vectors (linear combination), add vectors (vector addition), or change the basis (e.g. PCA). However, there are also more general operations like clustering words or visualizing the dimensions. I might come back to this post in the future and add more general operations as well.

## References

[1] O. Levy and Y. Goldberg, “Neural Word Embedding as Implicit Matrix Factorization”, 2014.

[2] I. Turc, M.-W. Chang, K. Lee, K. Toutanova, “Well-Read Students Learn Better: On the Importance of Pre-training Compact Models”, 2019.

[3] J. Zhang, Y. Chen, B. Cheung, B. A. Olshausen, “Word Embedding Visualization Via Dictionary Learning”, 2021.

[4] Z. Yun, Y. Chen, B. A. Olshausen, Y. LeCun, “Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors”, 2021.

[5] J. Mu, P. Viswanath, “All-but-the-top: Simple and effective postprocessing for word representations”, 2018.

[6] X. Cai, J. Huang, Y. Bian, K. Church, “Isotropy in the contextual embedding space: Clusters and manifolds”, 2021.

[7] S. Arora, Y. Li, Y. Liang, T. Ma, A. Risteski, “A Latent Variable Model Approach to PMI-based Word Embeddings”, 2016.

[8] T. Liu, L. Ungar, J. Sedoc, “Unsupervised Post-processing of Word Vectors via Conceptor Negation”, 2018.

[9] J. Su, J. Cao, W. L, Y. Ou, “Whitening Sentence Representations for Better Semantics and Faster Retrieval”, 2021.